AI Email Organizer Reddit

AI Email Organizer Reddit — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Self-supervised learning

    Self-supervised learning

    Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples, where one sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects. During SSL, the model learns in two steps. First, the task is solved based on an auxiliary or pretext classification task using pseudo-labels, which help to initialize the model parameters. Next, the actual task is performed with supervised or unsupervised learning. Self-supervised learning has produced promising results in recent years, and has found practical application in fields such as audio processing, and is being used by Facebook and others for speech recognition. == Pseudo-labels == Pseudo-labels are automatically generated labels that a model assigns to unlabeled data based on its own predictions. They are widely used in self-supervised and semi-supervised learning, where ground-truth annotations are limited or unavailable. By treating predicted labels as surrogate ground truth, learning algorithms can make use of large quantities of unlabeled data in the training process. Pseudo-labeling also plays an important role in systems that must adapt to concept drift, where the statistical properties of the data change over time. In these scenarios, the model may detect that an incoming instance deviates from previously learned behavior. The system then generates a classification result for that instance, and this predicted class is used as a pseudo-label for updating or retraining model components that are becoming outdated. This approach enables continuous adaptation in dynamic environments without requiring manual annotation. In many adaptive learning pipelines, pseudo-labels are chosen when the classifier produces sufficiently confident predictions, reducing the risk of propagating errors. These pseudo-labeled instances are then incorporated into training to refresh or evolve the model's understanding of emerging data patterns, particularly when existing components show signs of “aging” due to drift or distributional shifts. This strategy reduces reliance on manual labeling while helping maintain long-term model performance. == Types == === Autoassociative self-supervised learning === Autoassociative self-supervised learning is a specific category of self-supervised learning where a neural network is trained to reproduce or reconstruct its own input data. In other words, the model is tasked with learning a representation of the data that captures its essential features or structure, allowing it to regenerate the original input. The term "autoassociative" comes from the fact that the model is essentially associating the input data with itself. This is often achieved using autoencoders, which are a type of neural network architecture used for representation learning. Autoencoders consist of an encoder network that maps the input data to a lower-dimensional representation (latent space), and a decoder network that reconstructs the input from this representation. The training process involves presenting the model with input data and requiring it to reconstruct the same data as closely as possible. The loss function used during training typically penalizes the difference between the original input and the reconstructed output (e.g. mean squared error). By minimizing this reconstruction error, the autoencoder learns a meaningful representation of the data in its latent space. === Contrastive self-supervised learning === For a binary classification task, training data can be divided into positive examples and negative examples. Positive examples are those that match the target. For example, if training a classifier to identify birds, the positive training data would include images that contain birds. Negative examples would be images that do not. Contrastive self-supervised learning uses both positive and negative examples. The loss function in contrastive learning is used to minimize the distance between positive sample pairs, while maximizing the distance between negative sample pairs. An early example uses a pair of 1-dimensional convolutional neural networks to process a pair of images and maximize their agreement. Contrastive Language-Image Pre-training (CLIP) allows joint pretraining of a text encoder and an image encoder, such that a matching image-text pair have image encoding vector and text encoding vector that span a small angle (having a large cosine similarity). InfoNCE (Noise-Contrastive Estimation) is a method to optimize two models jointly, based on Noise Contrastive Estimation (NCE). Given a set X = { x 1 , … x N } {\displaystyle X=\left\{x_{1},\ldots x_{N}\right\}} of N {\displaystyle N} random samples containing one positive sample from p ( x t + k ∣ c t ) {\displaystyle p\left(x_{t+k}\mid c_{t}\right)} and N − 1 {\displaystyle N-1} negative samples from the 'proposal' distribution p ( x t + k ) {\displaystyle p\left(x_{t+k}\right)} , it minimizes the following loss function: L N = − E X [ log ⁡ f k ( x t + k , c t ) ∑ x j ∈ X f k ( x j , c t ) ] {\displaystyle {\mathcal {L}}_{\mathrm {N} }=-\mathbb {E} _{X}\left[\log {\frac {f_{k}\left(x_{t+k},c_{t}\right)}{\sum _{x_{j}\in X}f_{k}\left(x_{j},c_{t}\right)}}\right]} === Non-contrastive self-supervised learning === Non-contrastive self-supervised learning (NCSSL) uses only positive examples. Counterintuitively, NCSSL converges on a useful local minimum rather than reaching a trivial solution, with zero loss. For the example of binary classification, it would trivially learn to classify each example as positive. Effective NCSSL requires an extra predictor on the online side that does not back-propagate on the target side. === Joint-Embedding and Predictive Architectures === A major class of self-supervised learning moves beyond contrastive pairs, instead maximizing the agreement between views while preventing collapse through statistical constraints. Rooted in Deep Canonical Correlation Analysis (Deep CCA), this approach includes Joint-Embedding Architectures (JEA) like Barlow Twins and VICReg, which enforce covariance constraints to learn invariant representations without negative sampling. Deep Latent Variable Path Modelling (DLVPM) generalizes this to multimodal systems, using path models to enforce correlation and orthogonality across diverse data types. In 2022 Yann LeCun introduced Joint-Embedding Predictive Architectures (JEPA) as a step towards decision making, reasoning, and autonomous human intelligence in machines, including self-improvement through autonomous learning. Founded in representation learning, LeCun included the concept of a “world model” in JEPA which aims to enable machines to replicate human intellect by providing machines with a concept for the world in which they exist. Unlike autoencoders, JEPAs operate entirely in latent space, avoiding pixel-level noise to focus on semantic structure. Rather than just learning invariance, JEPAs learn by predicting masked latent representations from visible context. JEPA has been applied to domains such as image analysis, audio processing, and motion in images and video. == Comparison with other forms of machine learning == SSL belongs to supervised learning methods insofar as the goal is to generate a classified output from the input. At the same time, however, it does not require the explicit use of labeled input-output pairs. Instead, correlations, metadata embedded in the data, or domain knowledge present in the input are implicitly and autonomously extracted from the data. These supervisory signals, extracted from the data, can then be used for training. SSL is similar to unsupervised learning in that it does not require labels in the sample data. Unlike unsupervised learning, however, learning is not done using inherent data structures. Semi-supervised learning combines supervised and unsupervised learning, requiring only a small portion of the learning data be labeled. In transfer learning, a model designed for one task is reused on a different task. Training an autoencoder intrinsically constitutes a self-supervised process, because the output pattern needs to become an optimal reconstruction of the input pattern itself. However, in current jargon, the term 'self-supervised' often refers to tasks based on a pretext-task training setup

    Read more →
  • Tanagra (machine learning)

    Tanagra (machine learning)

    Tanagra is a free suite of machine learning software for research and academic purposes developed by Ricco Rakotomalala at the Lumière University Lyon 2, France. Tanagra supports several standard data mining tasks such as: Visualization, Descriptive statistics, Instance selection, feature selection, feature construction, regression, factor analysis, clustering, classification and association rule learning. Tanagra is an academic project. It is widely used in French-speaking universities. Tanagra is frequently used in real studies and in software comparison papers. == History == The development of Tanagra was started in June 2003. The first version was distributed in December 2003. Tanagra is the successor of Sipina, another free data mining tool which is intended only for supervised learning tasks (classification), especially the interactive and visual construction of decision trees. Sipina is still available online and is maintained. Tanagra is an "open source project" as every researcher can access the source code and add their own algorithms, as long as they agree and conform to the software distribution license. The main purpose of the Tanagra project is to give researchers and students a user-friendly data mining software, conforming to the present norms of the software development in this domain (especially in the design of its GUI and the way to use it), and allowing the analyzation of either real or synthetic data. From 2006, Ricco Rakotomalala made an important documentation effort. A large number of tutorials are published on a dedicated website. They describe the statistical and machine learning methods and their implementation with Tanagra on real case studies. The use of other free data mining tools on the same problems is also widely described. The comparison of the tools enables readers to understand the possible differences in the presentation of results. == Description == Tanagra works similarly to current data mining tools. The user can design visually a data mining process in a diagram. Each node is a statistical or machine learning technique, the connection between two nodes represents the data transfer. But unlike the majority of tools which are based on the workflow paradigm, Tanagra is very simplified. The treatments are represented in a tree diagram. The results are displayed in an HTML format. This makes it is easy to export the outputs in order to visualize the results in a browser. It is also possible to copy the result tables to a spreadsheet. Tanagra makes a good compromise between statistical approaches (e.g. parametric and nonparametric statistical tests), multivariate analysis methods (e.g. factor analysis, correspondence analysis, cluster analysis, regression) and machine learning techniques (e.g. neural network, support vector machine, decision trees, random forest).

    Read more →
  • Variable kernel density estimation

    Variable kernel density estimation

    In statistics, adaptive or "variable-bandwidth" kernel density estimation is a form of kernel density estimation in which the size of the kernels used in the estimate are varied depending upon either the location of the samples or the location of the test point. It is a particularly effective technique when the sample space is multi-dimensional. == Rationale == Given a set of samples, { x → i } {\displaystyle \lbrace {\vec {x}}_{i}\rbrace } , we wish to estimate the density, P ( x → ) {\displaystyle P({\vec {x}})} , at a test point, x → {\displaystyle {\vec {x}}} : P ( x → ) ≈ W n h D {\displaystyle P({\vec {x}})\approx {\frac {W}{nh^{D}}}} W = ∑ i = 1 n w i {\displaystyle W=\sum _{i=1}^{n}w_{i}} w i = K ( x → − x → i h ) {\displaystyle w_{i}=K\left({\frac {{\vec {x}}-{\vec {x}}_{i}}{h}}\right)} where n is the number of samples, K is the "kernel", h is its width and D is the number of dimensions in x → {\displaystyle {\vec {x}}} . The kernel can be thought of as a simple, linear filter. Using a fixed filter width may mean that in regions of low density, all samples will fall in the tails of the filter with very low weighting, while regions of high density will find an excessive number of samples in the central region with weighting close to unity. To fix this problem, we vary the width of the kernel in different regions of the sample space. There are two methods of doing this: balloon and pointwise estimation. In a balloon estimator, the kernel width is varied depending on the location of the test point. In a pointwise estimator, the kernel width is varied depending on the location of the sample. For multivariate estimators, the parameter, h, can be generalized to vary not just the size, but also the shape of the kernel. This more complicated approach will not be covered here. == Balloon estimators == A common method of varying the kernel width is to make it inversely proportional to the density at the test point: h = k [ n P ( x → ) ] 1 / D {\displaystyle h={\frac {k}{\left[nP({\vec {x}})\right]^{1/D}}}} where k is a constant. If we back-substitute the estimated PDF, and assuming a Gaussian kernel function, we can show that W is a constant: W = k D ( 2 π ) D / 2 {\displaystyle W=k^{D}(2\pi )^{D/2}} A similar derivation holds for any kernel whose normalising function is of the order hD, although with a different constant factor in place of the (2 π)D/2 term. This produces a generalization of the k-nearest neighbour algorithm. That is, a uniform kernel function will return the KNN technique. There are two components to the error: a variance term and a bias term. The variance term is given as: e 1 = P ∫ K 2 n h D {\displaystyle e_{1}={\frac {P\int K^{2}}{nh^{D}}}} . The bias term is found by evaluating the approximated function in the limit as the kernel width becomes much larger than the sample spacing. By using a Taylor expansion for the real function, the bias term drops out: e 2 = h 2 n ∇ 2 P {\displaystyle e_{2}={\frac {h^{2}}{n}}\nabla ^{2}P} An optimal kernel width that minimizes the error of each estimate can thus be derived. == Use for statistical classification == The method is particularly effective when applied to statistical classification. There are two ways we can proceed: the first is to compute the PDFs of each class separately, using different bandwidth parameters, and then compare them as in Taylor. Alternatively, we can divide up the sum based on the class of each sample: P ( j , x → ) ≈ 1 n ∑ i = 1 , c i = j n w i {\displaystyle P(j,{\vec {x}})\approx {\frac {1}{n}}\sum _{i=1,c_{i}=j}^{n}w_{i}} where ci is the class of the ith sample. The class of the test point may be estimated through maximum likelihood.

    Read more →
  • Evolutionary algorithm

    Evolutionary algorithm

    Evolutionary algorithms (EA) reproduce essential elements of biological evolution in a computer algorithm in order to solve "difficult" problems, at least approximately, for which no exact or satisfactory solution methods are known. They are metaheuristics and population-based bio-inspired algorithms and evolutionary computation, which itself are part of the field of computational intelligence. The mechanisms of biological evolution that an EA mainly imitates are reproduction, mutation, recombination and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions (see also loss function). Evolution of the population then takes place after the repeated application of the above operators. Evolutionary algorithms often perform well approximating solutions to all types of problems because they ideally do not make any assumption about the underlying fitness landscape. Techniques from evolutionary algorithms applied to the modeling of biological evolution are generally limited to explorations of microevolution (microevolutionary processes) and planning models based upon cellular processes. In most real applications of EAs, computational complexity is a prohibiting factor. In fact, this computational complexity is due to fitness function evaluation. Fitness approximation is one of the solutions to overcome this difficulty. However, seemingly simple EA can solve often complex problems; therefore, there may be no direct link between algorithm complexity and problem complexity. == Generic definition == The following is an example of a generic evolutionary algorithm: Randomly generate the initial population of individuals, the first generation. Evaluate the fitness of each individual in the population. Check, if the goal is reached and the algorithm can be terminated. Select individuals as parents, preferably of higher fitness. Produce offspring with optional crossover (mimicking reproduction). Apply mutation operations on the offspring. Select individuals preferably of lower fitness for replacement with new individuals (mimicking natural selection). Return to 2 == Types == Similar techniques differ in genetic representation and other implementation details, and the nature of the particular applied problem. Genetic algorithm – This is the most popular type of EA. One seeks the solution of a problem in the form of strings of numbers (traditionally binary, although the best representations are usually those that reflect something about the problem being solved), by applying operators such as recombination and mutation (sometimes one, sometimes both). This type of EA is often used in optimization problems. Genetic programming – Here the solutions are in the form of computer programs, and their fitness is determined by their ability to solve a computational problem. There are many variants of Genetic Programming: Cartesian genetic programming Gene expression programming Grammatical evolution Linear genetic programming Multi expression programming Evolutionary programming – Similar to evolution strategy, but with a deterministic selection of all parents. Evolution strategy (ES) – Works with vectors of real numbers as representations of solutions, and typically uses self-adaptive mutation rates. The method is mainly used for numerical optimization, although there are also variants for combinatorial tasks. CMA-ES Natural evolution strategy Differential evolution – Based on vector differences and is therefore primarily suited for numerical optimization problems. Coevolutionary algorithm – Similar to genetic algorithms and evolution strategies, but the created solutions are compared on the basis of their outcomes from interactions with other solutions. Solutions can either compete or cooperate during the search process. Coevolutionary algorithms are often used in scenarios where the fitness landscape is dynamic, complex, or involves competitive interactions. Neuroevolution – Similar to genetic programming but the genomes represent artificial neural networks by describing structure and connection weights. The genome encoding can be direct or indirect. Learning classifier system – Here the solution is a set of classifiers (rules or conditions). A Michigan-LCS evolves at the level of individual classifiers whereas a Pittsburgh-LCS uses populations of classifier-sets. Initially, classifiers were only binary, but now include real, neural net, or S-expression types. Fitness is typically determined with either a strength or accuracy based reinforcement learning or supervised learning approach. Quality–Diversity algorithms – QD algorithms simultaneously aim for high-quality and diverse solutions. Unlike traditional optimization algorithms that solely focus on finding the best solution to a problem, QD algorithms explore a wide variety of solutions across a problem space and keep those that are not just high performing, but also diverse and unique. == Theoretical background == The following theoretical principles apply to all or almost all EAs. === No free lunch theorem === The no free lunch theorem of optimization states that all optimization strategies are equally effective when the set of all optimization problems is considered. Under the same condition, no evolutionary algorithm is fundamentally better than another. This can only be the case if the set of all problems is restricted. This is exactly what is inevitably done in practice. Therefore, to improve an EA, it must exploit problem knowledge in some form (e.g. by choosing a certain mutation strength or a problem-adapted coding). Thus, if two EAs are compared, this constraint is implied. In addition, an EA can use problem specific knowledge by, for example, not randomly generating the entire start population, but creating some individuals through heuristics or other procedures. Another possibility to tailor an EA to a given problem domain is to involve suitable heuristics, local search procedures or other problem-related procedures in the process of generating the offspring. This form of extension of an EA is also known as a memetic algorithm. Both extensions play a major role in practical applications, as they can speed up the search process and make it more robust. === Convergence === For EAs in which, in addition to the offspring, at least the best individual of the parent generation is used to form the subsequent generation (so-called elitist EAs), there is a general proof of convergence under the condition that an optimum exists. Without loss of generality, a maximum search is assumed for the proof: From the property of elitist offspring acceptance and the existence of the optimum it follows that per generation k {\displaystyle k} an improvement of the fitness F {\displaystyle F} of the respective best individual x ′ {\displaystyle x'} will occur with a probability P > 0 {\displaystyle P>0} . Thus: F ( x 1 ′ ) ≤ F ( x 2 ′ ) ≤ F ( x 3 ′ ) ≤ ⋯ ≤ F ( x k ′ ) ≤ ⋯ {\displaystyle F(x'_{1})\leq F(x'_{2})\leq F(x'_{3})\leq \cdots \leq F(x'_{k})\leq \cdots } I.e., the fitness values represent a monotonically non-decreasing sequence, which is bounded due to the existence of the optimum. From this follows the convergence of the sequence against the optimum. Since the proof makes no statement about the speed of convergence, it is of little help in practical applications of EAs. But it does justify the recommendation to use elitist EAs. However, when using the usual panmictic population model, elitist EAs tend to converge prematurely more than non-elitist ones. In a panmictic population model, mate selection (see step 4 of the generic definition) is such that every individual in the entire population is eligible as a mate. In non-panmictic populations, selection is suitably restricted, so that the dispersal speed of better individuals is reduced compared to panmictic ones. Thus, the general risk of premature convergence of elitist EAs can be significantly reduced by suitable population models that restrict mate selection. === Virtual alphabets === With the theory of virtual alphabets, David E. Goldberg showed in 1990 that by using a representation with real numbers, an EA that uses classical recombination operators (e.g. uniform or n-point crossover) cannot reach certain areas of the search space, in contrast to a coding with binary numbers. This results in the recommendation for EAs with real representation to use arithmetic operators for recombination (e.g. arithmetic mean or intermediate recombination). With suitable operators, real-valued representations are more effective than binary ones, contrary to earlier opinion. == Comparison to other concepts == === Biological processes === A possible limitation of many evolutionary algorithms is their lack of a clear genotype–phenotype distinction. In nature, the fertilized egg cell undergoes a complex process known as embryogenesis to become a mature p

    Read more →
  • ChromaDB

    ChromaDB

    Chroma or ChromaDB is open-source data infrastructure tailored to applications with large language models. Its headquarters are in San Francisco. In April 2023, it raised 18 million US dollars as seed funding. ChromaDB has been used in academic studies on artificial intelligence, particularly as part of the tech stack for retrieval-augmented generation.

    Read more →
  • Autoencoder

    Autoencoder

    An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms. Variants exist which aim to make the learned representations assume useful properties. Examples are regularized autoencoders (sparse, denoising and contractive autoencoders), which are effective in learning representations for subsequent classification tasks, and variational autoencoders, which can be used as generative models. Autoencoders are applied to many problems, including facial recognition, feature detection, anomaly detection, and learning the meaning of words. In terms of data synthesis, autoencoders can also be used to randomly generate new data that is similar to the input (training) data. == Mathematical principles == === Definition === An autoencoder is defined by the following components: Two sets: the space of encoded messages Z {\displaystyle {\mathcal {Z}}} ; the space of decoded messages X {\displaystyle {\mathcal {X}}} . Typically X {\displaystyle {\mathcal {X}}} and Z {\displaystyle {\mathcal {Z}}} are Euclidean spaces, that is, X = R m , Z = R n {\displaystyle {\mathcal {X}}=\mathbb {R} ^{m},{\mathcal {Z}}=\mathbb {R} ^{n}} with m > n . {\displaystyle m>n.} Two parametrized families of functions: the encoder family E ϕ : X → Z {\displaystyle E_{\phi }:{\mathcal {X}}\rightarrow {\mathcal {Z}}} , parametrized by ϕ {\displaystyle \phi } ; the decoder family D θ : Z → X {\displaystyle D_{\theta }:{\mathcal {Z}}\rightarrow {\mathcal {X}}} , parametrized by θ {\displaystyle \theta } .For any x ∈ X {\displaystyle x\in {\mathcal {X}}} , we usually write z = E ϕ ( x ) {\displaystyle z=E_{\phi }(x)} , and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any z ∈ Z {\displaystyle z\in {\mathcal {Z}}} , we usually write x ′ = D θ ( z ) {\displaystyle x'=D_{\theta }(z)} , and refer to it as the (decoded) message. Usually, both the encoder and the decoder are defined as multilayer perceptrons (MLPs). For example, a one-layer-MLP encoder E ϕ {\displaystyle E_{\phi }} is: E ϕ ( x ) = σ ( W x + b ) {\displaystyle E_{\phi }(\mathbf {x} )=\sigma (Wx+b)} where σ {\displaystyle \sigma } is an element-wise activation function, W {\displaystyle W} is a "weight" matrix, and b {\displaystyle b} is a "bias" vector. === Training an autoencoder === An autoencoder, by itself, is simply a tuple of two functions. To judge its quality, we need a task. A task is defined by a reference probability distribution μ r e f {\displaystyle \mu _{ref}} over X {\displaystyle {\mathcal {X}}} , and a "reconstruction quality" function d : X × X → [ 0 , ∞ ] {\displaystyle d:{\mathcal {X}}\times {\mathcal {X}}\to [0,\infty ]} , such that d ( x , x ′ ) {\displaystyle d(x,x')} measures how much x ′ {\displaystyle x'} differs from x {\displaystyle x} . With those, we can define the loss function for the autoencoder as L ( θ , ϕ ) := E x ∼ μ r e f [ d ( x , D θ ( E ϕ ( x ) ) ) ] {\displaystyle L(\theta ,\phi ):=\mathbb {\mathbb {E} } _{x\sim \mu _{ref}}[d(x,D_{\theta }(E_{\phi }(x)))]} The optimal autoencoder for the given task ( μ r e f , d ) {\displaystyle (\mu _{ref},d)} is then arg ⁡ min θ , ϕ L ( θ , ϕ ) {\displaystyle \arg \min _{\theta ,\phi }L(\theta ,\phi )} . The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by gradient descent. This search process is referred to as "training the autoencoder". In most situations, the reference distribution is just the empirical distribution given by a dataset { x 1 , . . . , x N } ⊂ X {\displaystyle \{x_{1},...,x_{N}\}\subset {\mathcal {X}}} , so that μ r e f = 1 N ∑ i = 1 N δ x i {\displaystyle \mu _{ref}={\frac {1}{N}}\sum _{i=1}^{N}\delta _{x_{i}}} where δ x i {\displaystyle \delta _{x_{i}}} is the Dirac measure, the quality function is just L 2 {\displaystyle L^{2}} loss: d ( x , x ′ ) = ‖ x − x ′ ‖ 2 2 {\displaystyle d(x,x')=\|x-x'\|_{2}^{2}} , and ‖ ⋅ ‖ 2 {\displaystyle \|\cdot \|_{2}} is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a least-squares optimization: min θ , ϕ L ( θ , ϕ ) , where L ( θ , ϕ ) = 1 N ∑ i = 1 N ‖ x i − D θ ( E ϕ ( x i ) ) ‖ 2 2 {\displaystyle \min _{\theta ,\phi }L(\theta ,\phi ),\qquad {\text{where }}L(\theta ,\phi )={\frac {1}{N}}\sum _{i=1}^{N}\|x_{i}-D_{\theta }(E_{\phi }(x_{i}))\|_{2}^{2}} === Interpretation === An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function d {\displaystyle d} . The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space Z {\displaystyle {\mathcal {Z}}} usually has fewer dimensions than the message space X {\displaystyle {\mathcal {X}}} . Such an autoencoder is called undercomplete. It can be interpreted as compressing the message, or reducing its dimensionality. At the limit of an ideal undercomplete autoencoder, every possible code z {\displaystyle z} in the code space is used to encode a message x {\displaystyle x} that really appears in the distribution μ r e f {\displaystyle \mu _{ref}} , and the decoder is also perfect: D θ ( E ϕ ( x ) ) = x {\displaystyle D_{\theta }(E_{\phi }(x))=x} . This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code z {\displaystyle z} and obtaining D θ ( z ) {\displaystyle D_{\theta }(z)} , which is a message that really appears in the distribution μ r e f {\displaystyle \mu _{ref}} . If the code space Z {\displaystyle {\mathcal {Z}}} has dimension larger than (overcomplete), or equal to, the message space X {\displaystyle {\mathcal {X}}} , or the hidden units are given enough capacity, an autoencoder can learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features. In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below. == Variations == === Variational autoencoder (VAE) === Variational autoencoders (VAEs) belong to the families of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs are architected with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors. Given an input dataset x {\displaystyle x} characterized by an unknown probability function P ( x ) {\displaystyle P(x)} and a multivariate latent encoding vector z {\displaystyle z} , the objective is to model the data as a distribution p θ ( x ) {\displaystyle p_{\theta }(x)} , with θ {\displaystyle \theta } defined as the set of the network parameters so that p θ ( x ) = ∫ z p θ ( x , z ) d z {\displaystyle p_{\theta }(x)=\int _{z}p_{\theta }(x,z)dz} . === Sparse autoencoder (SAE) === Inspired by the sparse coding hypothesis in neuroscience, sparse autoencoders (SAE) are variants of autoencoders, such that the codes E ϕ ( x ) {\displaystyle E_{\phi }(x)} for messages tend to be sparse codes, that is, E ϕ ( x ) {\displaystyle E_{\phi }(x)} is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time. Encouraging sparsity improves performance on classification tasks. There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the k-sparse autoencoder. The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder: f k ( x 1 , . . . , x n ) = ( x 1 b 1 , . . . , x n b n ) {\displaystyle f_{k}(x_{1},...,x_{n})=(x_{1}b_{1},...,x_{n}b_{n})} where b i = 1 {\displaystyle b_{i}=1} if | x i | {\displaystyle |x_{i}|} ranks in the top k, and 0 otherwise. Backpropagating through f k {\displaystyle f_{k}} is simple: set gradient to 0 for b i = 0 {\displaystyle b_{i}=0} entries, and keep gradient for b i = 1 {\displaystyle b_{i}=1} entries. This is essentially a generalized ReLU function. The other way is a relaxed version of the k-

    Read more →
  • Natarajan dimension

    Natarajan dimension

    In the theory of Probably Approximately Correct Machine Learning, the Natarajan dimension characterizes the complexity of learning a set of functions, generalizing from the Vapnik–Chervonenkis dimension for boolean functions to multi-class functions. Originally introduced as the Generalized Dimension by Natarajan, it was subsequently renamed the Natarajan Dimension by Haussler and Long. == Definition == Let H {\displaystyle H} be a set of functions from a set X {\displaystyle X} to a set Y {\displaystyle Y} . H {\displaystyle H} shatters a set C ⊂ X {\displaystyle C\subset X} if there exist two functions f 0 , f 1 ∈ H {\displaystyle f_{0},f_{1}\in H} such that For every x ∈ C , f 0 ( x ) ≠ f 1 ( x ) {\displaystyle x\in C,f_{0}(x)\neq f_{1}(x)} . For every B ⊂ C {\displaystyle B\subset C} , there exists a function h ∈ H {\displaystyle h\in H} such that for all x ∈ B , h ( x ) = f 0 ( x ) {\displaystyle x\in B,h(x)=f_{0}(x)} and for all x ∈ C − B , h ( x ) = f 1 ( x ) {\displaystyle x\in C-B,h(x)=f_{1}(x)} . The Natarajan dimension of H is the maximal cardinality of a set shattered by H {\displaystyle H} . It is easy to see that if | Y | = 2 {\displaystyle |Y|=2} , the Natarajan dimension collapses to the Vapnik–Chervonenkis dimension. Shalev-Shwartz and Ben-David present comprehensive material on multi-class learning and the Natarajan dimension, including uniform convergence and learnability. Recently, Cohen et al showed that the Natarajan dimension is the dominant term governing agnostic multi-class PAC learnability.

    Read more →
  • Growth function

    Growth function

    The growth function, also called the shatter coefficient or the shattering number, measures the richness of a set family or class of functions. It is especially used in the context of statistical learning theory, where it is used to study properties of statistical learning methods. The term 'growth function' was coined by Vapnik and Chervonenkis in their 1968 paper, where they also proved many of its properties. It is a basic concept in machine learning. == Definitions == === Set-family definition === Let H {\displaystyle H} be a set family (a set of sets) and C {\displaystyle C} a set. Their intersection is defined as the following set-family: H ∩ C := { h ∩ C ∣ h ∈ H } {\displaystyle H\cap C:=\{h\cap C\mid h\in H\}} The intersection-size (also called the index) of H {\displaystyle H} with respect to C {\displaystyle C} is | H ∩ C | {\displaystyle |H\cap C|} . If a set C m {\displaystyle C_{m}} has m {\displaystyle m} elements then the index is at most 2 m {\displaystyle 2^{m}} . If the index is exactly 2m then the set C {\displaystyle C} is said to be shattered by H {\displaystyle H} , because H ∩ C {\displaystyle H\cap C} contains all the subsets of C {\displaystyle C} , i.e.: | H ∩ C | = 2 | C | , {\displaystyle |H\cap C|=2^{|C|},} The growth function measures the size of H ∩ C {\displaystyle H\cap C} as a function of | C | {\displaystyle |C|} . Formally: Growth ⁡ ( H , m ) := max C : | C | = m | H ∩ C | {\displaystyle \operatorname {Growth} (H,m):=\max _{C:|C|=m}|H\cap C|} === Hypothesis-class definition === Equivalently, let H {\displaystyle H} be a hypothesis-class (a set of binary functions) and C {\displaystyle C} a set with m {\displaystyle m} elements. The restriction of H {\displaystyle H} to C {\displaystyle C} is the set of binary functions on C {\displaystyle C} that can be derived from H {\displaystyle H} : H C := { ( h ( x 1 ) , … , h ( x m ) ) ∣ h ∈ H , x i ∈ C } {\displaystyle H_{C}:=\{(h(x_{1}),\ldots ,h(x_{m}))\mid h\in H,x_{i}\in C\}} The growth function measures the size of H C {\displaystyle H_{C}} as a function of | C | {\displaystyle |C|} : Growth ⁡ ( H , m ) := max C : | C | = m | H C | {\displaystyle \operatorname {Growth} (H,m):=\max _{C:|C|=m}|H_{C}|} == Examples == 1. The domain is the real line R {\displaystyle \mathbb {R} } . The set-family H {\displaystyle H} contains all the half-lines (rays) from a given number to positive infinity, i.e., all sets of the form { x > x 0 ∣ x ∈ R } {\displaystyle \{x>x_{0}\mid x\in \mathbb {R} \}} for some x 0 ∈ R {\displaystyle x_{0}\in \mathbb {R} } . For any set C {\displaystyle C} of m {\displaystyle m} real numbers, the intersection H ∩ C {\displaystyle H\cap C} contains m + 1 {\displaystyle m+1} sets: the empty set, the set containing the largest element of C {\displaystyle C} , the set containing the two largest elements of C {\displaystyle C} , and so on. Therefore: Growth ⁡ ( H , m ) = m + 1 {\displaystyle \operatorname {Growth} (H,m)=m+1} . The same is true whether H {\displaystyle H} contains open half-lines, closed half-lines, or both. 2. The domain is the segment [ 0 , 1 ] {\displaystyle [0,1]} . The set-family H {\displaystyle H} contains all the open sets. For any finite set C {\displaystyle C} of m {\displaystyle m} real numbers, the intersection H ∩ C {\displaystyle H\cap C} contains all possible subsets of C {\displaystyle C} . There are 2 m {\displaystyle 2^{m}} such subsets, so Growth ⁡ ( H , m ) = 2 m {\displaystyle \operatorname {Growth} (H,m)=2^{m}} . 3. The domain is the Euclidean space R n {\displaystyle \mathbb {R} ^{n}} . The set-family H {\displaystyle H} contains all the half-spaces of the form: x ⋅ ϕ ≥ 1 {\displaystyle x\cdot \phi \geq 1} , where ϕ {\displaystyle \phi } is a fixed vector. Then Growth ⁡ ( H , m ) = Comp ⁡ ( n , m ) {\displaystyle \operatorname {Growth} (H,m)=\operatorname {Comp} (n,m)} , where Comp is the number of components in a partitioning of an n-dimensional space by m hyperplanes. 4. The domain is the real line R {\displaystyle \mathbb {R} } . The set-family H {\displaystyle H} contains all the real intervals, i.e., all sets of the form { x ∈ [ x 0 , x 1 ] | x ∈ R } {\displaystyle \{x\in [x_{0},x_{1}]|x\in \mathbb {R} \}} for some x 0 , x 1 ∈ R {\displaystyle x_{0},x_{1}\in \mathbb {R} } . For any set C {\displaystyle C} of m {\displaystyle m} real numbers, the intersection H ∩ C {\displaystyle H\cap C} contains all runs of between 0 and m {\displaystyle m} consecutive elements of C {\displaystyle C} . The number of such runs is ( m + 1 2 ) + 1 {\displaystyle {m+1 \choose 2}+1} , so Growth ⁡ ( H , m ) = ( m + 1 2 ) + 1 {\displaystyle \operatorname {Growth} (H,m)={m+1 \choose 2}+1} . == Polynomial or exponential == The main property that makes the growth function interesting is that it can be either polynomial or exponential - nothing in-between. The following is a property of the intersection-size: If, for some set C m {\displaystyle C_{m}} of size m {\displaystyle m} , and for some number n ≤ m {\displaystyle n\leq m} , | H ∩ C m | ≥ Comp ⁡ ( n , m ) {\displaystyle |H\cap C_{m}|\geq \operatorname {Comp} (n,m)} - then, there exists a subset C n ⊆ C m {\displaystyle C_{n}\subseteq C_{m}} of size n {\displaystyle n} such that | H ∩ C n | = 2 n {\displaystyle |H\cap C_{n}|=2^{n}} . This implies the following property of the Growth function. For every family H {\displaystyle H} there are two cases: The exponential case: Growth ⁡ ( H , m ) = 2 m {\displaystyle \operatorname {Growth} (H,m)=2^{m}} identically. The polynomial case: Growth ⁡ ( H , m ) {\displaystyle \operatorname {Growth} (H,m)} is majorized by Comp ⁡ ( n , m ) ≤ m n + 1 {\displaystyle \operatorname {Comp} (n,m)\leq m^{n}+1} , where n {\displaystyle n} is the smallest integer for which Growth ⁡ ( H , n ) < 2 n {\displaystyle \operatorname {Growth} (H,n)<2^{n}} . == Other properties == === Trivial upper bound === For any finite H {\displaystyle H} : Growth ⁡ ( H , m ) ≤ | H | {\displaystyle \operatorname {Growth} (H,m)\leq |H|} since for every C {\displaystyle C} , the number of elements in H ∩ C {\displaystyle H\cap C} is at most | H | {\displaystyle |H|} . Therefore, the growth function is mainly interesting when H {\displaystyle H} is infinite. === Exponential upper bound === For any nonempty H {\displaystyle H} : Growth ⁡ ( H , m ) ≤ 2 m {\displaystyle \operatorname {Growth} (H,m)\leq 2^{m}} I.e, the growth function has an exponential upper-bound. We say that a set-family H {\displaystyle H} shatters a set C {\displaystyle C} if their intersection contains all possible subsets of C {\displaystyle C} , i.e. H ∩ C = 2 C {\displaystyle H\cap C=2^{C}} . If H {\displaystyle H} shatters C {\displaystyle C} of size m {\displaystyle m} , then Growth ⁡ ( H , C ) = 2 m {\displaystyle \operatorname {Growth} (H,C)=2^{m}} , which is the upper bound. === Cartesian intersection === Define the Cartesian intersection of two set-families as: H 1 ⨂ H 2 := { h 1 ∩ h 2 ∣ h 1 ∈ H 1 , h 2 ∈ H 2 } {\displaystyle H_{1}\bigotimes H_{2}:=\{h_{1}\cap h_{2}\mid h_{1}\in H_{1},h_{2}\in H_{2}\}} . Then: Growth ⁡ ( H 1 ⨂ H 2 , m ) ≤ Growth ⁡ ( H 1 , m ) ⋅ Growth ⁡ ( H 2 , m ) {\displaystyle \operatorname {Growth} (H_{1}\bigotimes H_{2},m)\leq \operatorname {Growth} (H_{1},m)\cdot \operatorname {Growth} (H_{2},m)} === Union === For every two set-families: Growth ⁡ ( H 1 ∪ H 2 , m ) ≤ Growth ⁡ ( H 1 , m ) + Growth ⁡ ( H 2 , m ) {\displaystyle \operatorname {Growth} (H_{1}\cup H_{2},m)\leq \operatorname {Growth} (H_{1},m)+\operatorname {Growth} (H_{2},m)} === VC dimension === The VC dimension of H {\displaystyle H} is defined according to these two cases: In the polynomial case, VCDim ⁡ ( H ) = n − 1 {\displaystyle \operatorname {VCDim} (H)=n-1} = the largest integer d {\displaystyle d} for which Growth ⁡ ( H , d ) = 2 d {\displaystyle \operatorname {Growth} (H,d)=2^{d}} . In the exponential case VCDim ⁡ ( H ) = ∞ {\displaystyle \operatorname {VCDim} (H)=\infty } . So VCDim ⁡ ( H ) ≥ d {\displaystyle \operatorname {VCDim} (H)\geq d} if-and-only-if Growth ⁡ ( H , d ) = 2 d {\displaystyle \operatorname {Growth} (H,d)=2^{d}} . The growth function can be regarded as a refinement of the concept of VC dimension. The VC dimension only tells us whether Growth ⁡ ( H , d ) {\displaystyle \operatorname {Growth} (H,d)} is equal to or smaller than 2 d {\displaystyle 2^{d}} , while the growth function tells us exactly how Growth ⁡ ( H , m ) {\displaystyle \operatorname {Growth} (H,m)} changes as a function of m {\displaystyle m} . Another connection between the growth function and the VC dimension is given by the Sauer–Shelah lemma: If VCDim ⁡ ( H ) = d {\displaystyle \operatorname {VCDim} (H)=d} , then: for all m {\displaystyle m} : Growth ⁡ ( H , m ) ≤ ∑ i = 0 d ( m i ) {\displaystyle \operatorname {Growth} (H,m)\leq \sum _{i=0}^{d}{m \choose i}} In particular, for all m > d + 1 {\displaystyle m>d+1} : Growth ⁡ ( H , m ) ≤ ( e m / d ) d = O ( m d ) {\displaystyle \operatorname {Growth} (H,m)\leq (

    Read more →
  • Deep Learning Anti-Aliasing

    Deep Learning Anti-Aliasing

    Deep Learning Anti-Aliasing (DLAA) is a form of spatial anti-aliasing developed by Nvidia. DLAA depends on and requires Tensor Cores available in Nvidia RTX cards. DLAA is similar to Deep Learning Super Sampling (DLSS) in its anti-aliasing method, with one important differentiation being that the goal of DLSS is to increase performance at the cost of image quality, whereas the main priority of DLAA is improving image quality at the cost of performance (irrelevant of resolution upscaling or downscaling). DLAA is similar to temporal anti-aliasing (TAA) in that they are both spatial anti-aliasing solutions relying on past frame data. Compared to TAA, DLAA is substantially better when it comes to shimmering, flickering, and handling small meshes like wires. == Technical overview == DLAA collects game rendering data including raw low-resolution input, motion vectors, depth buffers, and exposure information. This information feeds into a convolutional neural network that processes the image to reduce aliasing while preserving fine detail. The neural network architecture employs an auto-encoder design trained on high-quality reference images. The training dataset includes diverse scenarios focusing on challenging cases like sub-pixel details, high-contrast edges, and transparent surfaces. The network then processes frames in real-time. Unlike traditional anti-aliasing solutions that rely on manually written heuristics, such as TAA, DLAA uses its neural network to preserve fine details while eliminating unwanted visual artifacts. == History == DLAA was initially called and marketed by Nvidia as DLSS 2x. The first game that added support for DLAA was The Elder Scrolls Online, which implemented the feature in 2021. By June 2022, DLAA was only available in six games. This number rose to 17 by February 2023. In June 2023, TechPowerUp reported that "DLAA is seeing sluggish adoption among game developers", and that Nvidia was working on adding DLAA to the quality presets of DLSS to boost adoption. By December 2023, DLAA was supported in 41 games. In early 2025, an update for the Nvidia App added a driver-based DLSS override feature that enables users to activate DLAA even in games that do not support it natively. == Differences between TAA and DLAA == TAA is used in many modern video games and game engines; however, all previous implementations have used some form of manually written heuristics to prevent temporal artifacts such as ghosting and flickering. One example of this is neighborhood clamping which forcefully prevents samples collected in previous frames from deviating too much compared to nearby pixels in newer frames. This helps to identify and fix many temporal artifacts, but deliberately removing fine details in this way is analogous to applying a blur filter, and thus the final image can appear blurry when using this method. DLAA uses an auto-encoder convolutional neural network trained to identify and fix temporal artifacts, instead of manually programmed heuristics as mentioned above. Because of this, DLAA can generally resolve detail better than other TAA and TAAU implementations, while also removing most temporal artifacts. == Differences between DLSS and DLAA == While DLSS handles upscaling with a focus on performance, DLAA handles anti-aliasing with a focus on visual quality. DLAA runs at the given screen resolution with no upscaling or downscaling functionality provided by DLAA. DLSS and DLAA share the same AI-driven anti-aliasing method. As such, DLAA functions like DLSS without the upscaling part. Both are made by Nvidia and require Tensor Cores. However, DLSS and DLAA cannot be enabled at the same time, only one can be selected depending on whether performance or image quality is prioritized. == Reception == TechPowerUp found that "[c]ompared to TAA and DLSS, DLAA is clearly producing the best image quality, especially at lower resolutions", arguing that, while "DLSS was already doing a better job than TAA at reconstructing small objects", "DLAA does an even better job". In a Cyberpunk 2077 performance test, IGN stated that "DLAA provided somewhat similar results [FPS wise] to the normal raster mode in most cases but got significant performance boost with the help of frame generation", a feature not available when using native resolution. Rock Paper Shotgun noted that, while DLAA is "not a completely perfect form of anti-aliasing, as the occasional jaggies are present", it "looks a lot sharper overall [than TAA], and especially in motion." According to PC World, "DLAA offers very good anti-aliasing without losing visual information — alternatives like TAA tend to struggle during motion-filled scenes, where DLAA doesn’t. Furthermore, DLAA’s loss of performance is lower than with conventional anti-aliasing methods."

    Read more →
  • Cartesian genetic programming

    Cartesian genetic programming

    Cartesian genetic programming is a form of genetic programming that uses a graph representation to encode computer programs. It grew from a method of evolving digital circuits developed by Julian F. Miller and Peter Thomson in 1997. The term ‘Cartesian genetic programming’ first appeared in 1999 and was proposed as a general form of genetic programming in 2000. It is called ‘Cartesian’ because it represents a program using a two-dimensional grid of nodes. Miller's keynote explains how CGP works. He edited a book entitled Cartesian Genetic Programming, published in 2011 by Springer. The open source project dCGP implements a differentiable version of CGP developed at the European Space Agency by Dario Izzo, Francesco Biscani and Alessio Mereta able to approach symbolic regression tasks, to find solution to differential equations, find prime integrals of dynamical systems, represent variable topology artificial neural networks and more.

    Read more →
  • Causal Markov condition

    Causal Markov condition

    The Causal Markov (CM) condition states that, conditional on the set of all its direct causes, a node is independent of all variables which are not effects or direct causes of that node. In the event that the structure of a Bayesian network accurately depicts causality, the two conditions are equivalent. This is related to the Markov condition, an assumption made in Bayesian probability theory, that every node in a Bayesian network is conditionally independent of its nondescendants, given its parents. Stated loosely, it is assumed that a node has no bearing on nodes which do not descend from it. In a DAG, this local Markov condition is equivalent to the global Markov condition, which states that d-separations in the graph also correspond to conditional independence relations. This also means that a node is conditionally independent of the entire network, given its Markov blanket. A network may accurately embody the Markov condition without depicting causality, in which case it should not be assumed to embody the causal Markov condition. == Motivation == Statisticians are enormously interested in the ways in which certain events and variables are connected. The precise notion of what constitutes a cause and effect is necessary to understand the connections between them. The central idea behind the philosophical study of probabilistic causation is that causes raise the probabilities of their effects, all else being equal. A deterministic interpretation of causation means that if A causes B, then A must always be followed by B. In this sense, smoking does not cause cancer because some smokers never develop cancer. On the other hand, a probabilistic interpretation simply means that causes raise the probability of their effects. In this sense, changes in meteorological readings associated with a storm do cause that storm, since they raise its probability. (However, simply looking at a barometer does not change the probability of the storm, for a more detailed analysis, see:). == Examples == In a simple view, releasing one's hand from a hammer causes the hammer to fall. However, doing so in outer space does not produce the same outcome, calling into question if releasing one's fingers from a hammer always causes it to fall. A causal graph could be created to acknowledge that both the presence of gravity and the release of the hammer contribute to its falling. However, it would be very surprising if the surface underneath the hammer affected its falling. This essentially states the Causal Markov Condition, that given the existence of gravity the release of the hammer, it will fall regardless of what is beneath it. == Implications == === Dependence and Causation === It follows from the definition that if X and Y are in V and are probabilistically dependent, then either X causes Y, Y causes X, or X and Y are both effects of some common cause Z in V. This definition was seminally introduced by Hans Reichenbach as the Common Cause Principle (CCP). === Screening === It once again follows from the definition that the parents of X screen X from other "indirect causes" of X (parents of Parents(X)) and other effects of Parents(X) which are not also effects of X.

    Read more →
  • Variational autoencoder

    Variational autoencoder

    In machine learning, a variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling in 2013. It is part of the families of probabilistic graphical models and variational Bayesian methods. In addition to being seen as an autoencoder neural network architecture, variational autoencoders can also be studied within the mathematical formulation of variational Bayesian methods, connecting a neural encoder network to its decoder through a probabilistic latent space (for example, as a multivariate Gaussian distribution) that corresponds to the parameters of a variational distribution. Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution within the latent space, rather than to a single point in that space. The decoder has the opposite function, which is to map from the latent space to the input space, again according to a distribution (although in practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of a single point, the network can avoid overfitting the training data. Both networks are typically trained together with the usage of the reparameterization trick, although the variance of the noise model can be learned separately. Although this type of model was initially designed for unsupervised learning, its effectiveness has been proven for semi-supervised learning and supervised learning. == Overview of architecture and operation == A variational autoencoder is a generative model with a prior and noise distribution respectively. Usually such models are trained using the expectation-maximization meta-algorithm (e.g. probabilistic PCA, (spike & slab) sparse coding). Such a scheme optimizes a lower bound of the data likelihood, which is usually computationally intractable, and in doing so requires the discovery of q-distributions, or variational posteriors. These q-distributions are normally parameterized for each individual data point in a separate optimization process. However, variational autoencoders use a neural network as an amortized approach to jointly optimize across data points. In that way, the same parameters are reused for multiple data points, which can result in massive memory savings. The first neural network takes as input the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder. The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. as the means of the noise distribution. It is possible to use another neural network that maps to the variance, however this can be omitted for simplicity. In such a case, the variance can be optimized with gradient descent. To optimize this model, one needs to know two terms: the "reconstruction error", and the Kullback–Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and the assumed prior of the data, here referred to as p-distribution. For example, a standard VAE task such as IMAGENET is typically assumed to have a gaussianly distributed noise; however, tasks such as binarized MNIST require a Bernoulli noise. The KL-D from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value. More recent approaches replace Kullback–Leibler divergence (KL-D) with various statistical distances, see "Statistical distance VAE variants" below. == Formulation == From the point of view of probabilistic modeling, one wants to maximize the likelihood of the data x {\displaystyle x} by their chosen parameterized probability distribution p θ ( x ) = p ( x | θ ) {\displaystyle p_{\theta }(x)=p(x|\theta )} . This distribution is usually chosen to be a Gaussian N ( x | μ , σ ) {\displaystyle N(x|\mu ,\sigma )} which is parameterized by μ {\displaystyle \mu } and σ {\displaystyle \sigma } respectively, and as a member of the exponential family it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize, however distributions where a prior is assumed over the latents z {\displaystyle z} results in intractable integrals. Let us find p θ ( x ) {\displaystyle p_{\theta }(x)} via marginalizing over z {\displaystyle z} . p θ ( x ) = ∫ z p θ ( x , z ) d z , {\displaystyle p_{\theta }(x)=\int _{z}p_{\theta }({x,z})\,dz,} where p θ ( x , z ) {\displaystyle p_{\theta }({x,z})} represents the joint distribution under p θ {\displaystyle p_{\theta }} of the observable data x {\displaystyle x} and its latent representation or encoding z {\displaystyle z} . According to the chain rule, the equation can be rewritten as p θ ( x ) = ∫ z p θ ( x | z ) p θ ( z ) d z {\displaystyle p_{\theta }(x)=\int _{z}p_{\theta }({x|z})p_{\theta }(z)\,dz} In the vanilla variational autoencoder, z {\displaystyle z} is usually taken to be a finite-dimensional vector of real numbers, and p θ ( x | z ) {\displaystyle p_{\theta }({x|z})} to be a Gaussian distribution. Then p θ ( x ) {\displaystyle p_{\theta }(x)} is a mixture of Gaussian distributions. It is now possible to define the set of the relationships between the input data and its latent representation as Prior p θ ( z ) {\displaystyle p_{\theta }(z)} Likelihood p θ ( x | z ) {\displaystyle p_{\theta }(x|z)} Posterior p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} Unfortunately, the computation of p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} is expensive and in most cases intractable. To speed up the calculus to make it feasible, it is necessary to introduce a further function to approximate the posterior distribution as q ϕ ( z | x ) ≈ p θ ( z | x ) {\displaystyle q_{\phi }({z|x})\approx p_{\theta }({z|x})} with ϕ {\displaystyle \phi } defined as the set of real values that parametrize q {\displaystyle q} . This is sometimes called amortized inference, since by "investing" in finding a good q ϕ {\displaystyle q_{\phi }} , one can later infer z {\displaystyle z} from x {\displaystyle x} quickly without doing any integrals. In this way, the problem is to find a good probabilistic autoencoder, in which the conditional likelihood distribution p θ ( x | z ) {\displaystyle p_{\theta }(x|z)} is computed by the probabilistic decoder, and the approximated posterior distribution q ϕ ( z | x ) {\displaystyle q_{\phi }(z|x)} is computed by the probabilistic encoder. Parametrize the encoder as E ϕ {\displaystyle E_{\phi }} , and the decoder as D θ {\displaystyle D_{\theta }} . == Evidence lower bound (ELBO) == Like many deep learning approaches that use gradient-based optimization, VAEs require a differentiable loss function to update the network weights through backpropagation. For variational autoencoders, the idea is to jointly optimize the generative model parameters θ {\displaystyle \theta } to reduce the reconstruction error between the input and the output, and ϕ {\displaystyle \phi } to make q ϕ ( z | x ) {\displaystyle q_{\phi }({z|x})} as close as possible to p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} . As reconstruction loss, mean squared error and cross entropy are often used. The Kullback–Leibler divergence D K L ( q ϕ ( z | x ) ∥ p θ ( z | x ) ) {\displaystyle D_{KL}(q_{\phi }({z|x})\parallel p_{\theta }({z|x}))} can be used as a loss function to squeeze q ϕ ( z | x ) {\displaystyle q_{\phi }({z|x})} under p θ ( z | x ) {\displaystyle p_{\theta }(z|x)} . This divergence loss expands to D K L ( q ϕ ( z | x ) ∥ p θ ( z | x ) ) = E z ∼ q ϕ ( ⋅ | x ) [ ln ⁡ q ϕ ( z | x ) p θ ( z | x ) ] = E z ∼ q ϕ ( ⋅ | x ) [ ln ⁡ q ϕ ( z | x ) p θ ( x ) p θ ( x , z ) ] = ln ⁡ p θ ( x ) + E z ∼ q ϕ ( ⋅ | x ) [ ln ⁡ q ϕ ( z | x ) p θ ( x , z ) ] . {\displaystyle {\begin{aligned}D_{KL}(q_{\phi }({z|x})\parallel p_{\theta }({z|x}))&=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }(z|x)}{p_{\theta }(z|x)}}\right]\\&=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }({z|x})p_{\theta }(x)}{p_{\theta }(x,z)}}\right]\\&=\ln p_{\theta }(x)+\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {q_{\phi }({z|x})}{p_{\theta }(x,z)}}\right].\end{aligned}}} Now, define the evidence lower bound (ELBO): L θ , ϕ ( x ) := E z ∼ q ϕ ( ⋅ | x ) [ ln ⁡ p θ ( x , z ) q ϕ ( z | x ) ] = ln ⁡ p θ ( x ) − D K L ( q ϕ ( ⋅ | x ) ∥ p θ ( ⋅ | x ) ) {\displaystyle L_{\theta ,\phi }(x):=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {p_{\theta }(x,z)}{q_{\phi }({z|x})}}\right]=\ln p_{\theta }(x)-D_{KL}(q_{\phi }({\cdot |x})\parallel p_{\theta }({\cdot |x}))} Maximizing the ELBO θ ∗ , ϕ ∗ = argmax θ , ϕ L θ , ϕ ( x ) {\dis

    Read more →
  • Catie Cuan

    Catie Cuan

    Catie Cuan is an artist, entrepeuneur, and innovator in the field of robotic art and human-robot interaction, where she specializes in choreorobotics, an emerging field at the intersection of choreographic dance and robotics. Catie Cuan is currently one of the academic researchers pioneering the field of choreorobotics and currently holds a post-doctoral fellowship at Stanford University. == Career == Catie Cuan earned a bachelor's degree from the University of California, Berkeley. She graduated with a Ph.D. from the Department of Mechanical Engineering at Stanford University, focusing in robotics. Her most cited publication is about how to improve robotic expressive systems using tools from dance theory, such as the Laban/Bartenieff Movement Analysis. In her most recent research projects, she explores a predictive model of imitation learning for robots moving around humans, a project that advances the field of social robotics. Cuan credits her work in robotics to the experience with her father when he had a stroke and was surrounded by many medical machines, which made her think about how people might feel empowered and hopeful rather than afraid. As a ballet dancer and choreographer, she has performed with the Metropolitan Opera Ballet and the Lyric Opera of Chicago. In 2020, she was the dancer and choreographer of the show Output, which was part of a collaboration with ThoughtWorks Arts and the Pratt Institute. In the production, she danced with an ABB IRB 6700 industrial robot. In 2022, she was named as an IF/THEN ambassador for the American Association for the Advancement of Science. The same year, she was appointed Futurist-in-Residence at the Smithsonian Arts and Industries Building, where she performed at the closing ceremonies of the FUTURES exhibit on July 6, 2022. Cuan has also contributed to product designs, working with IDEO and Dutch interior design firm moooi on their Piro project, which launched a dancing scent diffuser robot during Milan Design Week in June 2022. She is a TED speaker with talks about how to teach robots to dance, and what is coming up for dancing robots in the AI era.

    Read more →
  • Defining length

    Defining length

    In the field of genetic algorithms, a schema (plural: schemata) is a template that represents a subset of potential solutions. These templates use fixed symbols (e.g., `0` or `1`) for specific positions and a wildcard or "don't care" symbol (often `#` or ``) for others. The defining length of a schema, denoted as L(H), measures the distance between the outermost fixed positions in the template. According to the Schema theorem, a schema with a shorter defining length is less likely to be disrupted by the genetic operator of crossover. As a result, short schemata are considered more robust and are more likely to be propagated to the next generation. In genetic programming, where solutions are often represented as trees, the defining length is the number of links in the minimum tree fragment that includes all the non-wildcard symbols within a schema H. == Example == The defining length is calculated by subtracting the position of the first fixed symbol from the position of the last one. Using 1-based indexing for a string of length 5: The schema `1##0#` has its first fixed symbol (`1`) at position 1 and its last fixed symbol (`0`) at position 4. Its defining length is 4 − 1 = 3. The schema `00##0` has its first fixed symbol at position 1 and its last at position 5. Its defining length is 5 − 1 = 4. The schema `##0##` has only one fixed symbol at position 3. The first and last fixed positions are the same, so its defining length is 3 − 3 = 0.

    Read more →
  • Operational taxonomic unit

    Operational taxonomic unit

    An operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals. The term was originally introduced in 1963 by Robert R. Sokal and Peter H. A. Sneath in the context of numerical taxonomy, where an "operational taxonomic unit" is simply the group of organisms currently being studied. In this sense, an OTU is a pragmatic definition to group individuals by similarity, equivalent to but not necessarily in line with classical Linnaean taxonomy or modern evolutionary taxonomy. Nowadays, however, the term is commonly used in a different context and refers to clusters of (uncultivated or unknown) organisms, grouped by DNA sequence similarity of a specific taxonomic marker gene (originally coined as mOTU; molecular OTU). In other words, OTUs are pragmatic proxies for "species" at different taxonomic levels, in the absence of traditional systems of biological classification as are available for macroscopic organisms. For several years, OTUs have been the most commonly used units of diversity, especially when analysing small subunit 16S (for prokaryotes) or 18S rRNA (for eukaryotes) marker gene sequence datasets. == Molecular OTU by clustering of marker gene sequences == In the approach represented by DNA barcoding, a particular locus is chosen to be used as the marker gene for classification. This locus should be universally present in the scope selected, variable enough to be different among close-related species, and be flanked by conservative sequences that allow for easy amplification and detection. There are databases containing sequences for such marker genes from many different species, allowing for comparison. (Sometimes only using one locus does not provide sufficient resolution, so multiple marker genes are used. This is the case for plants, where rbcL+matK is common.) Sequences obtained this way can be clustered according to their similarity to one another, and operational taxonomic units are defined based on the similarity threshold set by the researcher. The exact threshold depends on the taxa in question and the mutational rates of the selected locus in the taxon. 97–99% are commonly used, but "it is now recognized to be somewhat arbitrary as sequence variation within and among species varies across taxa". 100% similarity (fully identical) is also common, also known as single variants. It remains debatable how well this commonly used method recapitulates true microbial species phylogeny or ecology. Although OTUs can be calculated differently when using different algorithms or thresholds, research by Schmidt et al. (2014) demonstrated that 16S-derived microbial OTUs were generally ecologically consistent across habitats and several clustering approaches. The number of OTUs defined may be inflated due to errors in DNA sequencing. === OTU clustering approaches === There are three main approaches to clustering OTUs: De novo, for which the clustering is based on similarities between sequencing reads. Closed-reference, for which the clustering is performed against a reference database of sequences. Open-reference, where clustering is first performed against a reference database of sequences, then any remaining sequences that could not be mapped to the reference are clustered de novo. Using a reference provides taxonomic context for the OTUs found. Alternatively, taxonomic context can be found after the construction of clusters by comparing representative sequences from clusters against a reference database. There are also specialized classifiers for this purpose which are much faster than naive comparison using BLAST. === OTU clustering algorithms === Hierarchical clustering algorithms (HCA): uclust & cd-hit & ESPRIT Bayesian clustering: CROP == Molecular OTU by other methods == In addition to similarity-based grouping, marker gene sequences can be sorted into OTUs using molecular phylogeny, k-mer composition, or hybrid methods combining these methods with similarity. There are also Bayesian tree-less methods and machine learning approaches. Using phylogeny often involves manually assigning terminal clades or single nodes to an OTU, so this is usually only done for refinement. Genome skimming can be used to obtain high-copy DNA without the need to choose marker genes or to design PCR primers for the chosen genes. It can provide fairly good coverage of organelle DNA and repetitive elements such as ribosomal DNA, both of which can be used like marker genes in OTU analysis. Whole-genome sequencing is more expensive and involves the production and processing of more data. By considering the entire genome, many (sometimes over 100) marker genes can be used at the same time, producing highly resolved phylogenies that correctly identify problematic taxa. It is also possible to use entire genomes for OTU assignment. For example, genomes from different bacterial species almost always have an average nucleotide identity lower than 95%, a fact that can be used to define new OTUs (and likely new species).

    Read more →