AI Assistant Zia Spark Icon

AI Assistant Zia Spark Icon — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Class activation mapping


    Class activation mapping methods are explainable AI (XAI) techniques used to visualize the regions of an input image that are most relevant for a particular task, especially image classification, in convolutional neural networks (CNNs). These methods generate heatmaps by weighting the feature maps from a convolutional layer according to their relevance to the target class.
Machine learning and deep learning emerged within the field of artificial intelligence, generically defined as "the effort to automate intellectual tasks normally performed by humans". Both use statistical and computational methods to learn patterns from data, reducing the need for manually coded rules. Machine learning models are trained on input data and the corresponding known answers, learning the underlying patterns or structures present in the data. Traditional machine learning algorithms employ manually designed feature sets, creating a direct link between machine learning designers and the features employed. Deep learning is a subfield of machine learning based on successive layers of representation, in which the data is progressively transformed to extract relevant and informative patterns. Deep learning algorithms are feature learning algorithms: they automatically learn hierarchical feature representations from raw data, extracting increasingly abstract features through multiple layers.
CNNs are a specific architecture of deep learning models designed to process spatially structured data, such as images. They apply a series of convolution, non-linear activation and pooling operations to extract relevant features from the input data; these features are stored in so-called feature maps. CNNs have proven highly effective in a variety of computer vision and image processing tasks. CNNs (and deep learning models more broadly) are described as black boxes because of their complex and non-transparent internal layers of representation.
The need for clearer insight into their internal workings and decision-making process gave rise to XAI techniques. Among the XAI techniques proposed for computer vision tasks, class activation mapping methods can show which pixels in an input image are important to the predicted logit for a class of interest in a classification task. Class activation mapping methods were originally developed for class-discriminative scenarios, to visualize which parts of the input image influenced the classification decision, that is, to highlight the regions of the feature maps that contribute most strongly to the prediction of a given class. More advanced versions of these methods are not limited to image classification and have been extended to several other vision-related tasks, such as object detection, image captioning, visual question answering and image segmentation.
== Background ==
The following methods laid the groundwork for class activation mapping approaches, forming the conceptual basis of using gradients to highlight class-discriminative regions.
=== Class model visualization and saliency maps for convolutional neural networks ===
The class model visualization and image-specific saliency map approaches were presented in the foundational work "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps" by Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, which generalizes the deconvnet method by Zeiler and Fergus. Class model visualization synthesizes an artificial input image that strongly activates the output neuron associated with a target class. Given a trained, fixed model, the method starts from a zero-initialized image, backpropagates the gradient of the class score to the image pixels, and updates the pixels to increase that class score. Repeating this update yields an idealized prototype of the class of interest as encoded by the model.
The image-specific class saliency visualization method provides a visual explanation by highlighting the pixels of an image that are most relevant for predicting a class $C$ of interest. This is done by computing the gradient of the class score $S_C$ with respect to the input image, evaluated at the image $I_0$:
$$w_C = \left.\frac{\partial S_C}{\partial I}\right|_{I_0}$$
This approximates the model locally (around $I_0$) as linear, using a first-order Taylor expansion: $S_C(I) \approx w_C^T I + b$. The magnitude of the gradient $w_C$ indicates the importance of each pixel: larger gradients suggest greater influence on the prediction. Once the gradient is known, the saliency map is defined as the maximum absolute gradient across the color channels $c$:
$$M_{ij} = \max_{c} \left| \frac{\partial S_C}{\partial I_{ij}^{c}} \right|$$
resulting in a saliency map (i.e. a heatmap).
=== Guided backpropagation ===
Guided backpropagation was introduced by Springenberg et al. in "Striving For Simplicity: The All Convolutional Net"; like the saliency approach, it builds on the work by Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks". Its aim is to understand what a CNN is learning by visualizing the patterns that most strongly activate individual neurons (or filters), in architectures that do not rely on max-pooling layers. When propagating gradients back through a rectified linear unit (ReLU), guided backpropagation passes the gradient if and only if the input to the ReLU was positive (forward pass) and the incoming gradient is positive (backward signal). This masks both inactive neurons and negative gradients, which suppresses noise. The result is a sharper, high-resolution visualization of what each neuron responds to.
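The two rules described above, the channel-wise maximum used for the saliency map and the double gating used by guided backpropagation, can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names and the numeric values are hypothetical, and the gradients are supplied by hand rather than computed by a real network.

```python
import numpy as np

def saliency_map(grad):
    """Collapse a gradient of shape (H, W, channels) into a 2-D map by
    taking the maximum absolute value over the channel axis."""
    return np.abs(grad).max(axis=-1)

def relu_backward(relu_input, grad_output):
    """Ordinary ReLU backward pass: gate only on the sign of the forward input."""
    return grad_output * (relu_input > 0)

def guided_relu_backward(relu_input, grad_output):
    """Guided backpropagation rule: pass the gradient only where the forward
    input was positive AND the incoming gradient is positive."""
    return grad_output * (relu_input > 0) * (grad_output > 0)

# Hypothetical values, chosen by hand for illustration.
relu_input = np.array([1.5, -0.3, 2.0, 0.7])    # inputs seen in the forward pass
grad_output = np.array([0.4, 0.9, -1.2, 0.1])   # gradient arriving from the layer above

plain = relu_backward(relu_input, grad_output)           # keeps the negative gradient at index 2
guided = guided_relu_backward(relu_input, grad_output)   # also zeroes the negative gradient

# A hand-written gradient for a 1 x 2 image with three color channels.
grad_image = np.array([[[0.1, -0.5, 0.2],
                        [0.0, 0.3, -0.1]]])
M = saliency_map(grad_image)                             # shape (1, 2)
```

Comparing `plain` with `guided` shows the only difference between the two backward passes: the entry whose forward input was positive but whose incoming gradient was negative is kept by the ordinary rule and dropped by the guided one.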
Guided backpropagation is a simple and practical method for model interpretability, helping to understand how and where neural networks detect semantic concepts across layers. Moreover, because of its working principle, it can be applied to any network architecture.
== Base versions ==
Class activation mapping and gradient-weighted class activation mapping are the original and most widely used methods for visual explanations in convolutional neural networks. These methods serve as the foundation for many later developments in explainable AI.
Notation: in this article, the symbols i and j represent integer indices that disappear inside sums or averages, while x and y are the continuous (or up-sampled integer) coordinates of the final heatmap that is plotted.
=== Class activation mapping (CAM) ===
Class activation mapping (CAM) was the original CAM method and gave its name to the whole category. The approach was first introduced by Zhou et al. in their seminal work "Learning Deep Features for Discriminative Localization". It achieves class-specific heatmaps by modifying image classification CNN architectures, replacing fully-connected layers with convolutional layers and a final global average pooling layer. Its main purpose is to localize and highlight the discriminative regions of an input image that a CNN uses to identify a particular class, without needing explicit bounding box annotations.
==== Global average pooling (GAP) ====
Global average pooling (GAP) is the key element of the original CAM approach. It is a dimensionality reduction technique: like other pooling layers, it downsamples the feature maps by computing a representative value for a region of the feature map. What distinguishes GAP is that it computes a single value for an entire feature map, significantly reducing the model dimensions.
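As a rough illustration of the pooling step, the sketch below applies GAP to a stack of random feature maps and then, assuming a bias-free linear classifier placed directly after the pooling layer (the arrangement used in the CAM formulation of Zhou et al.), forms the class heatmap as the class-weighted sum of the feature maps. All array names, shapes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

num_maps, m, n, num_classes = 4, 7, 7, 3        # hypothetical sizes
A = rng.random((num_maps, m, n))                # feature maps of the last conv layer
W = rng.normal(size=(num_classes, num_maps))    # assumed bias-free classifier weights

# Global average pooling: one scalar per feature map.
F = A.mean(axis=(1, 2))                         # shape (num_maps,)

# Logits are a linear function of the pooled values.
logits = W @ F                                  # shape (num_classes,)

# Class activation map for the top-scoring class: weight every feature
# map by the classifier weight that connects it to that class.
c = int(np.argmax(logits))
cam = np.tensordot(W[c], A, axes=1)             # shape (m, n)
```

Because pooling and the classifier are both linear, the spatial mean of `cam` equals `logits[c]` in this bias-free setting, which is why the heatmap can be read as a spatial decomposition of the class score.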
==== Mathematical description ====
The mathematical description centers on the combination of convolutional and GAP layers. In CAM, the GAP layer must be placed after the last convolutional layer and before the final linear classifier layer. This classifier connects the GAP values to the output logits (the network predictions) $y^C$ through its respective fine-tuned weights $w_k^C$. Let $A^k$ denote the $k$-th feature map of the last convolutional layer. GAP produces one value for each feature map by averaging all the matrix elements $(i, j)$ of the feature map:
$$F^k = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}^{k}$$
with
$$A^k = \begin{bmatrix} A_{11}^{k} & A_{12}^{k} & \cdots & A_{1n}^{k} \\ A_{21}^{k} & A_{22}^{k} & \cdots & A_{2n}^{k} \\ \vdots & \vdots & \ddots & \vdots \\ A_{m1}^{k} & A_{m2}^{k} & \cdots & A_{mn}^{k} \end{bmatrix} = \left\{ A_{ij}^{k} \mid 1 \le i \le m,\ 1 \le j \le n \right\}$$

  • Ordination (statistics)


    Ordination or gradient analysis, in multivariate analysis, is a method complementary to data clustering, used mainly in exploratory data analysis (rather than in hypothesis testing). In contrast to cluster analysis, ordination orders quantities in a (usually lower-dimensional) latent space. In the ordination space, quantities that are near each other share attributes (i.e., are similar to some degree), and dissimilar objects are farther from each other. Such relationships between the objects, on each of several axes or latent variables, are then characterized numerically and/or graphically in a biplot. The first ordination method, principal components analysis, was suggested by Karl Pearson in 1901.
== Methods ==
Ordination methods can broadly be categorized into eigenvector-, algorithm-, or model-based methods. Many classical ordination techniques belong to the first group, including principal components analysis and correspondence analysis (CA) with its derivatives (detrended correspondence analysis, canonical correspondence analysis, and redundancy analysis). The second group includes some distance-based methods such as non-metric multidimensional scaling, and machine learning methods such as t-distributed stochastic neighbor embedding and nonlinear dimensionality reduction.
The third group includes model-based ordination methods, which can be considered multivariate extensions of generalized linear models. Model-based ordination methods are more flexible in their application than classical ordination methods, so that it is possible, for example, to include random effects. Unlike in the first two groups, there is no (implicit or explicit) distance measure in the ordination. Instead, a distribution needs to be specified for the responses, as is typical for statistical models. These and other assumptions, such as the assumed mean-variance relationship, can be validated with residual diagnostics, unlike in other ordination methods.
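As a minimal sketch of an eigenvector-based ordination, the following code computes a principal components ordination with a singular value decomposition. The function name `pca_ordination` and the simulated sites-by-species count table are hypothetical, and centering without scaling is an assumption; real analyses often standardize or transform the data first.

```python
import numpy as np

def pca_ordination(X, n_axes=2):
    """Project the rows of X (objects) onto the first n_axes principal axes."""
    Xc = X - X.mean(axis=0)                      # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_axes] * s[:n_axes]          # object coordinates in ordination space
    loadings = Vt[:n_axes].T                     # contribution of each variable to each axis
    explained = (s ** 2) / np.sum(s ** 2)        # share of total variance per axis
    return scores, loadings, explained[:n_axes]

# Hypothetical data: 10 sites (rows) by 6 species (columns) of simulated counts.
rng = np.random.default_rng(0)
X = rng.poisson(lam=5.0, size=(10, 6)).astype(float)

scores, loadings, explained = pca_ordination(X)
```

Plotting the two columns of `scores` against each other, with the rows of `loadings` drawn as arrows, gives the kind of biplot mentioned above: sites that lie close together have similar species composition.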
== Applications ==
Ordination can be used in the analysis of any set of multivariate objects. It is frequently used in several environmental or ecological sciences, particularly plant community ecology. It is also used in genetics and systems biology for microarray data analysis, and in psychometrics.

  • Apache Mahout


    Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily on linear algebra. In the past, many of the implementations used the Apache Hadoop platform; today the project is primarily focused on Apache Spark. Mahout also provides Java/Scala libraries for common math operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; a number of algorithms have been implemented.
== Features ==
=== Samsara ===
Apache Mahout-Samsara refers to a Scala domain-specific language (DSL) that allows users to use R-like syntax as opposed to traditional Scala-like syntax. This allows users to express algorithms concisely and clearly.
=== Backend agnostic ===
Apache Mahout's code abstracts the domain-specific language from the engine where the code is run. While active development is done with the Apache Spark engine, users are free to implement any engine they choose; H2O and Apache Flink have been implemented in the past, and examples exist in the code base.
=== GPU/CPU accelerators ===
Numerical computation on the JVM is notoriously slow. To improve speed, "native solvers" were added, which move in-core (and, by extension, distributed) BLAS operations out of the JVM, offloading them to off-heap or GPU memory for processing via multiple CPUs and/or CPU cores, or via GPUs when built against the ViennaCL library. ViennaCL is a highly optimized C++ library with BLAS operations implemented in OpenMP and OpenCL. As of release 14.1, the OpenMP build is considered stable, while the OpenCL build is still in its experimental proof-of-concept phase.
=== Recommenders ===
Apache Mahout features implementations of Alternating Least Squares, Co-Occurrence, and Correlated Co-Occurrence, a unique-to-Mahout recommender algorithm that extends co-occurrence to be used on multiple dimensions of data.
== History ==
=== Transition from MapReduce to Apache Spark ===
While Mahout's core algorithms for clustering, classification and batch-based collaborative filtering were implemented on top of Apache Hadoop using the map/reduce paradigm, the project did not restrict contributions to Hadoop-based implementations. Contributions that run on a single node or on a non-Hadoop cluster were also welcomed. For example, the 'Taste' collaborative-filtering recommender component of Mahout was originally a separate project and can run stand-alone without Hadoop.
Starting with release 0.10.0, the project shifted its focus to building a backend-independent programming environment, code-named "Samsara". The environment consists of an algebraic backend-independent optimizer and an algebraic Scala DSL unifying in-memory and distributed algebraic operators. Supported algebraic platforms are Apache Spark, H2O, and Apache Flink. Support for MapReduce algorithms started being gradually phased out in 2014.
=== Developers ===
Apache Mahout is developed by a community. The project is managed by a group called the "Project Management Committee" (PMC). The current PMC is Andrew Musselman, Andrew Palumbo, Drew Farris, Isabel Drost-Fromm, Jake Mannix, Pat Ferrel, Paritosh Ranjan, Trevor Grant, Robin Anil, Sebastian Schelter, Stevo Slavić.

  • Representer theorem


    In computer science, specifically in statistical learning theory, a representer theorem is any of several related results stating that a minimizer $f^{*}$ of a regularized empirical risk functional defined over a reproducing kernel Hilbert space can be represented as a finite linear combination of kernel products evaluated on the input points in the training set data.
== Formal statement ==
The following representer theorem and its proof are due to Schölkopf, Herbrich, and Smola:
Theorem: Consider a positive-definite real-valued kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ on a non-empty set $\mathcal{X}$ with a corresponding reproducing kernel Hilbert space $H_k$. Let there be given a training sample $(x_1, y_1), \dotsc, (x_n, y_n) \in \mathcal{X} \times \mathbb{R}$, a strictly increasing real-valued function $g \colon [0, \infty) \to \mathbb{R}$, and an arbitrary error function $E \colon (\mathcal{X} \times \mathbb{R}^2)^n \to \mathbb{R} \cup \lbrace \infty \rbrace$, which together define the following regularized empirical risk functional on $H_k$:
$$f \mapsto E\left((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\right) + g\left(\lVert f \rVert\right).$$
Then, any minimizer of the empirical risk
$$f^{*} = \underset{f \in H_k}{\operatorname{argmin}} \left\lbrace E\left((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\right) + g\left(\lVert f \rVert\right) \right\rbrace, \quad (*)$$
admits a representation of the form:
$$f^{*}(\cdot) = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i),$$
where $\alpha_i \in \mathbb{R}$ for all $1 \le i \le n$.
Proof: Define a mapping
$$\begin{aligned} \varphi \colon \mathcal{X} &\to H_k \\ \varphi(x) &= k(\cdot, x) \end{aligned}$$
(so that $\varphi(x) = k(\cdot, x)$ is itself a map $\mathcal{X} \to \mathbb{R}$). Since $k$ is a reproducing kernel,
$$\varphi(x)(x') = k(x', x) = \langle \varphi(x'), \varphi(x) \rangle,$$
where $\langle \cdot, \cdot \rangle$ is the inner product on $H_k$. Given any $x_1, \ldots, x_n$, one can use orthogonal projection to decompose any $f \in H_k$ into a sum of two functions, one lying in $\operatorname{span}\left\lbrace \varphi(x_1), \ldots, \varphi(x_n) \right\rbrace$, and the other lying in the orthogonal complement:
$$f = \sum_{i=1}^{n} \alpha_i \varphi(x_i) + v,$$
where $\langle v, \varphi(x_i) \rangle = 0$ for all $i$.
The above orthogonal decomposition and the reproducing property together show that applying $f$ to any training point $x_j$ produces
$$f(x_j) = \left\langle \sum_{i=1}^{n} \alpha_i \varphi(x_i) + v, \varphi(x_j) \right\rangle = \sum_{i=1}^{n} \alpha_i \langle \varphi(x_i), \varphi(x_j) \rangle,$$
which we observe is independent of $v$. Consequently, the value of the error function $E$ in $(*)$ is likewise independent of $v$. For the second term (the regularization term), since $v$ is orthogonal to $\sum_{i=1}^{n} \alpha_i \varphi(x_i)$ and $g$ is strictly monotonic, we have
$$\begin{aligned} g\left(\lVert f \rVert\right) &= g\left(\left\lVert \sum_{i=1}^{n} \alpha_i \varphi(x_i) + v \right\rVert\right) \\ &= g\left(\sqrt{\left\lVert \sum_{i=1}^{n} \alpha_i \varphi(x_i) \right\rVert^2 + \lVert v \rVert^2}\right) \\ &\ge g\left(\left\lVert \sum_{i=1}^{n} \alpha_i \varphi(x_i) \right\rVert\right). \end{aligned}$$
Therefore, setting $v = 0$ does not affect the first term of $(*)$, while it strictly decreases the second term whenever $v \neq 0$. Consequently, any minimizer $f^{*}$ in $(*)$ must have $v = 0$, i.e., it must be of the form
$$f^{*}(\cdot) = \sum_{i=1}^{n} \alpha_i \varphi(x_i) = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i),$$
which is the desired result.
== Generalizations ==
The theorem stated above is a particular example of a family of results that are collectively referred to as "representer theorems"; several of these are described here.
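As a concrete instance of the theorem, take $E$ to be the mean squared error and $g(t) = \lambda t^2$ with $\lambda > 0$. Within the span of the kernel functions, $\lVert f \rVert^2 = \alpha^T K \alpha$, where $K$ is the kernel matrix on the training points, so the infinite-dimensional problem reduces to a finite one in $\alpha$, solved by $\alpha = (K + n\lambda I)^{-1} y$. The sketch below illustrates this with a Gaussian kernel; the kernel choice, the simulated data and all names are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian kernel matrix between the rows of X1 and the rows of X2."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def objective(alpha, K, y, lam):
    """Regularized empirical risk of f = sum_i alpha_i k(., x_i)."""
    residual = K @ alpha - y
    return residual @ residual / len(y) + lam * alpha @ K @ alpha

# Hypothetical one-dimensional regression data.
rng = np.random.default_rng(0)
n = 20
X = rng.uniform(-3.0, 3.0, size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
lam = 0.1

K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)   # the n coefficients

def f_star(X_new):
    """Evaluate the minimizer at new points using only kernel evaluations
    against the n training inputs."""
    return rbf_kernel(X_new, X) @ alpha

J_star = objective(alpha, K, y, lam)
```

The fitted function is determined entirely by the `n` numbers in `alpha`, even though the search space $H_k$ is infinite-dimensional for the Gaussian kernel; this is the practical content of the theorem.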
The first statement of a representer theorem was due to Kimeldorf and Wahba for the special case in which
$$\begin{aligned} E\left((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\right) &= \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2, \\ g(\lVert f \rVert) &= \lambda \lVert f \rVert^2 \end{aligned}$$
for $\lambda > 0$. Schölkopf, Herbrich, and Smola generalized this result by relaxing the assumption of the squared-loss cost and allowing the regularizer to be any strictly monotonically increasing function $g(\cdot)$ of the Hilbert space norm.
It is possible to generalize further by augmenting the regularized empirical risk functional through the addition of unpenalized offset terms. For example, Schölkopf, Herbrich, and Smola also consider the minimization
$$\tilde{f}^{*} = \operatorname{argmin} \left\lbrace E\left((x_1, y_1, \tilde{f}(x_1)), \ldots, (x_n, y_n, \tilde{f}(x_n))\right) + g\left(\lVert f \rVert\right) \mid \tilde{f} = f + h \in H_k \oplus \operatorname{span}\lbrace \psi_p \mid 1 \le p \le M \rbrace \right\rbrace, \quad (\dagger)$$
i.e., we consider functions of the form $\tilde{f} = f + h$, where $f \in H_k$ and $h$ is an unpenalized function lying in the span of a finite set of real-valued functions $\lbrace \psi_p \colon \mathcal{X} \to \mathbb{R} \mid 1 \le p \le M \rbrace$.
Under the assumption that the $n \times M$ matrix $\left(\psi_p(x_i)\right)_{ip}$ has rank $M$, they show that the minimizer $\tilde{f}^{*}$ in $(\dagger)$ admits a representation of the form
$$\tilde{f}^{*}(\cdot) = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i) + \sum_{p=1}^{M} \beta_p \psi_p(\cdot)$$
where $\alpha_i, \beta_p \in \mathbb{R}$ and the $\beta_p$ are all uniquely determined.
The conditions under which a representer theorem exists were investigated by Argyriou, Micchelli, and Pontil, who proved the following:
Theorem: Let $\mathcal{X}$ be a nonempty set, $k$ a positive-definite real-valued kernel on $\mathcal{X} \times \mathcal{X}$ with corresponding reproducing kernel Hilbert space $H_k$, and let $R \colon H_k \to \mathbb{R}$ be a differentiable regularization function. Then given a training sample $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathbb{R}$ and an arbitrary error function $E \colon (\mathcal{X} \times \mathbb{R}^2)^n \to \mathbb{R} \cup \lbrace \infty \rbrace$, a minimizer
$$f^{*} = \underset{f \in H_k}{\operatorname{argmin}} \left\lbrace E\left((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\right) + R(f) \right\rbrace \quad (\ddagger)$$
of the regularized empirical risk admits a repr

  • Centurion Guard


    Centurion Guard is a PC hardware and software-based security product developed by Centurion Technologies. It was first released in 1996. There were several different releases and versions of this product, and many were distributed in computers donated to libraries by the Bill & Melinda Gates Foundation.
== Operating system compatibility ==
Microsoft Windows 7
Microsoft Windows Vista
Microsoft Windows XP

  • Almeida–Pineda recurrent backpropagation


    Almeida–Pineda recurrent backpropagation is an extension of the backpropagation algorithm that is applicable to recurrent neural networks. It is a type of supervised learning. It was described somewhat cryptically in Richard Feynman's senior thesis, and rediscovered independently in the context of artificial neural networks by both Fernando Pineda and Luis B. Almeida. A recurrent neural network for this algorithm consists of some input units, some output units and possibly some hidden units. For a given set of (input, target) states, the network is trained to settle into a stable activation state with the output units in the target state, based on a given input state clamped on the input units.

  • Random neural network


    The Random Neural Network (RNN) is a mathematical representation of an interconnected network of neurons or cells which exchange spiking signals. It was invented by Erol Gelenbe and is linked to the G-network model of queueing networks which Erol Gelenbe also invented, and with his Gene Regulatory Network models. In this model, each neuronal cell state is represented by an integer whose value rises when the cell receives an excitatory spike and drops when it receives an inhibitory spike. The spikes can originate outside the network itself, or they can come from other cells in the networks. Cells whose internal excitatory state has a positive value are allowed to send out spikes of either kind to other cells in the network according to specific cell-dependent spiking rates. The model has a mathematical solution in steady-state which provides the joint probability distribution of the network in terms of the individual probabilities that each cell is excited and able to send out spikes. Computing this solution is based on solving a set of non-linear algebraic equations whose parameters are related to the spiking rates of individual cells and their connectivity to other cells, as well as the arrival rates of spikes from outside the network. The RNN is a recurrent model, i.e. a neural network that is allowed to have complex feedback loops. A highly energy-efficient implementation of random neural networks was demonstrated by Krishna Palem et al. using the Probabilistic CMOS or PCMOS technology and was shown to be c. 226–300 times more efficient in terms of Energy-Performance-Product. RNNs are also related to artificial neural networks, which (like the random neural network) have gradient-based learning algorithms. The learning algorithm for an n-node random neural network that includes feedback loops (it is also a recurrent neural network) is of computational complexity O(n^3) (the number of computations is proportional to the cube of n, the number of neurons). 
The random neural network can also be used with other learning algorithms such as reinforcement learning. The RNN has been shown to be a universal approximator for bounded and continuous functions.
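The steady-state computation mentioned above can be sketched as a fixed-point iteration. The text does not spell out the non-linear equations, so the form used here is an assumption taken from the standard presentation of the model: the excitation probability of cell i is the total arrival rate of excitatory spikes divided by the sum of its firing rate and the total arrival rate of inhibitory spikes. The function name, the rate matrices and the assumption that every emitted spike goes to another cell are all hypothetical choices for the example.

```python
import numpy as np

def rnn_steady_state(Lambda, lam, W_plus, W_minus, n_iter=500, tol=1e-12):
    """Iterate q_i = excit_i / (r_i + inhib_i) until it stops changing.

    Lambda, lam     : external excitatory / inhibitory arrival rates per cell
    W_plus[j, i]    : rate of excitatory spikes sent from cell j to cell i
    W_minus[j, i]   : rate of inhibitory spikes sent from cell j to cell i
    Assumes (for this sketch) that no spikes leave the network, so the firing
    rate of a cell is the sum of its outgoing rates.
    """
    r = W_plus.sum(axis=1) + W_minus.sum(axis=1)
    q = np.zeros_like(Lambda, dtype=float)
    for _ in range(n_iter):
        excit = Lambda + q @ W_plus      # total excitatory arrival rate at each cell
        inhib = lam + q @ W_minus        # total inhibitory arrival rate at each cell
        q_new = excit / (r + inhib)
        converged = np.max(np.abs(q_new - q)) < tol
        q = q_new
        if converged:
            break
    return q

# Hypothetical three-cell network with feedback loops.
W_plus = np.array([[0.0, 0.2, 0.1],
                   [0.1, 0.0, 0.2],
                   [0.2, 0.1, 0.0]])
W_minus = np.array([[0.0, 0.1, 0.1],
                    [0.1, 0.0, 0.1],
                    [0.1, 0.1, 0.0]])
Lambda = np.array([0.2, 0.1, 0.15])
lam = np.array([0.1, 0.1, 0.1])

q = rnn_steady_state(Lambda, lam, W_plus, W_minus)
```

Plain iteration is used only because it is easy to read; it converges for these small rates but is not guaranteed to do so for every network. The values in `q` are meaningful as probabilities only when each one is below 1.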

  • Influence diagram


    An influence diagram (ID) (also called a relevance diagram, decision diagram or decision network) is a compact graphical and mathematical representation of a decision situation. It is a generalization of a Bayesian network, in which not only probabilistic inference problems but also decision-making problems (following the maximum expected utility criterion) can be modeled and solved. IDs were first developed in the mid-1970s by decision analysts, with an intuitive semantics that is easy to understand. They are now widely adopted and are becoming an alternative to the decision tree, which typically suffers from exponential growth in the number of branches with each variable modeled. IDs are directly applicable in team decision analysis, since they allow incomplete sharing of information among team members to be modeled and solved explicitly. Extensions of IDs also find use in game theory as an alternative representation of the game tree.
== Semantics ==
An ID is a directed acyclic graph with three types (plus one subtype) of node and three types of arc (or arrow) between nodes.
Nodes:
A decision node (corresponding to each decision to be made) is drawn as a rectangle.
An uncertainty node (corresponding to each uncertainty to be modeled) is drawn as an oval.
A deterministic node (corresponding to a special kind of uncertainty whose outcome is deterministically known whenever the outcomes of some other uncertainties are also known) is drawn as a double oval.
A value node (corresponding to each component of an additively separable Von Neumann-Morgenstern utility function) is drawn as an octagon (or diamond).
Arcs:
Functional arcs (ending in a value node) indicate that one of the components of the additively separable utility function is a function of all the nodes at their tails.
Conditional arcs (ending in an uncertainty node) indicate that the uncertainty at their heads is probabilistically conditioned on all the nodes at their tails.
Conditional arcs (ending in a deterministic node) indicate that the uncertainty at their heads is deterministically conditioned on all the nodes at their tails.
Informational arcs (ending in a decision node) indicate that the decision at their heads is made with the outcome of all the nodes at their tails known beforehand.
Given a properly structured ID:
Decision nodes and incoming informational arcs collectively state the alternatives (what can be done when the outcomes of certain decisions and/or uncertainties are known beforehand).
Uncertainty/deterministic nodes and incoming conditional arcs collectively model the information (what is known and its probabilistic/deterministic relationships).
Value nodes and incoming functional arcs collectively quantify the preference (how things are preferred over one another).
Alternative, information, and preference are termed the decision basis in decision analysis; they represent the three required components of any valid decision situation.
Formally, the semantics of an influence diagram is based on the sequential construction of nodes and arcs, which implies a specification of all conditional independencies in the diagram. The specification is defined by the $d$-separation criterion of Bayesian networks. According to this semantics, every node is probabilistically independent of its non-successor nodes given the outcome of its immediate predecessor nodes. Likewise, a missing arc between non-value node $X$ and non-value node $Y$ implies that there exists a set of non-value nodes $Z$, e.g., the parents of $Y$, that renders $Y$ independent of $X$ given the outcome of the nodes in $Z$.
== Example ==
Consider the simple influence diagram representing a situation where a decision-maker is planning their vacation.
There is 1 decision node (Vacation Activity), 2 uncertainty nodes (Weather Condition, Weather Forecast), and 1 value node (Satisfaction). There are 2 functional arcs (ending in Satisfaction), 1 conditional arc (ending in Weather Forecast), and 1 informational arc (ending in Vacation Activity).
The functional arcs ending in Satisfaction indicate that Satisfaction is a utility function of Weather Condition and Vacation Activity. In other words, the decision-maker's satisfaction can be quantified if they know what the weather is like and what their choice of activity is. (Note that they do not value Weather Forecast directly.)
The conditional arc ending in Weather Forecast indicates their belief that Weather Forecast and Weather Condition can be dependent.
The informational arc ending in Vacation Activity indicates that they will know only Weather Forecast, not Weather Condition, when making their choice. In other words, the actual weather will be known after they make their choice, and the forecast is all they can count on at this stage. It also follows semantically, for example, that Vacation Activity is independent of (irrelevant to) Weather Condition given that Weather Forecast is known.
== Applicability to value of information ==
The above example highlights the power of the influence diagram in representing an extremely important concept in decision analysis known as the value of information. Consider the following three scenarios:
Scenario 1: The decision-maker could make their Vacation Activity decision while knowing what Weather Condition will be like. This corresponds to adding an extra informational arc from Weather Condition to Vacation Activity in the above influence diagram.
Scenario 2: The original influence diagram as shown above.
Scenario 3: The decision-maker makes their decision without even knowing the Weather Forecast. This corresponds to removing the informational arc from Weather Forecast to Vacation Activity in the above influence diagram.
Scenario 1 is the best possible scenario for this decision situation, since there is no longer any uncertainty about what they care about (Weather Condition) when making their decision. Scenario 3, however, is the worst possible scenario for this decision situation, since they need to make their decision without any hint (Weather Forecast) of how what they care about (Weather Condition) will turn out. The decision-maker is usually better off (and definitely no worse off, on average) moving from scenario 3 to scenario 2 through the acquisition of new information. The most they should be willing to pay for such a move is called the value of information on Weather Forecast, which is essentially the value of imperfect information on Weather Condition. The applicability of this simple ID and the value of information concept is tremendous, especially in medical decision making, where most decisions have to be made with imperfect information about patients, diseases, etc.
== Related concepts ==
Influence diagrams are hierarchical and can be defined either in terms of their structure or in greater detail in terms of the functional and numerical relation between diagram elements. An ID that is consistently defined at all levels (structure, function, and number) is a well-defined mathematical representation and is referred to as a well-formed influence diagram (WFID). WFIDs can be evaluated using reversal and removal operations to yield answers to a large class of probabilistic, inferential, and decision questions. More recent techniques have been developed by artificial intelligence researchers concerning Bayesian network inference (belief propagation). An influence diagram having only uncertainty nodes (i.e., a Bayesian network) is also called a relevance diagram. An arc connecting node A to B implies not only that "A is relevant to B", but also that "B is relevant to A" (i.e., relevance is a symmetric relationship).
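Returning to the vacation example and its three information scenarios, the comparison can be made concrete with a small calculation. The sketch below is illustrative only: the probabilities, the utilities, and the two activities ("beach", "museum") are assumptions invented for this example and do not come from the text.

```python
# Hypothetical numbers for the vacation example; none of them come from the text.
P_WEATHER = {"sunny": 0.7, "rainy": 0.3}

# P(forecast | weather): the conditional arc Weather Condition -> Weather Forecast.
P_FORECAST_GIVEN_WEATHER = {
    "sunny": {"says_sunny": 0.8, "says_rainy": 0.2},
    "rainy": {"says_sunny": 0.1, "says_rainy": 0.9},
}

# Satisfaction(weather, activity): the two functional arcs into the value node.
UTILITY = {
    ("sunny", "beach"): 100, ("sunny", "museum"): 40,
    ("rainy", "beach"): 0, ("rainy", "museum"): 60,
}
ACTIVITIES = ("beach", "museum")


def best_expected_utility(belief):
    """Expected utility of the best activity under a belief over the weather."""
    return max(
        sum(p * UTILITY[(w, a)] for w, p in belief.items()) for a in ACTIVITIES
    )


def eu_no_information():
    """Scenario 3: decide using only the prior over Weather Condition."""
    return best_expected_utility(P_WEATHER)


def eu_with_forecast():
    """Scenario 2: observe Weather Forecast, then decide on the posterior."""
    total = 0.0
    for f in ("says_sunny", "says_rainy"):
        joint = {w: P_WEATHER[w] * P_FORECAST_GIVEN_WEATHER[w][f] for w in P_WEATHER}
        p_f = sum(joint.values())
        posterior = {w: joint[w] / p_f for w in joint}
        total += p_f * best_expected_utility(posterior)
    return total


def eu_perfect_information():
    """Scenario 1: Weather Condition is known before deciding."""
    return sum(
        p * max(UTILITY[(w, a)] for a in ACTIVITIES) for w, p in P_WEATHER.items()
    )


value_of_forecast = eu_with_forecast() - eu_no_information()
value_of_perfect_information = eu_perfect_information() - eu_no_information()
```

With these assumed numbers the forecast is worth about 7.8 utility units and perfect information 18, which matches the ordering described above: scenario 3 is never better than scenario 2, and scenario 2 is never better than scenario 1.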

    Read more →
  • Baby Bundle (app)

    Baby Bundle (app)

    Baby Bundle is a parenting mobile app for iPhone and iPad. It was designed to help new parents through pregnancy and the first two years of parenthood. Developed in collaboration with medical experts, it helps track and record the child's development and growth, offers parental advice, manages vaccinations and health check-ups, stores photos and provides baby monitoring services.
== History ==
Baby Bundle was founded in the United Kingdom by brothers Nick and Anthony von Christierson. Each worked in investment banking prior to developing Baby Bundle, Nick at Greenhill & Co., and Anthony at Goldman Sachs. The idea for the app came when a friend's wife voiced her frustration over having multiple parenting apps on her smartphone. Nick and Anthony left their jobs to create a single app that would include all those features. They conducted market research by interviewing more than 500 parents in the UK and US. It took them a year to build the app, which was named by their mother. Looking for endorsement, they first went to the US in 2013 and partnered with parenting expert and pediatrician Dr. Jennifer Trachtenberg. Baby Bundle was launched in the US and Canadian App Stores in April 2014. In the same month, it became the #1 parenting app in iTunes and was featured by Apple as the #1 Editor's pick across all categories. Mashable called it one of the "Top 5 Can’t Miss Apps." Baby Bundle raised a $1.8m seed round in March 2015 to fund development. The money came from a range of angel investors from across the US, UK and Asia. The von Christierson brothers have signed a deal to co-brand the app in the Middle East and expect to launch in Europe and Africa.
== Features ==
Baby Bundle is an app for both the iPhone and iPad and provides smart monitoring tools and trackers for pregnancy and child development. It acts as a growth and daily activity tracker, offers parental advice, and manages vaccinations and health check-ups.
It has a parenting guide with tips and advice on what to expect when the baby arrives. An interactive forum also lets parents ask questions of others in the community. The app is free and also includes paid premium features, such as the ability to turn two iPhones into a baby monitor, a cloud service to share the child's data with a spouse, and the ability to store data on more than one baby.

    Read more →
  • Self-organizing map

    Self-organizing map

    A self-organizing map (SOM) or self-organizing feature map (SOFM) is an unsupervised machine learning technique used to produce a low-dimensional (typically two-dimensional) representation of a higher-dimensional data set while preserving the topological structure of the data. For example, a data set with $p$ variables measured in $n$ observations could be represented as clusters of observations with similar values for the variables. These clusters then could be visualized as a two-dimensional "map" such that observations in proximal clusters have more similar values than observations in distal clusters. This can make high-dimensional data easier to visualize and analyze. A SOM is a type of artificial neural network but is trained using competitive learning rather than the error-correction learning (e.g., backpropagation with gradient descent) used by other artificial neural networks. The SOM was introduced by the Finnish professor Teuvo Kohonen in the 1980s and therefore is sometimes called a Kohonen map or Kohonen network. The Kohonen map or network is a computationally convenient abstraction building on biological models of neural systems from the 1970s and morphogenesis models dating back to Alan Turing in the 1950s. SOMs create internal representations reminiscent of the cortical homunculus, a distorted representation of the human body, based on a neurological "map" of the areas and proportions of the human brain dedicated to processing sensory functions, for different parts of the body.
== Overview ==
Self-organizing maps, like most artificial neural networks, operate in two modes: training and mapping. First, training uses an input data set (the "input space") to generate a lower-dimensional representation of the input data (the "map space"). Second, mapping classifies additional input data using the generated map.
The goal of training is to represent an input space with p dimensions as a map space with n dimensions, where p > n. Specifically, an input space with p variables is said to have p dimensions. A map space consists of components called "nodes" or "neurons", which are arranged as a hexagonal or rectangular grid with two dimensions. The number of nodes and their arrangement are specified beforehand based on the larger goals of the analysis and exploration of the data. Each node in the map space is associated with a "weight" vector, which is the position of the node in the input space. While nodes in the map space stay fixed, training consists in moving weight vectors toward the input data (reducing a distance metric such as Euclidean distance) without spoiling the topology induced from the map space. After training, the map can be used to classify additional observations for the input space by finding the node with the closest weight vector (smallest distance metric) to the input space vector. == Learning algorithm == The goal of learning in the self-organizing map is to cause different parts of the network to respond similarly to certain input patterns. This is partly motivated by how visual, auditory or other sensory information is handled in separate parts of the cerebral cortex in the human brain. The weights of the neurons are initialized either to small random values or sampled evenly from the subspace spanned by the two largest principal component eigenvectors. With the latter alternative, learning is much faster because the initial weights already give a good approximation of SOM weights. The network must be fed a large number of example vectors that represent, as close as possible, the kinds of vectors expected during mapping. The examples are usually administered several times as iterations. The training utilizes competitive learning. When a training example is fed to the network, its Euclidean distance to all weight vectors is computed. 
The neuron whose weight vector is most similar to the input is called the best matching unit (BMU). The weights of the BMU and of neurons close to it in the SOM grid are adjusted towards the input vector. The magnitude of the change decreases with time and with the grid-distance from the BMU. The update formula for a neuron $v$ with weight vector $W_v(s)$ is
$$W_v(s+1) = W_v(s) + \theta(u,v,s) \cdot \alpha(s) \cdot (D(t) - W_v(s)),$$
where $s$ is the step index, $t$ is an index into the training sample, $u$ is the index of the BMU for the input vector $D(t)$, $\alpha(s)$ is a monotonically decreasing learning coefficient, and $\theta(u,v,s)$ is the neighborhood function, which scales the update according to the grid distance between neuron $u$ and neuron $v$ at step $s$. Depending on the implementation, $t$ can scan the training data set systematically ($t$ is $0, 1, 2, \ldots, T-1$, then repeats, $T$ being the training sample's size), be randomly drawn from the data set (bootstrap sampling), or follow some other sampling method (such as jackknifing). The neighborhood function $\theta(u,v,s)$ (also called the function of lateral interaction) depends on the grid-distance between the BMU (neuron $u$) and neuron $v$. In the simplest form, it is 1 for all neurons close enough to the BMU and 0 for others, but the Gaussian and Mexican-hat functions are common choices, too. Regardless of the functional form, the neighborhood function shrinks with time. At the beginning, when the neighborhood is broad, the self-organizing takes place on the global scale. When the neighborhood has shrunk to just a couple of neurons, the weights converge to local estimates. In some implementations, the learning coefficient $\alpha$ and the neighborhood function $\theta$ decrease steadily with increasing $s$; in others (in particular those where $t$ scans the training data set) they decrease in step-wise fashion, once every $T$ steps.
This process is repeated for each input vector for a (usually large) number of cycles $\lambda$. The network winds up associating output nodes with groups or patterns in the input data set. If these patterns can be named, the names can be attached to the associated nodes in the trained net. During mapping, there will be one single winning neuron: the neuron whose weight vector lies closest to the input vector. This can be simply determined by calculating the Euclidean distance between input vector and weight vector. While representing input data as vectors has been emphasized in this article, any kind of object which can be represented digitally, which has an appropriate distance measure associated with it, and in which the necessary operations for training are possible can be used to construct a self-organizing map. This includes matrices, continuous functions or even other self-organizing maps.
=== Algorithm ===
1. Randomize the node weight vectors in a map.
2. For $s = 0, 1, 2, \ldots, \lambda$:
   a. Randomly pick an input vector $D(t)$.
   b. Find the node in the map closest to the input vector. This node is the best matching unit (BMU). Denote it by $u$.
   c. For each node $v$, update its vector by pulling it closer to the input vector:
$$W_v(s+1) = W_v(s) + \theta(u,v,s) \cdot \alpha(s) \cdot (D(t) - W_v(s))$$
The variable names mean the following: $s$ is the current iteration; $\lambda$ is the iteration limit; $t$ is the index of the target input data vector in the input data set $D$; $D(t)$ is a target input data vector; $v$ is the index of the node in the map; $W_v$ is the current weight vector of node $v$; $u$ is the index of the best matching unit (BMU) in the map; $\theta(u,v,s)$ is the neighbourhood function; $\alpha(s)$ is the learning rate schedule. The key design choices are the shape of the SOM, the neighbourhood function, and the learning rate schedule. The idea of the neighborhood function is to make it such that the BMU is updated the most, its immediate neighbors are updated a little less, and so on. The idea of the learning rate schedule is to make it so that the map updates are large at the start and gradually stop. For example, if we want to learn a SOM using a square grid, we can index it using $(i,j)$ where both $i, j \in 1:N$. The neighborhood function can make it so that the BMU updates in full, the nearest neighbors update in half, their neighbors update in half again, and so on:
$$\theta((i,j),(i',j'),s) = \frac{1}{2^{|i-i'|+|j-j'|}} = \begin{cases} 1 & \text{if } i=i',\ j=j' \\ 1/2 & \text{if } |i-i'|+|j-j'| = 1 \\ 1/4 & \text{if } |i-i'|+|j-j'| = 2 \\ \cdots & \cdots \end{cases}$$
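The training loop and the halving neighborhood function described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names, the grid size, the linearly decaying learning rate, and the 0-based grid indices are all assumptions made for the example, and the neighborhood function is kept independent of the step $s$, as in the example formula.

```python
import random


def neighborhood(u, v):
    """Halving neighborhood: 1 / 2**(Manhattan grid distance between u and v)."""
    (i, j), (i2, j2) = u, v
    return 1.0 / 2 ** (abs(i - i2) + abs(j - j2))


def best_matching_unit(weights, x):
    """Grid index of the node whose weight vector is closest to x.

    Squared Euclidean distance is used; it has the same minimizer as the
    Euclidean distance.
    """
    return min(
        weights,
        key=lambda node: sum((w - xi) ** 2 for w, xi in zip(weights[node], x)),
    )


def train_som(data, n=4, steps=2000, alpha0=0.5, seed=0):
    """Train an n-by-n SOM on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Step 1: randomize the node weight vectors.
    weights = {
        (i, j): [rng.random() for _ in range(dim)]
        for i in range(n)
        for j in range(n)
    }
    for s in range(steps):
        x = rng.choice(data)                 # randomly pick an input vector D(t)
        u = best_matching_unit(weights, x)   # find the BMU
        alpha = alpha0 * (1.0 - s / steps)   # assumed schedule: linear decay to 0
        for v, w in weights.items():
            theta = neighborhood(u, v)
            for k in range(dim):
                # W_v(s+1) = W_v(s) + theta * alpha * (D(t) - W_v(s))
                w[k] += theta * alpha * (x[k] - w[k])
    return weights
```

After training, mapping a new observation amounts to calling `best_matching_unit(weights, x)` and reading off the returned grid position.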

    Read more →
  • Training, validation, and test data sets

    Training, validation, and test data sets

    In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data. These input data used to build the model are usually divided into multiple data sets. In particular, three data sets are commonly used in different stages of the creation of the model: training, validation, and testing sets. The model is initially fit on a training data set, which is a set of examples used to fit the parameters (e.g. weights of connections between neurons in artificial neural networks) of the model. The model (e.g. a naive Bayes classifier) is trained on the training data set using a supervised learning method, for example using optimization methods such as gradient descent or stochastic gradient descent. In practice, the training data set often consists of pairs of an input vector (or scalar) and the corresponding output vector (or scalar), where the answer key is commonly denoted as the target (or label). The current model is run with the training data set and produces a result, which is then compared with the target, for each input vector in the training data set. Based on the result of the comparison and the specific learning algorithm being used, the parameters of the model are adjusted. The model fitting can include both variable selection and parameter estimation. Successively, the fitted model is used to predict the responses for the observations in a second data set called the validation data set. The validation data set provides an unbiased evaluation of a model fit on the training data set while tuning the model's hyperparameters (e.g. the number of hidden units—layers and layer widths—in a neural network). 
Validation data sets can be used for regularization by early stopping (stopping training when the error on the validation data set increases, as this is a sign of over-fitting to the training data set). This simple procedure is complicated in practice by the fact that the validation data set's error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad-hoc rules for deciding when over-fitting has truly begun. Finally, the test data set is a data set used to provide an unbiased evaluation of a model fit on the training data set. When the data in the test data set has never been used (for example in cross-validation), the test data set is called a holdout data set. The term "validation set" is sometimes used instead of "test set" in some literature (e.g., if the original data set was partitioned into only two subsets, the test set might be referred to as the validation set). Deciding the sizes and strategies for data set division in training, test and validation sets is very dependent on the problem and data available. == Training data set == A training data set is a data set of examples used during the learning process and is used to fit the parameters (e.g., weights) of, for example, a classifier. For classification tasks, a supervised learning algorithm looks at the training data set to determine, or learn, the optimal combinations of variables that will generate a good predictive model. The goal is to produce a trained (fitted) model that generalizes well to new, unknown data. The fitted model is evaluated using “new” examples from the held-out data sets (validation and test data sets) to estimate the model’s accuracy in classifying new data. To reduce the risk of issues such as over-fitting, the examples in the validation and test data sets should not be used to train the model. 
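The three-way partition described above can be sketched as follows. This is a minimal illustration under assumed choices: the function name, the 70/15/15 proportions, and the fixed random seed are hypothetical and not prescribed by the text, which notes that split sizes depend on the problem and the data available.

```python
import random


def three_way_split(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the examples once and cut them into disjoint train/val/test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_fraction)
    n_val = round(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]   # everything not held out is used for fitting
    return train, val, test


train, val, test = three_way_split(range(100))
```

Because the three lists are cut from a single shuffled copy, no example can appear in more than one of them, which is the property the held-out sets rely on.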
Most approaches that search through training data for empirical relationships tend to overfit the data, meaning that they can identify and exploit apparent relationships in the training data that do not hold in general. When a training set is continuously expanded with new data, then this is incremental learning. == Validation data set == A validation data set is a data set of examples used to tune the hyperparameters (i.e. the architecture) of a model. It is sometimes also called the development set or the "dev set". An example of a hyperparameter for artificial neural networks includes the number of hidden units in each layer. It, as well as the testing set (as mentioned below), should follow the same probability distribution as the training data set. In order to avoid overfitting, when any classification parameter needs to be adjusted, it is necessary to have a validation data set in addition to the training and test data sets. For example, if the most suitable classifier for the problem is sought, the training data set is used to train the different candidate classifiers, the validation data set is used to compare their performances and decide which one to take and, finally, the test data set is used to obtain the performance characteristics such as accuracy, sensitivity, specificity, F-measure, and so on. The validation data set functions as a hybrid: it is training data used for testing, but neither as part of the low-level training nor as part of the final testing. The basic process of using a validation data set for model selection (as part of training data set, validation data set, and test data set) is: Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. 
The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set. An application of this process is in early stopping, where the candidate models are successive iterations of the same network, and training stops when the error on the validation set grows, choosing the previous model (the one with minimum error). == Test data set == A test data set is a data set that is independent of the training data set, but that follows the same probability distribution as the training data set. A test set is therefore a set of examples used only to assess the performance (i.e. generalization) of a specified classifier on unseen data. To do this, the model is used to predict classifications of examples in the test set. Those predictions are compared to the examples' true classifications to assess the model's accuracy. If a model fit to the training and validation data set also fits the test data set well, minimal overfitting has taken place (see figure below). A better fitting of the training or validation data sets as opposed to the test data set usually points to overfitting. In the scenario where a data set has a low number of samples, it is usually partitioned into a training set and a validation data set, where the model is trained on the training set and refined using the validation set to improve accuracy, but this approach will lead to overfitting. The holdout method can also be employed, where the test set is used at the end, after training on the training set. Other techniques, such as cross-validation and bootstrapping, are used on small data sets. 
The bootstrap method generates numerous simulated data sets of the same size by randomly sampling with replacement from the original data, allowing the data points that were not drawn to serve as test sets for evaluating model performance. Cross-validation splits the data set into multiple folds, with a single fold used as test data; the model is trained on the remaining folds, and the procedure is repeated for every fold (with results averaged and models consolidated) to estimate final model performance. Note that some sources advise against using a single split, as it can lead to overfitting as well as biased model performance estimates. For this reason, data sets are split into three partitions: training, validation and test data sets. The standard machine learning practice is to train on the training set and tune hyperparameters using the validation set, where the validation process selects the model with the lowest validation loss, which is then tested on the test data set (normally held out) to assess the final model. The holdout method for the test set reduces computation by avoiding using the test set after each epoch. The test data set should never be used for validating the training model or fine-tuning hyperparameters, because its role is to provide an accurate and honest evaluation of the model's final performance on unseen data.
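The fold rotation used in cross-validation can be sketched with index lists alone. This is an illustrative sketch, not a library API: `k_fold_indices`, `cross_validate`, and the `fit_and_score` callback are hypothetical names, and the folds are taken as contiguous blocks, so the data should be shuffled beforehand if its order is meaningful.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def cross_validate(n_samples, k, fit_and_score):
    """Average the score returned by fit_and_score(train, test) over all folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

Here `fit_and_score` stands for whatever routine trains a model on the training indices and returns its score on the test indices; every sample is used for testing exactly once across the k rounds.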

    Read more →
  • Policy gradient method

    Policy gradient method

    Policy gradient methods are a class of reinforcement learning algorithms and a sub-class of policy optimization methods. Unlike value-based methods, which learn a value function to derive a policy, policy optimization methods directly learn a policy function $\pi$ that selects actions without consulting a value function. For policy gradient to apply, the policy function $\pi_\theta$ is parameterized by a parameter $\theta$ and must be differentiable with respect to it.
== Overview ==
In policy-based RL, the actor is a parameterized policy function $\pi_\theta$, where $\theta$ are the parameters of the actor. The actor takes as argument the state of the environment $s$ and produces a probability distribution $\pi_\theta(\cdot \mid s)$. If the action space is discrete, then $\sum_a \pi_\theta(a \mid s) = 1$. If the action space is continuous, then $\int_a \pi_\theta(a \mid s)\,\mathrm{d}a = 1$. The goal of policy optimization is to find some $\theta$ that maximizes the expected episodic reward $J(\theta)$:
$$J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t R_t \,\Big|\, S_0 = s_0\right]$$
where $\gamma$ is the discount factor, $R_t$ is the reward at step $t$, $s_0$ is the starting state, and $T$ is the time-horizon (which can be infinite). The policy gradient is defined as $\nabla_\theta J(\theta)$. Different policy gradient methods stochastically estimate the policy gradient in different ways. The goal of any policy gradient method is to iteratively maximize $J(\theta)$ by gradient ascent.
Since the key part of any policy gradient method is the stochastic estimation of the policy gradient, they are also studied under the title of "Monte Carlo gradient estimation".
== REINFORCE ==
=== Policy gradient ===
The REINFORCE algorithm, introduced by Ronald J. Williams in 1992, was the first policy gradient method. It is based on the identity for the policy gradient
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \sum_{\tau=0}^{T} \gamma^\tau R_\tau \,\Big|\, S_0 = s_0\right]$$
which can be improved via the "causality trick"
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \sum_{\tau=t}^{T} \gamma^\tau R_\tau \,\Big|\, S_0 = s_0\right]$$
Thus, we have an unbiased estimator of the policy gradient:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_{t,n} \mid S_{t,n}) \sum_{\tau=t}^{T} \gamma^\tau R_{\tau,n}\right]$$
where the index $n$ ranges over $N$ rollout trajectories using the policy $\pi_\theta$. The score function $\nabla_\theta \ln \pi_\theta(A_t \mid S_t)$ can be interpreted as the direction in the parameter space that increases the probability of taking action $A_t$ in state $S_t$.
The policy gradient, then, is a weighted average of all possible directions to increase the probability of taking any action in any state, but weighted by reward signals, so that if taking a certain action in a certain state is associated with high reward, then that direction would be highly reinforced, and vice versa.
=== Algorithm ===
The REINFORCE algorithm is a loop:
1. Roll out $N$ trajectories in the environment, using $\pi_{\theta_i}$ as the policy function.
2. Compute the policy gradient estimate:
$$g_i \leftarrow \frac{1}{N} \sum_{n=1}^{N} \left[\sum_{t=0}^{T} \nabla_{\theta_i} \ln \pi_{\theta_i}(A_{t,n} \mid S_{t,n}) \sum_{\tau=t}^{T} \gamma^\tau R_{\tau,n}\right]$$
3. Update the policy by gradient ascent: $\theta_{i+1} \leftarrow \theta_i + \alpha_i g_i$
Here, $\alpha_i$ is the learning rate at update step $i$.
== Variance reduction ==
REINFORCE is an on-policy algorithm, meaning that the trajectories used for the update must be sampled from the current policy $\pi_\theta$. This can lead to high variance in the updates, as the returns can vary significantly between trajectories. Many variants of REINFORCE have been introduced, under the title of variance reduction.
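The REINFORCE loop given above (roll out, estimate the gradient, ascend) can be sketched on a toy problem. Everything specific here is an assumption for illustration: the environment has a single state, the policy is a softmax over action preferences, and the function names, step sizes, and rollout counts are hypothetical. The reward-to-go uses the weighting $\gamma^\tau$, matching the identities in the text.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of action preferences."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(theta, action):
    """Score function of a softmax policy: one_hot(action) - pi."""
    pi = softmax(theta)
    return [(1.0 if a == action else 0.0) - pi[a] for a in range(len(theta))]


def discounted_rewards_to_go(rewards, gamma):
    """G[t] = sum over tau >= t of gamma**tau * rewards[tau]."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += gamma ** t * rewards[t]
        out[t] = running
    return out


def reinforce(reward_of_action, n_actions=2, iterations=200, n_rollouts=16,
              n_steps=3, gamma=0.9, alpha=0.5, seed=0):
    """Run REINFORCE in a single-state environment and return the final theta."""
    rng = random.Random(seed)
    theta = [0.0] * n_actions
    for _ in range(iterations):
        g = [0.0] * n_actions
        for _ in range(n_rollouts):
            # Step 1: roll out one trajectory with the current policy.
            pi = softmax(theta)
            actions = rng.choices(range(n_actions), weights=pi, k=n_steps)
            rewards = [reward_of_action(a) for a in actions]
            # Step 2: accumulate score * reward-to-go, averaged over rollouts.
            to_go = discounted_rewards_to_go(rewards, gamma)
            for a, ret in zip(actions, to_go):
                score = grad_log_pi(theta, a)
                for k in range(n_actions):
                    g[k] += score[k] * ret / n_rollouts
        # Step 3: gradient ascent with a constant learning rate.
        theta = [th + alpha * gk for th, gk in zip(theta, g)]
    return theta
```

Calling `reinforce(lambda a: 1.0 if a == 1 else 0.0)` should shift most of the probability mass of `softmax(theta)` onto action 1, since only that action is rewarded in this toy setup.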
=== REINFORCE with baseline ===
A common way of reducing variance is the REINFORCE with baseline algorithm, based on the following identity:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \left(\sum_{\tau=t}^{T} \gamma^\tau R_\tau - b(S_t)\right) \Big|\, S_0 = s_0\right]$$
for any function $b : \text{States} \to \mathbb{R}$. This can be proven by applying the previous lemma. The algorithm uses the modified gradient estimator
$$g_i \leftarrow \frac{1}{N} \sum_{n=1}^{N} \left[\sum_{t=0}^{T} \nabla_{\theta_i} \ln \pi_{\theta_i}(A_{t,n} \mid S_{t,n}) \left(\sum_{\tau=t}^{T} \gamma^\tau R_{\tau,n} - b_i(S_{t,n})\right)\right]$$
and the original REINFORCE algorithm is the special case where $b_i \equiv 0$.
=== Actor-critic methods ===
If $b_i$ is chosen well, such that $b_i(S_t) \approx \sum_{\tau=t}^{T} \gamma^\tau R_\tau = \gamma^t V^{\pi_{\theta_i}}(S_t)$, this could significantly decrease variance in the gradient estimation.
That is, the baseline should be as close to the value function $V^{\pi_{\theta_i}}(S_t)$ as possible, approaching the ideal of:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \left(\sum_{\tau=t}^{T} \gamma^\tau R_\tau - \gamma^t V^{\pi_\theta}(S_t)\right) \Big|\, S_0 = s_0\right]$$
Note that, as the policy $\pi_{\theta_i}$ updates, the value function $V^{\pi_{\theta_i}}(S_t)$ updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the actor-critic methods, where the policy function is the actor and the value function is the critic. The Q-function $Q^\pi$ can also be used as the critic, since
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq t \leq T} \gamma^t \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \cdot Q^{\pi_\theta}(S_t, A_t) \,\Big|\, S_0 = s_0\right]$$
by a similar argument using the tower law.
Subtracting the value function as a baseline, we find that the advantage function $A^\pi(S,A) = Q^\pi(S,A) - V^\pi(S)$ can be used as the critic as well:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq t \leq T} \gamma^t \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \cdot A^{\pi_\theta}(S_t, A_t) \,\Big|\, S_0 = s_0\right]$$
In summary, there are many unbiased estimators for $\nabla_\theta J(\theta)$, all in the form of:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{0 \leq t \leq T} \nabla_\theta \ln \pi_\theta(A_t \mid S_t) \cdot \Psi_t \,\Big|\, S_0 = s_0\right]$$

    Read more →
  • Data-centric AI

    Data-centric AI

    Data-centric AI is an approach within artificial intelligence that emphasizes improving the quality, consistency and representativeness of the data used to train machine learning models, rather than focusing primarily on optimizing model architectures or algorithms. This idea has gained traction as researchers and practitioners have come to believe that many performance limitations of machine learning systems stem from issues such as noisy labels, biased datasets, and lack of coverage in the data. Data-centric AI involves a disciplined approach to data cleaning, augmentation, labeling, and governance that improves model performance and reliability in applications such as computer vision, natural language processing, and others.

    Read more →
  • VIGRA

    VIGRA

    VIGRA is the abbreviation for "Vision with Generic Algorithms". It is a free, open-source computer vision library that focuses on customizable algorithms and data structures. VIGRA components can be easily adapted to the specific needs of a target application without compromising execution speed, by using template techniques similar to those in the C++ Standard Template Library.

    == Features ==
    VIGRA is cross-platform, with working builds on Microsoft Windows, Mac OS X, Linux, and OpenBSD. Since version 1.7.1, VIGRA has provided Python bindings based on the NumPy framework.

    == History ==
    VIGRA was originally designed and implemented by scientists at the University of Hamburg's faculty of computer science; its core maintainers now work at the Heidelberg Collaboratory for Image Processing (HCI) at the University of Heidelberg. In the meantime, many developers have contributed to the project.

    == Application ==
    CellCognition and ilastik use the VIGRA computer vision library. OpenOffice.org uses VIGRA as part of its headless software rendering backend; LibreOffice did so until version 5.2.

    Read more →
  • Weka (software)

    Weka (software)

    Waikato Environment for Knowledge Analysis (Weka) is a collection of free machine learning and data analysis software licensed under the GNU General Public License. It was developed at the University of Waikato, New Zealand, and is the companion software to the book "Data Mining: Practical Machine Learning Tools and Techniques".

    == Description ==
    Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to these functions. The original non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party) modeling algorithms implemented in other programming languages, plus data preprocessing utilities in C and a makefile-based system for running machine learning experiments. This original version was primarily designed as a tool for analyzing data from agricultural domains, but the more recent fully Java-based version (Weka 3), for which development started in 1997, is now used in many different application areas, in particular for educational purposes and research. Advantages of Weka include: free availability under the GNU General Public License; portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform; a comprehensive collection of data preprocessing and modeling techniques; and ease of use due to its graphical user interfaces. Weka supports several standard data mining tasks, specifically data preprocessing, clustering, classification, regression, visualization, and feature selection. Input to Weka is expected to be formatted according to the Attribute-Relation File Format (ARFF), with the filename bearing the .arff extension.
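To make the input format concrete, here is a minimal Python sketch that renders a small table as ARFF text: an `@relation` line, one `@attribute` line per column (either `numeric` or a nominal value set in braces), and the rows after `@data`. The helper `to_arff` and the toy "weather" data are hypothetical. The sketch ignores quoting of names with spaces, missing values, and other attribute types.

```python
def to_arff(relation, attributes, rows):
    """Render a flat table as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        # Nominal attributes list their allowed values inside braces.
        spec = kind if isinstance(kind, str) else "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    # Each data point is one comma-separated row with a fixed number of fields.
    lines += [",".join(str(value) for value in row) for row in rows]
    return "\n".join(lines) + "\n"

arff_text = to_arff(
    "weather",
    [("temperature", "numeric"),
     ("outlook", ["sunny", "rainy"]),
     ("play", ["yes", "no"])],
    [(29.5, "sunny", "no"), (18.0, "rainy", "yes")],
)
print(arff_text)
```

Saving `arff_text` to a file such as `weather.arff` would give a file with the expected extension. Each row carries the same fixed set of attributes.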
    All of Weka's techniques are predicated on the assumption that the data is available as one flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or nominal attributes, but some other attribute types are also supported). Weka provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query. Weka provides access to deep learning with Deeplearning4j. It is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table that is suitable for processing using Weka. Another important area that is currently not covered by the algorithms included in the Weka distribution is sequence modeling.

    == Extension packages ==
    In version 3.7.2, a package manager was added to allow easier installation of extension packages. Some functionality that used to be included with Weka prior to this version has since been moved into such extension packages, but this change also makes it easier for others to contribute extensions to Weka and to maintain the software, as this modular architecture allows independent updates of the Weka core and individual extensions.

    == History ==
    In 1993, the University of Waikato in New Zealand began development of the original version of Weka, which became a mix of Tcl/Tk, C, and makefiles. In 1997, the decision was made to redevelop Weka from scratch in Java, including implementations of modeling algorithms. In 2005, Weka received the SIGKDD Data Mining and Knowledge Discovery Service Award. In 2006, Pentaho Corporation acquired an exclusive licence to use Weka for business intelligence. It forms the data mining and predictive analytics component of the Pentaho business intelligence suite. Pentaho has since been acquired by Hitachi Vantara, and Weka now underpins the PMI (Plugin for Machine Intelligence) open source component.
    == Related tools ==
    Auto-WEKA is an automated machine learning system for Weka. Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI) is a similar project to Weka with a focus on cluster analysis, i.e., unsupervised methods. H2O.ai is an open-source data science and machine learning platform. KNIME is machine learning and data mining software implemented in Java. Massive Online Analysis (MOA) is an open-source project for large-scale mining of data streams, also developed at the University of Waikato in New Zealand. Neural Designer is data mining software based on deep learning techniques, written in C++. Orange is a similar open-source project for data mining, machine learning, and visualization based on scikit-learn. RapidMiner is a commercial machine learning framework implemented in Java which integrates Weka. scikit-learn is a popular machine learning library in Python.

    Read more →