AI For Business Ualbany

AI For Business Ualbany — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Termcap

    Termcap

    Termcap (terminal capability) is a legacy software library and database used on Unix-like computers that enables programs to use display computer terminals in a terminal-independent manner, which greatly simplifies the process of writing portable text mode applications. It was superseded by the terminfo database used by ncurses, tput, and other programs. A termcap database can describe the capabilities of hundreds of different display terminals. This allows programs to have character-based display output, independent of the type of terminal. On-screen text editors such as vi and Emacs are examples of programs that may use termcap. Other programs are listed in the Termcap category. Access to the termcap database was usually provided by separate libraries, e.g. GNU Termcap. Examples of what the database describes: how many columns wide the display is what string to send to move the cursor to an arbitrary position (including how to encode the row and column numbers) how to scroll the screen up one or several lines how much padding is needed for such a scrolling operation. == History == Bill Joy wrote the first termcap library in 1978 for the Berkeley Unix operating system; it has since been ported to most Unix and Unix-like environments, even OS-9. Joy's design was reportedly influenced by the design of the terminal data store in the earlier Incompatible Timesharing System. == Data model == Termcap databases consist of one or more descriptions of terminals. === Indices === Each description must contain the canonical name of the terminal. It may also contain one or more aliases for the name of the terminal. The canonical name or aliases are the keys by which the library searches the termcap database. === Data values === The description contains one or more capabilities, which have conventional names. The capabilities are typed: boolean, numeric and string. The termcap library has no predetermined type for each capability name. It determines the types of each capability by the syntax: string capabilities have an "=" between the capability name and its value, numeric capabilities have a "#" between the capability name and its value, and boolean capabilities have no associated value (they are always true if specified). Applications which use termcap do expect specific types for the commonly used capabilities, and obtain the values of capabilities from the termcap database using library calls that return successfully only when the database contents matches the assumed type. === Hierarchy === Termcap descriptions can be constructed by including the contents of one description in another, suppressing capabilities from the included description or overriding or adding capabilities. No matter what storage model is used, the termcap library constructs the terminal description from the requested description, including, suppressing or overriding at the time of the request. == Storage model == Termcap data is stored as text, making it simple to modify. The text can be retrieved by the termcap library from files or environment variables. === Environment variables === The TERM environment variable contains the terminal type name. The TERMCAP environment variable may contain a termcap database. It is most often used to store a single termcap description, set by a terminal emulator to provide the terminal's characteristics to the shell and dependent programs. The TERMPATH environment variable is supported by newer termcap implementations and defines a search path for termcap files. === Flat file === The original (and most common) implementation of the termcap library retrieves data from a flat text file. Searching a large termcap file, e.g., 500 kB, can be slow. To aid performance, a utility such as reorder is used to put the most frequently used entries near the beginning of the file. === Hashed database === 4.4BSD based implementations of termcap store the terminal description in a hashed database (e.g., something like Berkeley DB version 1.85). These store two types of records: aliases which point to the canonical entry, and the canonical entry itself. The text of the termcap entry is stored literally. == Limitations and extensions == The original termcap implementation was designed to use little memory: the first name is two characters, to fit in 16 bits capability names are two characters descriptions are limited to 1023 characters. only one termcap entry with its definitions can be included, and must be at the end. Newer implementations of the termcap interface generally do not require the two-character name at the beginning of the entry. Capability names are still two characters in all implementations. The tgetent function used to read the terminal description uses a buffer whose size must be large enough for the data, and is assumed to be 1024 characters. Newer implementations of the termcap interface may relax this constraint by allowing a null pointer in place of the fixed buffer, or by hiding the data which would not fit, e.g., via the ZZ capability in NetBSD termcap. The terminfo library interface also emulates the termcap interface, and does not actually use the fixed-size buffer. The terminfo library's emulation of termcap allows multiple other entries to be included without restricting the position. A few other newer implementations of the termcap library may also provide this ability, though it is not well documented. == Obsolete features == A special capability, the "hz" capability, was defined specifically to support the Hazeltine 1500 terminal, which had the unfortunate characteristic of using the ASCII tilde character ('~') as a control sequence introducer. In order to support that terminal, not only did code that used the database have to know about using the tilde to introduce certain control sequences, but it also had to know to substitute another printable character for any tildes in the displayed text, since a tilde in the text would be interpreted by the terminal as the start of a control sequence, resulting in missing text and screen garbling. Additionally, attribute markers (such as start and end of underlining) themselves took up space on the screen. Comments in the database source code often referred to this as "Hazeltine braindamage". Since the Hazeltine 1500 was a widely used terminal in the late 1970s, it was important for applications to be able to deal with its limitations.

    Read more →
  • Apache Mahout

    Apache Mahout

    Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily on linear algebra. In the past, many of the implementations use the Apache Hadoop platform, however today it is primarily focused on Apache Spark. Mahout also provides Java/Scala libraries for common math operations (focused on linear algebra and statistics) and primitive Java collections. Mahout is a work in progress; a number of algorithms have been implemented. == Features == === Samsara === Apache Mahout-Samsara refers to a Scala domain-specific language (DSL) that allows users to use R-like syntax as opposed to traditional Scala-like syntax. This allows user to express algorithms concisely and clearly. === Backend agnostic === Apache Mahout's code abstracts the domain-specific language from the engine where the code is run. While active development is done with the Apache Spark engine, users are free to implement any engine they choose- H2O and Apache Flink have been implemented in the past and examples exist in the code base. === GPU/CPU accelerators === The JVM has notoriously slow computation. To improve speed, "native solvers" were added which move in-core, and by extension, distributed BLAS operations out of the JVM, offloading to off-heap or GPU memory for processing via multiple CPUs and/or CPU cores, or GPUs when built against the ViennaCL library. ViennaCL is a highly optimized C++ library with BLAS operations implemented in OpenMP, and OpenCL. As of release 14.1, the OpenMP build considered to be stable, leaving the OpenCL build is still in its experimental proof-of-concept phase. === Recommenders === Apache Mahout features implementations of Alternating Least Squares, Co-Occurrence, and Correlated Co-Occurrence, a unique-to-Mahout recommender algorithm that extends co-occurrence to be used on multiple dimensions of data. == History == === Transition from Map Reduce to Apache Spark === While Mahout's core algorithms for clustering, classification and batch based collaborative filtering were implemented on top of Apache Hadoop using the map/reduce paradigm, it did not restrict contributions to Hadoop-based implementations. Contributions that run on a single node or on a non-Hadoop cluster were also welcomed. For example, the 'Taste' collaborative-filtering recommender component of Mahout was originally a separate project and can run stand-alone without Hadoop. Starting with the release 0.10.0, the project shifted its focus to building a backend-independent programming environment, code named "Samsara". The environment consists of an algebraic backend-independent optimizer and an algebraic Scala DSL unifying in-memory and distributed algebraic operators. Supported algebraic platforms are Apache Spark, H2O, and Apache Flink. Support for MapReduce algorithms started being gradually phased out in 2014. === Release history === === Developers === Apache Mahout is developed by a community. The project is managed by a group called the "Project Management Committee" (PMC). The current PMC is Andrew Musselman, Andrew Palumbo, Drew Farris, Isabel Drost-Fromm, Jake Mannix, Pat Ferrel, Paritosh Ranjan, Trevor Grant, Robin Anil, Sebastian Schelter, Stevo Slavić.

    Read more →
  • Operational taxonomic unit

    Operational taxonomic unit

    An operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals. The term was originally introduced in 1963 by Robert R. Sokal and Peter H. A. Sneath in the context of numerical taxonomy, where an "operational taxonomic unit" is simply the group of organisms currently being studied. In this sense, an OTU is a pragmatic definition to group individuals by similarity, equivalent to but not necessarily in line with classical Linnaean taxonomy or modern evolutionary taxonomy. Nowadays, however, the term is commonly used in a different context and refers to clusters of (uncultivated or unknown) organisms, grouped by DNA sequence similarity of a specific taxonomic marker gene (originally coined as mOTU; molecular OTU). In other words, OTUs are pragmatic proxies for "species" at different taxonomic levels, in the absence of traditional systems of biological classification as are available for macroscopic organisms. For several years, OTUs have been the most commonly used units of diversity, especially when analysing small subunit 16S (for prokaryotes) or 18S rRNA (for eukaryotes) marker gene sequence datasets. == Molecular OTU by clustering of marker gene sequences == In the approach represented by DNA barcoding, a particular locus is chosen to be used as the marker gene for classification. This locus should be universally present in the scope selected, variable enough to be different among close-related species, and be flanked by conservative sequences that allow for easy amplification and detection. There are databases containing sequences for such marker genes from many different species, allowing for comparison. (Sometimes only using one locus does not provide sufficient resolution, so multiple marker genes are used. This is the case for plants, where rbcL+matK is common.) Sequences obtained this way can be clustered according to their similarity to one another, and operational taxonomic units are defined based on the similarity threshold set by the researcher. The exact threshold depends on the taxa in question and the mutational rates of the selected locus in the taxon. 97–99% are commonly used, but "it is now recognized to be somewhat arbitrary as sequence variation within and among species varies across taxa". 100% similarity (fully identical) is also common, also known as single variants. It remains debatable how well this commonly used method recapitulates true microbial species phylogeny or ecology. Although OTUs can be calculated differently when using different algorithms or thresholds, research by Schmidt et al. (2014) demonstrated that 16S-derived microbial OTUs were generally ecologically consistent across habitats and several clustering approaches. The number of OTUs defined may be inflated due to errors in DNA sequencing. === OTU clustering approaches === There are three main approaches to clustering OTUs: De novo, for which the clustering is based on similarities between sequencing reads. Closed-reference, for which the clustering is performed against a reference database of sequences. Open-reference, where clustering is first performed against a reference database of sequences, then any remaining sequences that could not be mapped to the reference are clustered de novo. Using a reference provides taxonomic context for the OTUs found. Alternatively, taxonomic context can be found after the construction of clusters by comparing representative sequences from clusters against a reference database. There are also specialized classifiers for this purpose which are much faster than naive comparison using BLAST. === OTU clustering algorithms === Hierarchical clustering algorithms (HCA): uclust & cd-hit & ESPRIT Bayesian clustering: CROP == Molecular OTU by other methods == In addition to similarity-based grouping, marker gene sequences can be sorted into OTUs using molecular phylogeny, k-mer composition, or hybrid methods combining these methods with similarity. There are also Bayesian tree-less methods and machine learning approaches. Using phylogeny often involves manually assigning terminal clades or single nodes to an OTU, so this is usually only done for refinement. Genome skimming can be used to obtain high-copy DNA without the need to choose marker genes or to design PCR primers for the chosen genes. It can provide fairly good coverage of organelle DNA and repetitive elements such as ribosomal DNA, both of which can be used like marker genes in OTU analysis. Whole-genome sequencing is more expensive and involves the production and processing of more data. By considering the entire genome, many (sometimes over 100) marker genes can be used at the same time, producing highly resolved phylogenies that correctly identify problematic taxa. It is also possible to use entire genomes for OTU assignment. For example, genomes from different bacterial species almost always have an average nucleotide identity lower than 95%, a fact that can be used to define new OTUs (and likely new species).

    Read more →
  • Kernel method

    Kernel method

    In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). These methods involve using linear classifiers to solve nonlinear problems. The general task of pattern analysis is to find and study general types of relations (for example clusters, rankings, principal components, correlations, classifications) in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into feature vector representations via a user-specified feature map: in contrast, kernel methods require only a user-specified kernel, i.e., a similarity function over all pairs of data points computed using inner products. The feature map in kernel machines is infinite dimensional but only requires a finite dimensional matrix from user-input according to the representer theorem. Kernel machines are slow to compute for datasets larger than a couple of thousand examples without parallel processing. Kernel methods owe their name to the use of kernel functions, which enable them to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space. This operation is often computationally cheaper than the explicit computation of the coordinates. This approach is called the "kernel trick". Kernel functions have been introduced for sequence data, graphs, text, images, as well as vectors. Algorithms capable of operating with kernels include the kernel perceptron, support-vector machines (SVM), Gaussian processes, principal components analysis (PCA), canonical correlation analysis, ridge regression, spectral clustering, linear adaptive filters and many others. Most kernel algorithms are based on convex optimization or eigenproblems and are statistically well-founded. Typically, their statistical properties are analyzed using statistical learning theory (for example, using Rademacher complexity). == Motivation and informal explanation == Kernel methods can be thought of as instance-based learners: rather than learning some fixed set of parameters corresponding to the features of their inputs, they instead "remember" the i {\displaystyle i} -th training example ( x i , y i ) {\displaystyle (\mathbf {x} _{i},y_{i})} and learn for it a corresponding weight w i {\displaystyle w_{i}} . Prediction for unlabeled inputs, i.e., those not in the training set, are treated by the application of a similarity function k {\displaystyle k} , called a kernel, between the unlabeled input x ′ {\displaystyle \mathbf {x'} } and each of the training inputs x i {\displaystyle \mathbf {x} _{i}} . For instance, a kernelized binary classifier typically computes a weighted sum of similarities y ^ = sgn ⁡ ∑ i = 1 n w i y i k ( x i , x ′ ) , {\displaystyle {\hat {y}}=\operatorname {sgn} \sum _{i=1}^{n}w_{i}y_{i}k(\mathbf {x} _{i},\mathbf {x'} ),} where y ^ ∈ { − 1 , + 1 } {\displaystyle {\hat {y}}\in \{-1,+1\}} is the kernelized binary classifier's predicted label for the unlabeled input x ′ {\displaystyle \mathbf {x'} } whose hidden true label y {\displaystyle y} is of interest; k : X × X → R {\displaystyle k\colon {\mathcal {X}}\times {\mathcal {X}}\to \mathbb {R} } is the kernel function that measures similarity between any pair of inputs x , x ′ ∈ X {\displaystyle \mathbf {x} ,\mathbf {x'} \in {\mathcal {X}}} ; the sum ranges over the n labeled examples { ( x i , y i ) } i = 1 n {\displaystyle \{(\mathbf {x} _{i},y_{i})\}_{i=1}^{n}} in the classifier's training set, with y i ∈ { − 1 , + 1 } {\displaystyle y_{i}\in \{-1,+1\}} ; the w i ∈ R {\displaystyle w_{i}\in \mathbb {R} } are the weights for the training examples, as determined by the learning algorithm; the sign function sgn {\displaystyle \operatorname {sgn} } determines whether the predicted classification y ^ {\displaystyle {\hat {y}}} comes out positive or negative. Kernel classifiers were described as early as the 1960s, with the invention of the kernel perceptron. They rose to great prominence with the popularity of the support-vector machine (SVM) in the 1990s, when the SVM was found to be competitive with neural networks on tasks such as handwriting recognition. == Mathematics: the kernel trick == The kernel trick avoids the explicit mapping that is needed to get linear learning algorithms to learn a nonlinear function or decision boundary. For all x {\displaystyle \mathbf {x} } and x ′ {\displaystyle \mathbf {x'} } in the input space X {\displaystyle {\mathcal {X}}} , certain functions k ( x , x ′ ) {\displaystyle k(\mathbf {x} ,\mathbf {x'} )} can be expressed as an inner product in another space V {\displaystyle {\mathcal {V}}} . The function k : X × X → R {\displaystyle k\colon {\mathcal {X}}\times {\mathcal {X}}\to \mathbb {R} } is often referred to as a kernel or a kernel function. The word "kernel" is used in mathematics to denote a weighting function for a weighted sum or integral. Certain problems in machine learning have more structure than an arbitrary weighting function k {\displaystyle k} . The computation is made much simpler if the kernel can be written in the form of a "feature map" φ : X → V {\displaystyle \varphi \colon {\mathcal {X}}\to {\mathcal {V}}} which satisfies k ( x , x ′ ) = ⟨ φ ( x ) , φ ( x ′ ) ⟩ V . {\displaystyle k(\mathbf {x} ,\mathbf {x'} )=\langle \varphi (\mathbf {x} ),\varphi (\mathbf {x'} )\rangle _{\mathcal {V}}.} The key restriction is that ⟨ ⋅ , ⋅ ⟩ V {\displaystyle \langle \cdot ,\cdot \rangle _{\mathcal {V}}} must be a proper inner product. On the other hand, an explicit representation for φ {\displaystyle \varphi } is not necessary, as long as V {\displaystyle {\mathcal {V}}} is an inner product space. The alternative follows from Mercer's theorem: an implicitly defined function φ {\displaystyle \varphi } exists whenever the space X {\displaystyle {\mathcal {X}}} can be equipped with a suitable measure ensuring the function k {\displaystyle k} satisfies Mercer's condition. Mercer's theorem is similar to a generalization of the result from linear algebra that associates an inner product to any positive-definite matrix. In fact, Mercer's condition can be reduced to this simpler case. If we choose as our measure the counting measure μ ( T ) = | T | {\displaystyle \mu (T)=|T|} for all T ⊂ X {\displaystyle T\subset X} , which counts the number of points inside the set T {\displaystyle T} , then the integral in Mercer's theorem reduces to a summation ∑ i = 1 n ∑ j = 1 n k ( x i , x j ) c i c j ≥ 0. {\displaystyle \sum _{i=1}^{n}\sum _{j=1}^{n}k(\mathbf {x} _{i},\mathbf {x} _{j})c_{i}c_{j}\geq 0.} If this summation holds for all finite sequences of points ( x 1 , … , x n ) {\displaystyle (\mathbf {x} _{1},\dotsc ,\mathbf {x} _{n})} in X {\displaystyle {\mathcal {X}}} and all choices of n {\displaystyle n} real-valued coefficients ( c 1 , … , c n ) {\displaystyle (c_{1},\dots ,c_{n})} (cf. positive definite kernel), then the function k {\displaystyle k} satisfies Mercer's condition. Some algorithms that depend on arbitrary relationships in the native space X {\displaystyle {\mathcal {X}}} would, in fact, have a linear interpretation in a different setting: the range space of φ {\displaystyle \varphi } . The linear interpretation gives us insight about the algorithm. Furthermore, there is often no need to compute φ {\displaystyle \varphi } directly during computation, as is the case with support-vector machines. Some cite this running time shortcut as the primary benefit. Researchers also use it to justify the meanings and properties of existing algorithms. Theoretically, a Gram matrix K ∈ R n × n {\displaystyle \mathbf {K} \in \mathbb {R} ^{n\times n}} with respect to { x 1 , … , x n } {\displaystyle \{\mathbf {x} _{1},\dotsc ,\mathbf {x} _{n}\}} (sometimes also called a "kernel matrix"), where K i j = k ( x i , x j ) {\displaystyle K_{ij}=k(\mathbf {x} _{i},\mathbf {x} _{j})} , must be positive semi-definite (PSD). Empirically, for machine learning heuristics, choices of a function k {\displaystyle k} that do not satisfy Mercer's condition may still perform reasonably if k {\displaystyle k} at least approximates the intuitive idea of similarity. Regardless of whether k {\displaystyle k} is a Mercer kernel, k {\displaystyle k} may still be referred to as a "kernel". If the kernel function k {\displaystyle k} is also a covariance function as used in Gaussian processes, then the Gram matrix K {\displaystyle \mathbf {K} } can also be called a covariance matrix. == Applications == Application areas of kernel methods are diverse and include geostatistics, kriging, inverse distance weighting, 3D reconstruction, bioinformatics, cheminformatics, information extraction and handwriting recognition. == Popular kernels == Fisher kernel Graph kernels Kernel smoother Polynomial kernel Radial basis function kern

    Read more →
  • IruSoft

    IruSoft

    IruSoft (Arabic: آيروسوفت) is an insurance regulatory platform designated for licensing, supervision and inspection of the insurance sector within a country. The platform introduced unique supervision-technology (suptech), insurance-technology (insurtech) and regulatory-technology (regtech) automated modules by which a regulator requires less resources to ensure fairness, transparency and competition and to prevent conflicts of interest in the sector. IruSoft was founded by Abdullah Al-Salloum and owned by the Insurance Regulatory Unit in Kuwait. The Insurance Regulatory Unit optimized processing insurance-sector's customer complaints by issuing Resolution No. (1) of 2022 that introduced IruSoft's complaints public module; an automated resolution center, by which the process of receiving submitted complaints, passing them on to the platforms of licensed insurance companies, tracking matter-related discussions and updates and getting them escalated if unresolved to be discussed by a committee assigned by the unit is integrally automated and analyzed for better key performance indicators.

    Read more →
  • Elastic net regularization

    Elastic net regularization

    In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. Nevertheless, elastic net regularization is typically more accurate than both methods with regard to reconstruction. == Specification == The elastic net method overcomes the limitations of the LASSO (least absolute shrinkage and selection operator) method which uses a penalty function based on ‖ β ‖ 1 = ∑ j = 1 p | β j | . {\displaystyle \|\beta \|_{1}=\textstyle \sum _{j=1}^{p}|\beta _{j}|.} Use of this penalty function has several limitations. For example, in the "large p, small n" case (high-dimensional data with few examples), the LASSO selects at most n variables before it saturates. Also if there is a group of highly correlated variables, then the LASSO tends to select one variable from a group and ignore the others. To overcome these limitations, the elastic net adds a quadratic part ( ‖ β ‖ 2 {\displaystyle \|\beta \|^{2}} ) to the penalty, which when used alone is ridge regression (known also as Tikhonov regularization). The estimates from the elastic net method are defined by β ^ ≡ argmin β ( ‖ y − X β ‖ 2 + λ 2 ‖ β ‖ 2 + λ 1 ‖ β ‖ 1 ) . {\displaystyle {\hat {\beta }}\equiv {\underset {\beta }{\operatorname {argmin} }}(\|y-X\beta \|^{2}+\lambda _{2}\|\beta \|^{2}+\lambda _{1}\|\beta \|_{1}).} The quadratic penalty term makes the loss function strongly convex, and it therefore has a unique minimum. The elastic net method includes the LASSO and ridge regression: in other words, each of them is a special case where λ 1 = λ , λ 2 = 0 {\displaystyle \lambda _{1}=\lambda ,\lambda _{2}=0} or λ 1 = 0 , λ 2 = λ {\displaystyle \lambda _{1}=0,\lambda _{2}=\lambda } . Meanwhile, the naive version of elastic net method finds an estimator in a two-stage procedure : first for each fixed λ 2 {\displaystyle \lambda _{2}} it finds the ridge regression coefficients, and then does a LASSO type shrinkage. This kind of estimation incurs a double amount of shrinkage, which leads to increased bias and poor predictions. To improve the prediction performance, sometimes the coefficients of the naive version of elastic net is rescaled by multiplying the estimated coefficients by ( 1 + λ 2 ) {\displaystyle (1+\lambda _{2})} . Examples of where the elastic net method has been applied are: Support vector machine Metric learning Portfolio optimization Cancer prognosis == Reduction to support vector machine == It was proven in 2014 that the elastic net can be reduced to the linear support vector machine. A similar reduction was previously proven for the LASSO in 2014. The authors showed that for every instance of the elastic net, an artificial binary classification problem can be constructed such that the hyper-plane solution of a linear support vector machine (SVM) is identical to the solution β {\displaystyle \beta } (after re-scaling). The reduction immediately enables the use of highly optimized SVM solvers for elastic net problems. It also enables the use of GPU acceleration, which is often already used for large-scale SVM solvers. The reduction is a simple transformation of the original data and regularization constants X ∈ R n × p , y ∈ R n , λ 1 ≥ 0 , λ 2 ≥ 0 {\displaystyle X\in {\mathbb {R} }^{n\times p},y\in {\mathbb {R} }^{n},\lambda _{1}\geq 0,\lambda _{2}\geq 0} into new artificial data instances and a regularization constant that specify a binary classification problem and the SVM regularization constant X 2 ∈ R 2 p × n , y 2 ∈ { − 1 , 1 } 2 p , C ≥ 0. {\displaystyle X_{2}\in {\mathbb {R} }^{2p\times n},y_{2}\in \{-1,1\}^{2p},C\geq 0.} Here, y 2 {\displaystyle y_{2}} consists of binary labels − 1 , 1 {\displaystyle {-1,1}} . When 2 p > n {\displaystyle 2p>n} it is typically faster to solve the linear SVM in the primal, whereas otherwise the dual formulation is faster. Some authors have referred to the transformation as Support Vector Elastic Net (SVEN), and provided the following MATLAB pseudo-code: == Software == "Glmnet: Lasso and elastic-net regularized generalized linear models" is a software which is implemented as an R source package and as a MATLAB toolbox. This includes fast algorithms for estimation of generalized linear models with ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two penalties (the elastic net) using cyclical coordinate descent, computed along a regularization path. JMP Pro 11 includes elastic net regularization, using the Generalized Regression personality with Fit Model. "pensim: Simulation of high-dimensional data and parallelized repeated penalized regression" implements an alternate, parallelised "2D" tuning method of the ℓ parameters, a method claimed to result in improved prediction accuracy. scikit-learn includes linear regression and logistic regression with elastic net regularization. SVEN, a Matlab implementation of Support Vector Elastic Net. This solver reduces the Elastic Net problem to an instance of SVM binary classification and uses a Matlab SVM solver to find the solution. Because SVM is easily parallelizable, the code can be faster than Glmnet on modern hardware. SpaSM, a Matlab implementation of sparse regression, classification and principal component analysis, including elastic net regularized regression. Apache Spark provides support for Elastic Net Regression in its MLlib machine learning library. The method is available as a parameter of the more general LinearRegression class. SAS (software) The SAS procedure Glmselect and SAS Viya procedure Regselect support the use of elastic net regularization for model selection.

    Read more →
  • Neocognitron

    Neocognitron

    The neocognitron is a hierarchical, multilayered artificial neural network proposed by Kunihiko Fukushima in 1979. It has been used for Japanese handwritten character recognition and other pattern recognition tasks, and served as the inspiration for convolutional neural networks. Previously in 1969, he published a similar architecture, but with hand-designed kernels inspired by convolutions in mammalian vision. In 1975 he improved it to the Cognitron, and in 1979 he improved it to the neocognitron, which learns all convolutional kernels by unsupervised learning (in his terminology, "self-organized by 'learning without a teacher'"). The neocognitron was inspired by the model proposed by Hubel & Wiesel in 1959. They found two types of cells in the visual primary cortex called simple cell and complex cell, and also proposed a cascading model of these two types of cells for use in pattern recognition tasks. The neocognitron is a natural extension of these cascading models. The neocognitron consists of multiple types of cells, the most important of which are called S-cells and C-cells. The local features are extracted by S-cells, and these features' deformation, such as local shifts, are tolerated by C-cells. Local features in the input are integrated gradually and classified in the higher layers. The idea of local feature integration is found in several other models, such as the Convolutional Neural Network model, the SIFT method, and the HoG method. There are various kinds of neocognitron. For example, some types of neocognitron can detect multiple patterns in the same input by using backward signals to achieve selective attention.

    Read more →
  • Proximal policy optimization

    Proximal policy optimization

    Proximal policy optimization (PPO) is a reinforcement learning (RL) algorithm for training an intelligent agent. Specifically, it is a policy gradient method, often used for deep RL when the policy network is very large. == History == The predecessor to PPO, Trust Region Policy Optimization (TRPO), was published in 2015. It addressed the instability issue of another algorithm, the Deep Q-Network (DQN), by using the trust region method to limit the KL divergence between the old and new policies. However, TRPO uses the Hessian matrix (a matrix of second derivatives) to enforce the trust region, but the Hessian is inefficient for large-scale problems. PPO was published in 2017. It was essentially an approximation of TRPO that does not require computing the Hessian. The KL divergence constraint was approximated by simply clipping the policy gradient. Since 2018, PPO was the default RL algorithm at OpenAI. PPO has been applied to many areas, such as controlling a robotic arm, beating professional players at Dota 2 (OpenAI Five), and playing Atari games. == TRPO == TRPO, the predecessor of PPO, is an on-policy algorithm. It can be used for environments with either discrete or continuous action spaces. The pseudocode is as follows: Input: initial policy parameters θ 0 {\textstyle \theta _{0}} , initial value function parameters ϕ 0 {\textstyle \phi _{0}} Hyperparameters: KL-divergence limit δ {\textstyle \delta } , backtracking coefficient α {\textstyle \alpha } , maximum number of backtracking steps K {\textstyle K} for k = 0 , 1 , 2 , … {\textstyle k=0,1,2,\ldots } do Collect set of trajectories D k = { τ i } {\textstyle {\mathcal {D}}_{k}=\left\{\tau _{i}\right\}} by running policy π k = π ( θ k ) {\textstyle \pi _{k}=\pi \left(\theta _{k}\right)} in the environment. Compute rewards-to-go R ^ t {\textstyle {\hat {R}}_{t}} . Compute advantage estimates, A ^ t {\textstyle {\hat {A}}_{t}} (using any method of advantage estimation) based on the current value function V ϕ k {\textstyle V_{\phi _{k}}} . Estimate policy gradient as g ^ k = 1 | D k | ∑ τ ∈ D k ∑ t = 0 T ∇ θ log ⁡ π θ ( a t ∣ s t ) | θ k A ^ t {\displaystyle {\hat {g}}_{k}=\left.{\frac {1}{\left|{\mathcal {D}}_{k}\right|}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\nabla _{\theta }\log \pi _{\theta }\left(a_{t}\mid s_{t}\right)\right|_{\theta _{k}}{\hat {A}}_{t}} Use the conjugate gradient algorithm to compute x ^ k ≈ H ^ k − 1 g ^ k {\displaystyle {\hat {x}}_{k}\approx {\hat {H}}_{k}^{-1}{\hat {g}}_{k}} where H ^ k {\textstyle {\hat {H}}_{k}} is the Hessian of the sample average KL-divergence. Update the policy by backtracking line search with θ k + 1 = θ k + α j 2 δ x ^ k T H ^ k x ^ k x ^ k {\displaystyle \theta _{k+1}=\theta _{k}+\alpha ^{j}{\sqrt {\frac {2\delta }{{\hat {x}}_{k}^{T}{\hat {H}}_{k}{\hat {x}}_{k}}}}{\hat {x}}_{k}} where j ∈ { 0 , 1 , 2 , … K } {\textstyle j\in \{0,1,2,\ldots K\}} is the smallest value which improves the sample loss and satisfies the sample KL-divergence constraint. Fit value function by regression on mean-squared error: ϕ k + 1 = arg ⁡ min ϕ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T ( V ϕ ( s t ) − R ^ t ) 2 {\displaystyle \phi _{k+1}=\arg \min _{\phi }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\left(V_{\phi }\left(s_{t}\right)-{\hat {R}}_{t}\right)^{2}} typically via some gradient descent algorithm. == PPO == The pseudocode is as follows: Input: initial policy parameters θ 0 {\textstyle \theta _{0}} , initial value function parameters ϕ 0 {\textstyle \phi _{0}} for k = 0 , 1 , 2 , … {\textstyle k=0,1,2,\ldots } do Collect set of trajectories D k = { τ i } {\textstyle {\mathcal {D}}_{k}=\left\{\tau _{i}\right\}} by running policy π k = π ( θ k ) {\textstyle \pi _{k}=\pi \left(\theta _{k}\right)} in the environment. Compute rewards-to-go R ^ t {\textstyle {\hat {R}}_{t}} . Compute advantage estimates, A ^ t {\textstyle {\hat {A}}_{t}} (using any method of advantage estimation) based on the current value function V ϕ k {\textstyle V_{\phi _{k}}} . Update the policy by maximizing the PPO-Clip objective: θ k + 1 = arg ⁡ max θ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T min ( π θ ( a t ∣ s t ) π θ k ( a t ∣ s t ) A π θ k ( s t , a t ) , g ( ϵ , A π θ k ( s t , a t ) ) ) {\displaystyle \theta _{k+1}=\arg \max _{\theta }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\min \left({\frac {\pi _{\theta }\left(a_{t}\mid s_{t}\right)}{\pi _{\theta _{k}}\left(a_{t}\mid s_{t}\right)}}A^{\pi _{\theta _{k}}}\left(s_{t},a_{t}\right),\quad g\left(\epsilon ,A^{\pi _{\theta _{k}}}\left(s_{t},a_{t}\right)\right)\right)} typically via stochastic gradient ascent with Adam. Fit value function by regression on mean-squared error: ϕ k + 1 = arg ⁡ min ϕ 1 | D k | T ∑ τ ∈ D k ∑ t = 0 T ( V ϕ ( s t ) − R ^ t ) 2 {\displaystyle \phi _{k+1}=\arg \min _{\phi }{\frac {1}{\left|{\mathcal {D}}_{k}\right|T}}\sum _{\tau \in {\mathcal {D}}_{k}}\sum _{t=0}^{T}\left(V_{\phi }\left(s_{t}\right)-{\hat {R}}_{t}\right)^{2}} typically via some gradient descent algorithm. Like all policy gradient methods, PPO is used for training an RL agent whose actions are determined by a differentiable policy function by gradient ascent. Intuitively, a policy gradient method takes small policy update steps, so the agent can reach higher and higher rewards in expectation. Policy gradient methods may be unstable: A step size that is too big may direct the policy in a suboptimal direction, thus having little possibility of recovery; a step size that is too small lowers the overall efficiency. To solve the instability, PPO implements a clip function that constrains the policy update of an agent from being too large, so that larger step sizes may be used without negatively affecting the gradient ascent process. === Basic concepts === To begin the PPO training process, the agent is set in an environment to perform actions based on its current input. In the early phase of training, the agent can freely explore solutions and keep track of the result. Later, with a certain amount of transition samples and policy updates, the agent will select an action to take by randomly sampling from the probability distribution P ( A | S ) {\displaystyle P(A|S)} generated by the policy network. The actions that are most likely to be beneficial will have the highest probability of being selected from the random sample. After an agent arrives at a different scenario (a new state) by acting, it is rewarded with a positive reward or a negative reward. The objective of an agent is to maximize the cumulative reward signal across sequences of states, known as episodes. === Policy gradient laws: the advantage function === The advantage function (denoted as A {\displaystyle A} ) is central to PPO, as it tries to answer the question of whether a specific action of the agent is better or worse than some other possible action in a given state. By definition, the advantage function is an estimate of the relative value for a selected action. If the output of this function is positive, it means that the action in question is better than the average return, so the possibilities of selecting that specific action will increase. The opposite is true for a negative advantage output. The advantage function can be defined as A = Q − V {\displaystyle A=Q-V} , where Q {\displaystyle Q} is the discounted sum of rewards (the total weighted reward for the completion of an episode) and V {\displaystyle V} is the baseline estimate. Since the advantage function is calculated after the completion of an episode, the program records the outcome of the episode. Therefore, calculating advantage is essentially an unsupervised learning problem. The baseline estimate comes from the value function that outputs the expected discounted sum of an episode starting from the current state. In the PPO algorithm, the baseline estimate will be noisy (with some variance), as it also uses a neural network, like the policy function itself. With Q {\displaystyle Q} and V {\displaystyle V} computed, the advantage function is calculated by subtracting the baseline estimate from the actual discounted return. If A > 0 {\displaystyle A>0} , the actual return of the action is better than the expected return from experience; if A < 0 {\displaystyle A<0} , the actual return is worse. === Ratio function === In PPO, the ratio function ( r t {\displaystyle r_{t}} ) calculates the probability of selecting action a {\displaystyle a} in state s {\displaystyle s} given the current policy network, divided by the previous probability under the old policy. In other words: If r t ( θ ) > 1 {\displaystyle r_{t}(\theta )>1} , where θ {\displaystyle \theta } are the policy network parameters, then selecting action a {\displaystyle a} in state s {\displaystyle s} is more likely based on the current policy than the previous policy. If 0 ≤ r t ( θ ) < 1 {\displaystyle 0\leq r_{t}(\theta )<1} , then selecting actio

    Read more →
  • AARON

    AARON

    AARON is the collective name for a series of computer programs written by artist Harold Cohen that create original artistic images autonomously, which set it apart from previous programs. Proceeding from Cohen's initial question "What are the minimum conditions under which a set of marks functions as an image?", AARON was in development between 1972 and the 2010s. As the software is not open source, its development effectively ended with Cohen's death in 2016. The name "AARON" does not seem to be an acronym; rather, it was a name chosen to start with the letter "A" so that the names of successive programs could follow it alphabetically. However, Cohen did not create any other major programs. Initial versions of AARON created abstract drawings that grew more complex through the 1970s. More representational imagery was added in the 1980s; first rocks, then plants, then people. In the 1990s more representational figures set in interior scenes were added, along with color. AARON returned to more abstract imagery, this time in color, in the early 2000s. Cohen used machines that allowed AARON to produce physical artwork. The first machines drew in black and white using a succession of custom-built "turtle" and flatbed plotter devices. Cohen would sometimes color these images by hand in fabric dye (Procion), or scale them up to make larger paintings and murals. In the 1990s Cohen built a series of digital painting machines to output AARON's images in ink and fabric dye. His later work used a large-scale inkjet printer on canvas. Development of AARON began in the C programming language then switched to Lisp in the early 1990s. Cohen credits Lisp with helping him solve the challenges he faced in adding color capabilities to AARON. An article about Cohen appeared in Computer Answers that describes AARON and shows two line drawings that were exhibited at the Tate gallery. The article goes on to describe the workings of AARON, then running on a DEC VAX 750 minicomputer. Raymond Kurzweil's company has produced a downloadable screensaver of AARON for Microsoft Windows PCs. This version of AARON can also produce printable images. AARON's source code is not publicly available, but Cohen has described AARON's operations in various essays and it is discussed in abstract in Pamela McCorduck's book. AARON cannot learn new styles or imagery on its own; each new capability must be hand-coded by Cohen. It is capable of producing a practically infinite supply of distinct images in its own style. Examples of these images have been exhibited in galleries worldwide. AARON's artwork has been used as an artistic equivalent of the Turing test. It does seem however that AARON's output follows a noticeable formula (figures standing next to a potted plant, framed within a colored square is a common theme). Cohen is very careful not to claim that AARON is creative. But he does ask "If what AARON is making is not art, what is it exactly, and in what ways, other than its origin, does it differ from the 'real thing?' If it is not thinking, what exactly is it doing?" — The further exploits of AARON, Painter. The Whitney Museum featured AARON in 2024, showcasing the evolution of AARON as the earliest artificial intelligence (AI) program for artmaking.

    Read more →
  • Multispectral pattern recognition

    Multispectral pattern recognition

    Multispectral remote sensing is the collection and analysis of reflected, emitted, or back-scattered energy from an object or an area of interest in multiple bands of regions of the electromagnetic spectrum (Jensen, 2005). Subcategories of multispectral remote sensing include hyperspectral, in which hundreds of bands are collected and analyzed, and ultraspectral remote sensing where many hundreds of bands are used (Logicon, 1997). The main purpose of multispectral imaging is the potential to classify the image using multispectral classification. This is a much faster method of image analysis than is possible by human interpretation. == Multispectral remote sensing systems == Remote sensing systems gather data via instruments typically carried on satellites in orbit around the Earth. The remote sensing scanner detects the energy that radiates from the object or area of interest. This energy is recorded as an analog electrical signal and converted into a digital value though an A-to-D conversion. There are several multispectral remote sensing systems that can be categorized in the following way: === Multispectral imaging using discrete detectors and scanning mirrors === Landsat Multispectral Scanner (MSS) Landsat Thematic Mapper (TM) NOAA Geostationary Operational Environmental Satellite (GOES) NOAA Advanced Very High Resolution Radiometer (AVHRR) NASA and ORBIMAGE, Inc., Sea-viewing Wide field-of-view Sensor (SeaWiFS) Daedalus, Inc., Aircraft Multispectral Scanner (AMS) NASA Airborne Terrestrial Applications Sensor (ATLAS) === Multispectral imaging using linear arrays === SPOT 1, 2, and 3 High Resolution Visible (HRV) sensors and Spot 4 and 5 High Resolution Visible Infrared (HRVIR) and vegetation sensor Indian Remote Sensing System (IRS) Linear Imaging Self-scanning Sensor (LISS) Space Imaging, Inc. (IKONOS) Digital Globe, Inc. (QuickBird) ORBIMAGE, Inc. (OrbView-3) ImageSat International, Inc. (EROS A1) NASA Terra Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) NASA Terra Multiangle Imaging Spectroradiometer (MISR) === Imaging spectrometry using linear and area arrays === NASA Jet Propulsion Laboratory Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Compact Airborne Spectrographic Imager 3 (CASI 3) NASA Terra Moderate Resolution Imaging Spectrometer (MODIS) NASA Earth Observer (EO-1) Advanced Land Imager (ALI), Hyperion, and LEISA Atmospheric Corrector (LAC) === Satellite analog and digital photographic systems === Russian SPIN-2 TK-350, and KVR-1000 NASA Space Shuttle and International Space Station Imagery == Multispectral classification methods == A variety of methods can be used for the multispectral classification of images: Algorithms based on parametric and nonparametric statistics that use ratio-and interval-scaled data and nonmetric methods that can also incorporate nominal scale data (Duda et al., 2001), Supervised or unsupervised classification logic, Hard or soft (fuzzy) set classification logic to create hard or fuzzy thematic output products, Per-pixel or object-oriented classification logic, and Hybrid approaches == Supervised classification == In this classification method, the identity and location of some of the land-cover types are obtained beforehand from a combination of fieldwork, interpretation of aerial photography, map analysis, and personal experience. The analyst would locate sites that have similar characteristics to the known land-cover types. These areas are known as training sites because the known characteristics of these sites are used to train the classification algorithm for eventual land-cover mapping of the remainder of the image. Multivariate statistical parameters (means, standard deviations, covariance matrices, correlation matrices, etc.) are calculated for each training site. All pixels inside and outside of the training sites are evaluated and allocated to the class with the more similar characteristics. === Classification scheme === The first step in the supervised classification method is to identify the land-cover and land-use classes to be used. Land-cover refers to the type of material present on the site (e.g. water, crops, forest, wet land, asphalt, and concrete). Land-use refers to the modifications made by people to the land cover (e.g. agriculture, commerce, settlement). All classes should be selected and defined carefully to properly classify remotely sensed data into the correct land-use and/or land-cover information. To achieve this purpose, it is necessary to use a classification system that contains taxonomically correct definitions of classes. If a hard classification is desired, the following classes should be used: Mutually exclusive: there is not any taxonomic overlap of any classes (i.e., rain forest and evergreen forest are distinct classes). Exhaustive: all land-covers in the area have been included. Hierarchical: sub-level classes (e.g., single-family residential, multiple-family residential) are created, allowing that these classes can be included in a higher category (e.g., residential). Some examples of hard classification schemes are: American Planning Association Land-Based Classification System United States Geological Survey Land-use/Land-cover Classification System for Use with Remote Sensor Data U.S. Department of the Interior Fish and Wildlife Service U.S. National Vegetation and Classification System International Geosphere-Biosphere Program IGBP Land Cover Classification System === Training sites === Once the classification scheme is adopted, the image analyst may select training sites in the image that are representative of the land-cover or land-use of interest. If the environment where the data was collected is relatively homogeneous, the training data can be used. If different conditions are found in the site, it would not be possible to extend the remote sensing training data to the site. To solve this problem, a geographical stratification should be done during the preliminary stages of the project. All differences should be recorded (e.g. soil type, water turbidity, crop species, etc.). These differences should be recorded on the imagery and the selection training sites made based on the geographical stratification of this data. The final classification map would be a composite of the individual stratum classifications. After the data are organized in different training sites, a measurement vector is created. This vector would contain the brightness values for each pixel in each band in each training class. The mean, standard deviation, variance-covariance matrix, and correlation matrix are calculated from the measurement vectors. Once the statistics from each training site are determined, the most effective bands for each class should be selected. The objective of this discrimination is to eliminate the bands that can provide redundant information. Graphical and statistical methods can be used to achieve this objective. Some of the graphic methods are: Bar graph spectral plots Cospectral mean vector plots Feature space plots Cospectral parallelepiped or ellipse plots === Classification algorithm === The last step in supervised classification is selecting an appropriate algorithm. The choice of a specific algorithm depends on the input data and the desired output. Parametric algorithms are based on the fact that the data is normally distributed. If the data is not normally distributed, nonparametric algorithms should be used. The more common nonparametric algorithms are: One-dimensional density slicing Parallelipiped Minimum distance Nearest-neighbor Expert system analysis Convolutional neural network == Unsupervised classification == Unsupervised classification (also known as clustering) is a method of partitioning remote sensor image data in multispectral feature space and extracting land-cover information. Unsupervised classification require less input information from the analyst compared to supervised classification because clustering does not require training data. This process consists in a series of numerical operations to search for the spectral properties of pixels. From this process, a map with m spectral classes is obtained. Using the map, the analyst tries to assign or transform the spectral classes into thematic information of interest (i.e. forest, agriculture, urban). This process may not be easy because some spectral clusters represent mixed classes of surface materials and may not be useful. The analyst has to understand the spectral characteristics of the terrain to be able to label clusters as a specific information class. There are hundreds of clustering algorithms. Two of the most conceptually simple algorithms are the chain method and the ISODATA method. === Chain method === The algorithm used in this method operates in a two-pass mode (it passes through the multispectral dataset two times. In the first pass, the program reads through the dataset and sequentially builds clusters (groups of p

    Read more →
  • Stochastic variance reduction

    Stochastic variance reduction

    (Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical Stochastic approximation setting. Variance reduction approaches are widely used for training machine learning models such as logistic regression and support vector machines as these problems have finite-sum structure and uniform conditioning that make them ideal candidates for variance reduction. == Finite sum objectives == A function f {\displaystyle f} is considered to have finite sum structure if it can be decomposed into a summation or average: f ( x ) = 1 n ∑ i = 1 n f i ( x ) , {\displaystyle f(x)={\frac {1}{n}}\sum _{i=1}^{n}f_{i}(x),} where the function value and derivative of each f i {\displaystyle f_{i}} can be queried independently. Although variance reduction methods can be applied for any positive n {\displaystyle n} and any f i {\displaystyle f_{i}} structure, their favorable theoretical and practical properties arise when n {\displaystyle n} is large compared to the condition number of each f i {\displaystyle f_{i}} , and when the f i {\displaystyle f_{i}} have similar (but not necessarily identical) Lipschitz smoothness and strong convexity constants. The finite sum structure should be contrasted with the stochastic approximation setting which deals with functions of the form f ( θ ) = E ξ ⁡ [ F ( θ , ξ ) ] {\textstyle f(\theta )=\operatorname {E} _{\xi }[F(\theta ,\xi )]} which is the expected value of a function depending on a random variable ξ {\textstyle \xi } . Any finite sum problem can be optimized using a stochastic approximation algorithm by using F ( ⋅ , ξ ) = f ξ {\displaystyle F(\cdot ,\xi )=f_{\xi }} . == Rapid Convergence == Stochastic variance reduced methods without acceleration are able to find a minima of f {\displaystyle f} within accuracy ϵ > {\displaystyle \epsilon >} , i.e. f ( x ) − f ( x ∗ ) ≤ ϵ {\displaystyle f(x)-f(x_{})\leq \epsilon } in a number of steps of the order: O ( ( L μ + n ) log ⁡ ( 1 ϵ ) ) . {\displaystyle O\left(\left({\frac {L}{\mu }}+n\right)\log \left({\frac {1}{\epsilon }}\right)\right).} The number of steps depends only logarithmically on the level of accuracy required, in contrast to the stochastic approximation framework, where the number of steps O ( L / ( μ ϵ ) ) {\displaystyle O{\bigl (}L/(\mu \epsilon ){\bigr )}} required grows proportionally to the accuracy required. Stochastic variance reduction methods converge almost as fast as the gradient descent method's O ( ( L / μ ) log ⁡ ( 1 / ϵ ) ) {\displaystyle O{\bigl (}(L/\mu )\log(1/\epsilon ){\bigr )}} rate, despite using only a stochastic gradient, at a 1 / n {\displaystyle 1/n} lower cost than gradient descent. Accelerated methods in the stochastic variance reduction framework achieve even faster convergence rates, requiring only O ( ( n L μ + n ) log ⁡ ( 1 ϵ ) ) {\displaystyle O\left(\left({\sqrt {\frac {nL}{\mu }}}+n\right)\log \left({\frac {1}{\epsilon }}\right)\right)} steps to reach ϵ {\displaystyle \epsilon } accuracy, potentially n {\displaystyle {\sqrt {n}}} faster than non-accelerated methods. Lower complexity bounds. for the finite sum class establish that this rate is the fastest possible for smooth strongly convex problems. == Approaches == Variance reduction approaches fall within four main categories: table averaging methods, full-gradient snapshot methods, recursive estimator methods (e.g., SARAH), and dual methods. Each category contains methods designed for dealing with convex, non-smooth, and non-convex problems, each differing in hyper-parameter settings and other algorithmic details. === SAGA === In the SAGA method, the prototypical table averaging approach, a table of size n {\displaystyle n} is maintained that contains the last gradient witnessed for each f i {\displaystyle f_{i}} term, which we denote g i {\displaystyle g_{i}} . At each step, an index i {\displaystyle i} is sampled, and a new gradient ∇ f i ( x k ) {\displaystyle \nabla f_{i}(x_{k})} is computed. The iterate x k {\displaystyle x_{k}} is updated with: x k + 1 = x k − γ [ ∇ f i ( x k ) − g i + 1 n ∑ i = 1 n g i ] , {\displaystyle x_{k+1}=x_{k}-\gamma \left[\nabla f_{i}(x_{k})-g_{i}+{\frac {1}{n}}\sum _{i=1}^{n}g_{i}\right],} and afterwards table entry i {\displaystyle i} is updated with g i = ∇ f i ( x k ) {\displaystyle g_{i}=\nabla f_{i}(x_{k})} . SAGA is among the most popular of the variance reduction methods due to its simplicity, easily adaptable theory, and excellent performance. It is the successor of the SAG method, improving on its flexibility and performance. === SVRG === The stochastic variance reduced gradient method (SVRG), the prototypical snapshot method, uses a similar update except instead of using the average of a table it instead uses a full-gradient that is reevaluated at a snapshot point x ~ {\displaystyle {\tilde {x}}} at regular intervals of m ≥ n {\displaystyle m\geq n} iterations. The update becomes: x k + 1 = x k − γ [ ∇ f i ( x k ) − ∇ f i ( x ~ ) + ∇ f ( x ~ ) ] , {\displaystyle x_{k+1}=x_{k}-\gamma [\nabla f_{i}(x_{k})-\nabla f_{i}({\tilde {x}})+\nabla f({\tilde {x}})],} This approach requires two stochastic gradient evaluations per step, one to compute ∇ f i ( x k ) {\displaystyle \nabla f_{i}(x_{k})} and one to compute ∇ f i ( x ~ ) , {\displaystyle \nabla f_{i}({\tilde {x}}),} where-as table averaging approaches need only one. Despite the high computational cost, SVRG is popular as its simple convergence theory is highly adaptable to new optimization settings. It also has lower storage requirements than tabular averaging approaches, which make it applicable in many settings where tabular methods can not be used. === SARAH === The SARAH (stochastic recursive gradient) method maintains a recursive estimator of the gradient rather than storing a table of past gradients (as in SAGA) or computing periodic full-gradient snapshots (as in SVRG). At the start of an inner loop, a full gradient is computed at a reference point x ~ {\displaystyle {\tilde {x}}} : v 0 = ∇ f ( x ~ ) {\displaystyle v_{0}=\nabla f({\tilde {x}})} . For inner iterations, with a sampled index i k {\displaystyle i_{k}} , the gradient estimator and iterate are updated by: v k = ∇ f i k ( x k ) − ∇ f i k ( x k − 1 ) + v k − 1 , x k + 1 = x k − γ v k . {\displaystyle v_{k}=\nabla f_{i_{k}}(x_{k})-\nabla f_{i_{k}}(x_{k-1})+v_{k-1},\qquad x_{k+1}=x_{k}-\gamma v_{k}.} This recursion requires two component-gradient evaluations per step ∇ f i k ( x k ) {\displaystyle \nabla f_{i_{k}}(x_{k})} and ∇ f i k ( x k − 1 ) {\displaystyle \nabla f_{i_{k}}(x_{k-1})} but does not need to store per-sample gradients, resulting in lower memory cost than table-averaging methods. SARAH admits linear convergence for strongly convex functions and has been extended to more general nonconvex and composite problems. === SDCA === Exploiting the dual representation of the objective leads to another variance reduction approach that is particularly suited to finite-sums where each term has a structure that makes computing the convex conjugate f i ∗ , {\displaystyle f_{i}^{},} or its proximal operator tractable. The standard SDCA method considers finite sums that have additional structure compared to generic finite sum setting: f ( x ) = 1 n ∑ i = 1 n f i ( x T v i ) + λ 2 ‖ x ‖ 2 , {\displaystyle f(x)={\frac {1}{n}}\sum _{i=1}^{n}f_{i}(x^{T}v_{i})+{\frac {\lambda }{2}}\|x\|^{2},} where each f i {\displaystyle f_{i}} is 1 dimensional and each v i {\displaystyle v_{i}} is a data point associated with f i {\displaystyle f_{i}} . SDCA solves the dual problem: max α ∈ R n − 1 n ∑ i = 1 n f i ∗ ( − α i ) − λ 2 ‖ 1 λ n ∑ i = 1 n α i v i ‖ 2 , {\displaystyle \max _{\alpha \in \mathbb {R} ^{n}}-{\frac {1}{n}}\sum _{i=1}^{n}f_{i}^{}(-\alpha _{i})-{\frac {\lambda }{2}}\left\|{\frac {1}{\lambda n}}\sum _{i=1}^{n}\alpha _{i}v_{i}\right\|^{2},} by a stochastic coordinate ascent procedure, where at each step the objective is optimized with respect to a randomly chosen coordinate α i {\displaystyle \alpha _{i}} , leaving all other coordinates the same. An approximate primal solution x {\displaystyle x} can be recovered from the α {\displaystyle \alpha } values: x = 1 λ n ∑ i = 1 n α i v i {\displaystyle x={\frac {1}{\lambda n}}\sum _{i=1}^{n}\alpha _{i}v_{i}} . This method obtains similar theoretical rates of convergence to other stochastic variance reduced methods, while avoiding the need to specify a step-size parameter. It is fast in practice when λ {\displaystyle \lambda } is large, but significantly slower than the other approaches when λ {\displaystyle \lambda } is small. == Accelerated approaches == Accelerated variance reduction methods are built upon the standard methods above. The earliest approaches make use of proximal operators t

    Read more →
  • Random neural network

    Random neural network

    The Random Neural Network (RNN) is a mathematical representation of an interconnected network of neurons or cells which exchange spiking signals. It was invented by Erol Gelenbe and is linked to the G-network model of queueing networks which Erol Gelenbe also invented, and with his Gene Regulatory Network models. In this model, each neuronal cell state is represented by an integer whose value rises when the cell receives an excitatory spike and drops when it receives an inhibitory spike. The spikes can originate outside the network itself, or they can come from other cells in the networks. Cells whose internal excitatory state has a positive value are allowed to send out spikes of either kind to other cells in the network according to specific cell-dependent spiking rates. The model has a mathematical solution in steady-state which provides the joint probability distribution of the network in terms of the individual probabilities that each cell is excited and able to send out spikes. Computing this solution is based on solving a set of non-linear algebraic equations whose parameters are related to the spiking rates of individual cells and their connectivity to other cells, as well as the arrival rates of spikes from outside the network. The RNN is a recurrent model, i.e. a neural network that is allowed to have complex feedback loops. A highly energy-efficient implementation of random neural networks was demonstrated by Krishna Palem et al. using the Probabilistic CMOS or PCMOS technology and was shown to be c. 226–300 times more efficient in terms of Energy-Performance-Product. RNNs are also related to artificial neural networks, which (like the random neural network) have gradient-based learning algorithms. The learning algorithm for an n-node random neural network that includes feedback loops (it is also a recurrent neural network) is of computational complexity O(n^3) (the number of computations is proportional to the cube of n, the number of neurons). The random neural network can also be used with other learning algorithms such as reinforcement learning. The RNN has been shown to be a universal approximator for bounded and continuous functions.

    Read more →
  • Transderivational search

    Transderivational search

    Transderivational search (often abbreviated to TDS) is a psychological and cybernetics term, meaning when a search is being conducted for a fuzzy match across a broad field. In computing the equivalent function can be performed using content-addressable memory. Unlike usual searches, which look for literal (i.e. exact, logical, or regular expression) matches, a transderivational search is a search for a possible meaning or possible match as part of communication, and without which an incoming communication cannot be made any sense of whatsoever. It is thus an integral part of processing language, and of attaching meaning to communication. In NLP (Neuro-linguistic programming), a transderivational search (Bandler and Grinder, 1976) is essentially the process of searching back through one's stored memories and mental representations to find the personal reference experiences from which a current understanding or mental map has been derived. By the end of 1976, Grinder and Bandler had combined Satir’s and Perls’ language patterns and Erickson’s hypnotic language and use of metaphor with anchoring to create new processes that they called collapsing anchors, trans-derivational search, changing personal history, and reframing. A psychological example of TDS is in Ericksonian hypnotherapy, where vague suggestions are used that the patient must process intensely in order to find their own meanings, thus ensuring that the practitioner does not intrude his own beliefs into the subject's inner world. == TDS in human communication and processing == Because TDS is a compelling, automatic and unconscious state of internal focus and processing (i.e. a type of everyday trance state), and often a state of internal lack of certainty, or openness to finding an answer (since something is being checked out at that moment), it can be utilized or interrupted, in order to create, or deepen, trance. TDS is a fundamental part of human language and cognitive processing. Arguably, every word or utterance a person hears, for example, and everything they see or feel and take note of, results in a very brief trance while TDS is carried out to establish a contextual meaning for it. === Examples === Leading statements: "And those thoughts you had yesterday..." the human mind cannot process hearing this phrase, without at some level searching internally for some thoughts or other that it had yesterday, to make the subject of the sentence. "The many colors that fruit can be" likewise starts the human mind considering even if briefly, different fruit sorted by color. "You did it again, didn't you!" This everyday manipulative use of TDS usually sends the recipient looking internally for some "it" they may have done for which blame is being fairly given. Regardless of whether such a matter can be identified, guilt or anger may result. "There has been pain, hasn't there" the mind of a patient suffering an illness will find it very hard or impossible to hear or answer this sentence without conducting internal searches to verify whether this is true or not, or to find an example if so. "You'd forgotten something [or: some part of your body], hadn't you?" the mind usually checks through the various things, or parts of the body, on hearing this, seeing if each in turn has been forgotten. Textual ambiguity: "Do you remember line dancing on the steps?" Without sufficient context, some statements may trigger TDS in order to resolve inherent ambiguity in the interpretation of a posed question. Do I remember a bygone fad called "line dancing on the steps"? Do I remember personally engaging in dancing in the past? Do I remember my routine practice dancing by focusing on the steps of the dance? Do I tend to forget about dancing when I am standing on steps? "Penny-wise and pound the table dance to the beat of a different drummer". The mixing of cliché and stock phrases may trigger TDS in order to reconcile the discrepancies between expected and actual utterances in sequence. Although TDS is often associated with spoken language, it can be induced in any perceptual system. Thus Milton Erickson's "hypnotic handshake" is a technique that leaves the other person performing TDS in search of meaning to a deliberately ambiguous use of touch.

    Read more →
  • Logic learning machine

    Logic learning machine

    Logic learning machine (LLM) is a machine learning method based on the generation of intelligible rules. LLM is an efficient implementation of the Switching Neural Network (SNN) paradigm, developed by Marco Muselli, Senior Researcher at the Italian National Research Council CNR-IEIIT in Genoa. LLM has been employed in many different sectors, including the field of medicine (orthopedic patient classification, DNA micro-array analysis and Clinical Decision Support Systems), financial services and supply chain management. == History == The Switching Neural Network approach was developed in the 1990s to overcome the drawbacks of the most commonly used machine learning methods. In particular, black box methods, such as multilayer perceptron and support vector machine, had good accuracy but could not provide deep insight into the studied phenomenon. On the other hand, decision trees were able to describe the phenomenon but often lacked accuracy. Switching Neural Networks made use of Boolean algebra to build sets of intelligible rules able to obtain very good performance. In 2014, an efficient version of Switching Neural Network was developed and implemented in the Rulex suite with the name Logic Learning Machine. Also, an LLM version devoted to regression problems was developed. == General == Like other machine learning methods, LLM uses data to build a model able to perform a good forecast about future behaviors. LLM starts from a table including a target variable (output) and some inputs and generates a set of rules that return the output value y {\displaystyle y} corresponding to a given configuration of inputs. A rule is written in the form: if premise then consequence where consequence contains the output value whereas premise includes one or more conditions on the inputs. According to the input type, conditions can have different forms: for categorical variables the input value must be in a given subset: x 1 ∈ { A , B , C , . . . } {\displaystyle x_{1}\in \{A,B,C,...\}} . for ordered variables the condition is written as an inequality or an interval: x 2 ≤ α {\displaystyle x_{2}\leq \alpha } or β ≤ x 3 ≤ γ {\displaystyle \beta \leq x_{3}\leq \gamma } A possible rule is therefore in the form if x 1 ∈ { A , B , C , . . . } {\displaystyle x_{1}\in \{A,B,C,...\}} AND x 2 ≤ α {\displaystyle x_{2}\leq \alpha } AND β ≤ x 3 ≤ γ {\displaystyle \beta \leq x_{3}\leq \gamma } then y = y ¯ {\displaystyle y={\bar {y}}} == Types == According to the output type, different versions of the Logic Learning Machine have been developed: Logic Learning Machine for classification, when the output is a categorical variable, which can assume values in a finite set Logic Learning Machine for regression, when the output is an integer or real number.

    Read more →
  • Multiple discriminant analysis

    Multiple discriminant analysis

    Multiple Discriminant Analysis (MDA) is a multivariate dimensionality reduction technique. It has been used to predict signals as diverse as neural memory traces and corporate failure. MDA is not directly used to perform classification. It merely supports classification by yielding a compressed signal amenable to classification. The method described in Duda et al. (2001) §3.8.3 projects the multivariate signal down to an M−1 dimensional space where M is the number of categories. MDA is useful because most classifiers are strongly affected by the curse of dimensionality. In other words, when signals are represented in very-high-dimensional spaces, the classifier's performance is catastrophically impaired by the overfitting problem. This problem is reduced by compressing the signal down to a lower-dimensional space as MDA does. MDA has been used to reveal neural codes.

    Read more →