AI For Business Specialization Upenn

AI For Business Specialization Upenn — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Huawei Mobile Services

    Huawei Mobile Services

    Huawei Mobile Services (HMS) is a collection of proprietary services and high level application programming interfaces (APIs) developed by Huawei Technologies Co., Ltd. Its hub known as HMS Core serves as a toolkit for app development on Huawei devices. HMS is typically installed on Huawei devices on top of running HarmonyOS 4.x and earlier operating system on its earlier devices running the Android operating system with EMUI including devices already distributed with Google Mobile Services. Alongside, HMS Core Wear Engine for Android phones with lightweight based LiteOS wearable middleware app framework integration connectivity like notifications, status etc. HMS consists of seven key services and the HMS Core. The key services are Huawei ID, Huawei Cloud, AppGallery, Themes, Huawei Video, Browser, and Assistant. The web browser is based on Chromium. Huawei Quick Apps is the alternative to Google Instant Apps. By January 2020, over 50,000 apps had been integrated with HMS Core. Its rival, Google Mobile Services has 3 million apps on Google's Play Store. The AppGallery claimed 180 billion downloads in 2019. In March 2020, HMS was used by 650 million monthly active users across 170 countries. A Chinese phone manufacturer, LeTV, hosted a smartphone business communication meeting in Beijing on September 27, 2021, to demonstrate its phone, the LeTV S1. This was the first smartphone from a third-party manufacturer to include Huawei Mobile Services (HMS). == HMS on Android and HarmonyOS == Huawei Mobile Services on Android goes all the way back to August 2016 as Huawei ID services for phones, basic functionalities for Huawei P9 series. However, in May 2019 proved to be a significant change to HMS when Google was prohibited from working with Huawei on any new devices extending ecosystem for AppGallery store front launched in April 2018, year prior. This also included bundling Google's Apps, including Gmail, Maps and YouTube. Any new Huawei devices launched after 16 May 2019 were unable to receive updates from Google services and would be considered 'uncertified' meaning Huawei's only solution at the time was to turn HMS into a genuine competitor to Google and incentivize app developers to utilize the platform. Huawei officially launched Huawei Mobile Services in China on December 24, 2019, as a beta. Huawei expanded Huawei Mobile Services in Europe in February 2020 and other markets in Asia, Latin America, Middle East & Africa, Canada, Mexico followed outside banned US market. HMS is available on the Honor 9X Pro, View 30 Pro, Huawei Mate XS. HMS is also available, alongside GMS, on many other Huawei models launched before the ban. Huawei promised developers it would take, “less than 10 minutes", to port their app over to HMS - to illustrate the ease of portability between Google's Play Store and the HMS AppGallery. On January 15, 2020, HMS Core 4.0 (Huawei Mobile Services Core 4.0) was officially launched. Huawei announced that at this time, there were already 1.3 million developers and 55,000 applications on board. The next day, Huawei held a developer day event in London and invested £20 million to encourage developers in the United Kingdom and Ireland to use HMS. On July 15, 2021, Huawei expanded HMS with classic HarmonyOS dual-framework that provided Java support and eventually with JavaScript and ArkTS (eTS) language support with HMS Core 6.0 for app development with primarily Android apps, alongside limited HAP imperative developed based apps that shares AOSP file system libraries in all types of devices from smartphones, tablets, smart screens, smartwatches, and car machines. Including various third-party development frameworks, such as React Native, Cordova, etc. At HDC 2023, Huawei unveiled HarmonyOS 5, marking a total break from the hybrid Android derived platform. This shift replaced the legacy Android and classic HarmonyOS-based HMS SDK with a full native API developer kit SDK built solely on OpenHarmony. The architecture moved from middleware services to vertical integration path. In this new model, HMS Core libraries are no longer external add-ons but are bundled directly into the system and DevEco Studio as native HarmonyOS Kits. == HMS Core == HMS Core is a hub for Huawei Mobile Services and serves as a toolkit for app development on Huawei devices. The core comprises Development, Growth and Monetizing and was created as a replacement for Google Mobile Services (GMS) Core. HMS core services were available in more than 55,000 apps in June 2020; HMS Core 5.0 debuted in September 2020. HMS Core 6.0 was launched in June 2021 with extended support for Huawei Cloud services. In June 2021, the number of registered developers within the HMS ecosystem was 4 million, and the number of apps integrated with the HMS Core had reached 134,000. As of July 2022, registered developers within HMS ecosystem had grown to 5 million, and the number of apps integrated with the HMS Core reached 203,000. The number of apps had grown to 220,000 by 30 September 2022. == AppGallery == The AppGallery has a key rival, Google's Play Store on Android. The AppGallery is available in 170 countries, across 78 languages. == Reception == The reception of HMS is mixed, with the majority of discussion based around the key Google/Android apps which are not yet present on the AppGallery and whether or not this presents a significant problem to users. The open development of HMS Core has been regarded by some as benefiting the Android project as a whole, "If Huawei continues to invest in a holistically open approach ... the result could be that we could all end up a bit less beholden to Google".

    Read more →
  • AdaBoost

    AdaBoost

    AdaBoost (short for Adaptive Boosting) is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many types of learning algorithm to improve performance. The output of multiple weak learners is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals of real values. AdaBoost is adaptive in the sense that subsequent weak learners (models) are adjusted in favor of instances misclassified by previous models. In some problems, it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner. Although AdaBoost is typically used to combine weak base learners (such as decision stumps), it has been shown to also effectively combine strong base learners (such as deeper decision trees), producing an even more accurate model. Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm such that later trees tend to focus on harder-to-classify examples. == Training == AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form F T ( x ) = ∑ t = 1 T f t ( x ) {\displaystyle F_{T}(x)=\sum _{t=1}^{T}f_{t}(x)} where each f t {\displaystyle f_{t}} is a weak learner that takes an object x {\displaystyle x} as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification. Each weak learner produces an output hypothesis h {\displaystyle h} which fixes a prediction h ( x i ) {\displaystyle h(x_{i})} for each sample in the training set. At each iteration t {\displaystyle t} , a weak learner is selected and assigned a coefficient α t {\displaystyle \alpha _{t}} such that the total training error E t {\displaystyle E_{t}} of the resulting t {\displaystyle t} -stage boosted classifier is minimized. E t = ∑ i E [ F t − 1 ( x i ) + α t h ( x i ) ] {\displaystyle E_{t}=\sum _{i}E[F_{t-1}(x_{i})+\alpha _{t}h(x_{i})]} Here F t − 1 ( x ) {\displaystyle F_{t-1}(x)} is the boosted classifier that has been built up to the previous stage of training and f t ( x ) = α t h ( x ) {\displaystyle f_{t}(x)=\alpha _{t}h(x)} is the weak learner that is being considered for addition to the final classifier. === Weighting === At each iteration of the training process, a weight w i , t {\displaystyle w_{i,t}} is assigned to each sample in the training set equal to the current error E ( F t − 1 ( x i ) ) {\displaystyle E(F_{t-1}(x_{i}))} on that sample. These weights can be used in the training of the weak learner. For instance, decision trees can be grown which favor the splitting of sets of samples with large weights. == Derivation == This derivation follows Rojas (2009): Suppose we have a data set { ( x 1 , y 1 ) , … , ( x N , y N ) } {\displaystyle \{(x_{1},y_{1}),\ldots ,(x_{N},y_{N})\}} where each item x i {\displaystyle x_{i}} has an associated class y i ∈ { − 1 , 1 } {\displaystyle y_{i}\in \{-1,1\}} , and a set of weak classifiers { k 1 , … , k L } {\displaystyle \{k_{1},\ldots ,k_{L}\}} each of which outputs a classification k j ( x i ) ∈ { − 1 , 1 } {\displaystyle k_{j}(x_{i})\in \{-1,1\}} for each item. After the ( m − 1 ) {\displaystyle (m-1)} -th iteration our boosted classifier is a linear combination of the weak classifiers of the form: C ( m − 1 ) ( x i ) = α 1 k 1 ( x i ) + ⋯ + α m − 1 k m − 1 ( x i ) , {\displaystyle C_{(m-1)}(x_{i})=\alpha _{1}k_{1}(x_{i})+\cdots +\alpha _{m-1}k_{m-1}(x_{i}),} where the class will be the sign of C ( m − 1 ) ( x i ) {\displaystyle C_{(m-1)}(x_{i})} . At the m {\displaystyle m} -th iteration we want to extend this to a better boosted classifier by adding another weak classifier k m {\displaystyle k_{m}} , with another weight α m {\displaystyle \alpha _{m}} : C m ( x i ) = C ( m − 1 ) ( x i ) + α m k m ( x i ) {\displaystyle C_{m}(x_{i})=C_{(m-1)}(x_{i})+\alpha _{m}k_{m}(x_{i})} So it remains to determine which weak classifier is the best choice for k m {\displaystyle k_{m}} , and what its weight α m {\displaystyle \alpha _{m}} should be. We define the total error E {\displaystyle E} of C m {\displaystyle C_{m}} as the sum of its exponential loss on each data point, given as follows: E = ∑ i = 1 N e − y i C m ( x i ) = ∑ i = 1 N e − y i C ( m − 1 ) ( x i ) e − y i α m k m ( x i ) {\displaystyle E=\sum _{i=1}^{N}e^{-y_{i}C_{m}(x_{i})}=\sum _{i=1}^{N}e^{-y_{i}C_{(m-1)}(x_{i})}e^{-y_{i}\alpha _{m}k_{m}(x_{i})}} Letting w i ( 1 ) = 1 {\displaystyle w_{i}^{(1)}=1} and w i ( m ) = e − y i C m − 1 ( x i ) {\displaystyle w_{i}^{(m)}=e^{-y_{i}C_{m-1}(x_{i})}} for m > 1 {\displaystyle m>1} , we have: E = ∑ i = 1 N w i ( m ) e − y i α m k m ( x i ) {\displaystyle E=\sum _{i=1}^{N}w_{i}^{(m)}e^{-y_{i}\alpha _{m}k_{m}(x_{i})}} We can split this summation between those data points that are correctly classified by k m {\displaystyle k_{m}} (so y i k m ( x i ) = 1 {\displaystyle y_{i}k_{m}(x_{i})=1} ) and those that are misclassified (so y i k m ( x i ) = − 1 {\displaystyle y_{i}k_{m}(x_{i})=-1} ): E = ∑ y i = k m ( x i ) w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) e α m = ∑ i = 1 N w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) ( e α m − e − α m ) {\displaystyle {\begin{aligned}E&=\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha _{m}}\\&=\sum _{i=1}^{N}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}\left(e^{\alpha _{m}}-e^{-\alpha _{m}}\right)\end{aligned}}} Since the only part of the right-hand side of this equation that depends on k m {\displaystyle k_{m}} is ∑ y i ≠ k m ( x i ) w i ( m ) {\textstyle \sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}} , we see that the k m {\displaystyle k_{m}} that minimizes E {\displaystyle E} is the one in the set { k 1 , … , k L } {\displaystyle \{k_{1},\ldots ,k_{L}\}} that minimizes ∑ y i ≠ k m ( x i ) w i ( m ) {\textstyle \sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}} [assuming that α m > 0 {\displaystyle \alpha _{m}>0} ], i.e. the weak classifier with the lowest weighted error (with weights w i ( m ) = e − y i C m − 1 ( x i ) {\displaystyle w_{i}^{(m)}=e^{-y_{i}C_{m-1}(x_{i})}} ). To determine the desired weight α m {\displaystyle \alpha _{m}} that minimizes E {\displaystyle E} with the k m {\displaystyle k_{m}} that we just determined, we differentiate: d E d α m = d ( ∑ y i = k m ( x i ) w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) e α m ) d α m {\displaystyle {\frac {dE}{d\alpha _{m}}}={\frac {d(\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha _{m}})}{d\alpha _{m}}}} The value of α m {\displaystyle \alpha _{m}} that minimizes the above expression is: α m = 1 2 ln ⁡ ( ∑ y i = k m ( x i ) w i ( m ) ∑ y i ≠ k m ( x i ) w i ( m ) ) {\displaystyle \alpha _{m}={\frac {1}{2}}\ln \left({\frac {\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}}{\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}}\right)} We calculate the weighted error rate of the weak classifier to be ϵ m = ∑ y i ≠ k m ( x i ) w i ( m ) ∑ i = 1 N w i ( m ) {\displaystyle \epsilon _{m}={\frac {\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}{\sum _{i=1}^{N}w_{i}^{(m)}}}} , so it follows that: α m = 1 2 ln ⁡ ( 1 − ϵ m ϵ m ) {\displaystyle \alpha _{m}={\frac {1}{2}}\ln \left({\frac {1-\epsilon _{m}}{\epsilon _{m}}}\right)} which is the negative logit function multiplied by 0.5. Due to the convexity of E {\displaystyle E} as a function of α m {\displaystyle \alpha _{m}} , this new expression for α m {\displaystyle \alpha _{m}} gives the global minimum of the loss function. Note: This derivation only applies when k m ( x i ) ∈ { − 1 , 1 } {\displaystyle k_{m}(x_{i})\in \{-1,1\}} , though it can be a good starting guess in other cases, such as when the weak learner is biased ( k m ( x ) ∈ { a , b } , a ≠ − b {\displaystyle k_{m}(x)\in \{a,b\},a\neq -b} ), has multiple leaves ( k m ( x ) ∈ { a , b , … , n } {\displaystyle k_{m}(x)\in \{a,b,\dots ,n\}} ) or is some other function k m ( x ) ∈ R {\displaystyle k_{m}(x)\in \mathbb {R} } . Thus we have derived the AdaBoost algorithm: At each

    Read more →
  • Principal component analysis

    Principal component analysis

    Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data are linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified. The principal components of a collection of points in a real coordinate space are a sequence of p {\displaystyle p} unit vectors, where the i {\displaystyle i} -th vector is the direction of a line that best fits the data while being orthogonal to the first i − 1 {\displaystyle i-1} vectors. Here, a best-fitting line is defined as one that minimizes the average squared perpendicular distance from the points to the line. These directions (i.e., principal components) constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Many studies use the first two principal components in order to plot the data in two dimensions and to visually identify clusters of closely related data points. Principal component analysis has applications in many fields such as population genetics, microbiome studies, and atmospheric science. == Overview == When performing PCA, the first principal component of a set of p {\displaystyle p} variables is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through p {\displaystyle p} iterations until all the variance is explained. PCA is most commonly used when many of the variables are highly correlated with each other and it is desirable to reduce their number to an independent set. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i {\displaystyle i} -th principal component can be taken as a direction orthogonal to the first i − 1 {\displaystyle i-1} principal components that maximizes the variance of the projected data. For either objective, it can be shown that the principal components are eigenvectors of the data's covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset. Robust and L1-norm-based variants of standard PCA have also been proposed. == History == PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, proper orthogonal decomposition (POD) in mechanical engineering, singular value decomposition (SVD) of X (invented in the last quarter of the 19th century), eigenvalue decomposition (EVD) of XTX in linear algebra, factor analysis (for a discussion of the differences between PCA and factor analysis see Ch. 7 of Jolliffe's Principal Component Analysis), Eckart–Young theorem (Harman, 1960), or empirical orthogonal functions (EOF) in meteorological science (Lorenz, 1956), empirical eigenfunction decomposition (Sirovich, 1987), quasiharmonic modes (Brooks et al., 1988), spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics. == Intuition == PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small. To find the axes of the ellipsoid, we must first center the values of each variable in the dataset on 0 by subtracting the mean of the variable's observed values from each of those values. These transformed values are used instead of the original observed values for each of the variables. Then, we compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we must normalize each of the orthogonal eigenvectors to turn them into unit vectors. Once this is done, each of the mutually-orthogonal unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This choice of basis will transform the covariance matrix into a diagonalized form, in which the diagonal elements represent the variance of each axis. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues. Biplots and scree plots (degree of explained variance) are used to interpret findings of the PCA. == Details == PCA is defined as an orthogonal linear transformation on a real inner product space that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. Consider an n × p {\displaystyle n\times p} data matrix, X, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of feature (say, the results from a particular sensor). Mathematically, the transformation is defined by a set of size l {\displaystyle l} (where l {\displaystyle l} is usually selected to be strictly less than p {\displaystyle p} to reduce dimensionality) of p {\displaystyle p} -dimensional vectors of weights or coefficients w ( k ) = ( w 1 , … , w p ) ( k ) {\displaystyle \mathbf {w} _{(k)}=(w_{1},\dots ,w_{p})_{(k)}} that map each row vector x ( i ) = ( x 1 , … , x p ) ( i ) {\displaystyle \mathbf {x} _{(i)}=(x_{1},\dots ,x_{p})_{(i)}} of X to a new vector of principal component scores t ( i ) = ( t 1 , … , t l ) ( i ) {\displaystyle \mathbf {t} _{(i)}=(t_{1},\dots ,t_{l})_{(i)}} , given by t k ( i ) = x ( i ) ⋅ w ( k ) f o r i = 1 , … , n k = 1 , … , l {\displaystyle {t_{k}}_{(i)}=\mathbf {x} _{(i)}\cdot \mathbf {w} _{(k)}\qquad \mathrm {for} \qquad i=1,\dots ,n\qquad k=1,\dots ,l} in such a way that the individual variables t 1 , … , t l {\displaystyle t_{1},\dots ,t_{l}} of t considered over the data set successively inherit the maximum possible variance from X, with each coefficient vector w constrained to be a unit vector. The above may equivalently be written in matrix form as T = X W {\displaystyle \mathbf {T} =\mathbf {X} \mathbf {W} } where T i k = t k ( i ) {\displaystyle {\mathbf {T} }_{ik}={t_{k}}_{(i)}} , X i j = x j ( i ) {\displaystyle {\mathbf {X} }_{ij}={x_{j}}_{(i)}} , and W j k = w j ( k ) {\displaystyle {\mathbf {W} }_{jk}={w_{j}}_{(k)}} . === First component === In order to maximize variance, the first weight vector w(1) thus has to satisfy w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( t 1 ) ( i ) 2 } = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( x ( i ) ⋅ w ) 2 } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}(t_{1})_{(i)}^{2}\right\}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}\left(\mathbf {x} _{(i)}\cdot \mathbf {w} \right)^{2}\right\}} Equivalently, writing this in matrix form gives w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ‖ X w ‖ 2 } = arg ⁡ max ‖ w ‖ = 1 { w T X T X w } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\left\|\mathbf {Xw} \right\|^{2}\right\}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} \right\}} Since w(1) has been defined to be a unit vector, it equivalently also satisfies w ( 1 ) = arg ⁡ max { w T X T X w w T w } {\displaystyle \mathbf {w} _{(1)}=\arg \max \left\{{\frac {\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} }{\mathbf {w} ^{\mathsf {T}}\mathbf {w} }}\right\}} The quantity to be maximised can be recognised as a Rayleigh quotient. A standard result for a positive semidefinite matrix such as XTX is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector. With w(1) found, the first principal component of a data vector

    Read more →
  • Relief (feature selection)

    Relief (feature selection)

    Relief is an algorithm developed by Kenji Kira and Larry Rendell in 1992 that takes a filter-method approach to feature selection that is notably sensitive to feature interactions. It was originally designed for application to binary classification problems with discrete or numerical features. Relief calculates a feature score for each feature which can then be applied to rank and select top scoring features for feature selection. Alternatively, these scores may be applied as feature weights to guide downstream modeling. Relief feature scoring is based on the identification of feature value differences between nearest neighbor instance pairs. If a feature value difference is observed in a neighboring instance pair with the same class (a 'hit'), the feature score decreases. Alternatively, if a feature value difference is observed in a neighboring instance pair with different class values (a 'miss'), the feature score increases. The original Relief algorithm has since inspired a family of Relief-based feature selection algorithms (RBAs), including the ReliefF algorithm. Beyond the original Relief algorithm, RBAs have been adapted to (1) perform more reliably in noisy problems, (2) generalize to multi-class problems (3) generalize to numerical outcome (i.e. regression) problems, and (4) to make them robust to incomplete (i.e. missing) data. To date, the development of RBA variants and extensions has focused on four areas; (1) improving performance of the 'core' Relief algorithm, i.e. examining strategies for neighbor selection and instance weighting, (2) improving scalability of the 'core' Relief algorithm to larger feature spaces through iterative approaches, (3) methods for flexibly adapting Relief to different data types, and (4) improving Relief run efficiency. Their strengths are that they are not dependent on heuristics, they run in low-order polynomial time, and they are noise-tolerant and robust to feature interactions, as well as being applicable for binary or continuous data; however, it does not discriminate between redundant features, and low numbers of training instances fool the algorithm. == Relief Algorithm == Take a data set with n instances of p features, belonging to two known classes. Within the data set, each feature should be scaled to the interval [0 1] (binary data should remain as 0 and 1). The algorithm will be repeated m times. Start with a p-long weight vector (W) of zeros. At each iteration, take the feature vector (X) belonging to one random instance, and the feature vectors of the instance closest to X (by Euclidean distance) from each class. The closest same-class instance is called 'near-hit', and the closest different-class instance is called 'near-miss'. Update the weight vector such that W i = W i − ( x i − n e a r H i t i ) 2 + ( x i − n e a r M i s s i ) 2 , {\displaystyle W_{i}=W_{i}-(x_{i}-\mathrm {nearHit} _{i})^{2}+(x_{i}-\mathrm {nearMiss} _{i})^{2},} where i {\displaystyle i} indexes the components and runs from 1 to p. Thus the weight of any given feature decreases if it differs from that feature in nearby instances of the same class more than nearby instances of the other class, and increases in the reverse case. After m iterations, divide each element of the weight vector by m. This becomes the relevance vector. Features are selected if their relevance is greater than a threshold τ. Kira and Rendell's experiments showed a clear contrast between relevant and irrelevant features, allowing τ to be determined by inspection. However, it can also be determined by Chebyshev's inequality for a given confidence level (α) that a τ of 1/sqrt(αm) is good enough to make the probability of a Type I error less than α, although it is stated that τ can be much smaller than that. Relief was also described as generalizable to multinomial classification by decomposition into a number of binary problems. == ReliefF Algorithm == Kononenko et al. propose a number of updates to Relief. Firstly, they find the near-hit and near-miss instances using the Manhattan (L1) norm rather than the Euclidean (L2) norm, although the rationale is not specified. Furthermore, they found taking the absolute differences between xi and near-hiti, and xi and near-missi to be sufficient when updating the weight vector (rather than the square of those differences). === Reliable probability estimation === Rather than repeating the algorithm m times, implement it exhaustively (i.e. n times, once for each instance) for relatively small n (up to one thousand). Furthermore, rather than finding the single nearest hit and single nearest miss, which may cause redundant and noisy attributes to affect the selection of the nearest neighbors, ReliefF searches for k nearest hits and misses and averages their contribution to the weights of each feature. k can be tuned for any individual problem. === Incomplete data === In ReliefF, the contribution of missing values to the feature weight is determined using the conditional probability that two values should be the same or different, approximated with relative frequencies from the data set. This can be calculated if one or both features are missing. === Multi-class problems === Rather than use Kira and Rendell's proposed decomposition of a multinomial classification into a number of binomial problems, ReliefF searches for k near misses from each different class and averages their contributions for updating W, weighted with the prior probability of each class. == Other Relief-based Algorithm Extensions/Derivatives == The following RBAs are arranged chronologically from oldest to most recent. They include methods for improving (1) the core Relief algorithm concept, (2) iterative approaches for scalability, (3) adaptations to different data types, (4) strategies for computational efficiency, or (5) some combination of these goals. For more on RBAs see these book chapters or this most recent review paper. === RRELIEFF === Robnik-Šikonja and Kononenko propose further updates to ReliefF, making it appropriate for regression. === Relieved-F === Introduced deterministic neighbor selection approach and a new approach for incomplete data handling. === Iterative Relief === Implemented method to address bias against non-monotonic features. Introduced the first iterative Relief approach. For the first time, neighbors were uniquely determined by a radius threshold and instances were weighted by their distance from the target instance. === I-RELIEF === Introduced sigmoidal weighting based on distance from target instance. All instance pairs (not just a defined subset of neighbors) contributed to score updates. Proposed an on-line learning variant of Relief. Extended the iterative Relief concept. Introduced local-learning updates between iterations for improved convergence. === TuRF (a.k.a. Tuned ReliefF) === Specifically sought to address noise in large feature spaces through the recursive elimination of features and the iterative application of ReliefF. === Evaporative Cooling ReliefF === Similarly seeking to address noise in large feature spaces. Utilized an iterative `evaporative' removal of lowest quality features using ReliefF scores in association with mutual information. === EReliefF (a.k.a. Extended ReliefF) === Addressing issues related to incomplete and multi-class data. === VLSReliefF (a.k.a. Very Large Scale ReliefF) === Dramatically improves the efficiency of detecting 2-way feature interactions in very large feature spaces by scoring random feature subsets rather than the entire feature space. === ReliefMSS === Introduced calculation of feature weights relative to average feature 'diff' between instance pairs. === SURF === SURF identifies nearest neighbors (both hits and misses) based on a distance threshold from the target instance defined by the average distance between all pairs of instances in the training data. Results suggest improved power to detect 2-way epistatic interactions over ReliefF. === SURF (a.k.a. SURFStar) === SURF extends the SURF algorithm to not only utilized 'near' neighbors in scoring updates, but 'far' instances as well, but employing inverted scoring updates for 'far instance pairs. Results suggest improved power to detect 2-way epistatic interactions over SURF, but an inability to detect simple main effects (i.e. univariate associations). === SWRF === SWRF extends the SURF algorithm adopting sigmoid weighting to take distance from the threshold into account. Also introduced a modular framework for further developing RBAs called MoRF. === MultiSURF (a.k.a. MultiSURFStar) === MultiSURF extends the SURF algorithm adapting the near/far neighborhood boundaries based on the average and standard deviation of distances from the target instance to all others. MultiSURF uses the standard deviation to define a dead-band zone where 'middle-distance' instances do not contribute to scoring. Evidence suggests MultiSURF performs best in detecting pure 2-way feature interactions. === Reli

    Read more →
  • Content as a service

    Content as a service

    Content as a service (CaaS) or managed content as a service (MCaaS) is a service-oriented model, where the service provider delivers the content on demand to the service consumer via web services that are licensed under subscription. The content is hosted by the service provider centrally in the cloud and offered to a number of consumers that need the content delivered into any applications or system, hence content can be demanded by the consumers as and when required. Content as a Service is a way to provide raw content (in other words, without the need for a specific human compatible representation, such as HTML) in a way that other systems can make use of it. Content as a Service is not meant for direct human consumption, but rather for other platforms to consume and make use of the content according to their particular needs. This happens usually on the cloud, with a centralized platform which can be globally accessible and provides a standard format for your content. With Content as a Service, you centralize your content into a single repository, where you can manage it, categorize it, make it available to others, search for it, or do whatever you wish with it. == Overview == The content delivered typically could be one or more of the following The technical terminology related to equipment or spares that is required to procure or design the materials The industrial terminology of the equipment or spares Technical values pertaining to various types, specifications, applications, characteristics of equipment or spares Sourcing information which will help in procurement or supply-chain management of equipment or spares Descriptive specifications of equipment or spares based on the product reference number or identifier UNSPSC codes or industry practiced classifications ISO, IEC compliant terminology Ontology or Technical Dictionary of products & services Predefined content for specific business needs The term "Content as a service" (CaaS) is considered to be part of the nomenclature of cloud computing service models & Service-oriented architecture along with Software as a service (SaaS), Infrastructure as a service (IaaS), and Platform as a service (PaaS).

    Read more →
  • Radial basis function

    Radial basis function

    In mathematics a radial basis function (RBF) is a real-valued function φ {\textstyle \varphi } whose value depends only on the distance between the input and some fixed point, either the origin, so that φ ( x ) = φ ^ ( ‖ x ‖ ) {\textstyle \varphi (\mathbf {x} )={\hat {\varphi }}(\left\|\mathbf {x} \right\|)} , or some other fixed point c {\textstyle \mathbf {c} } , called a center, so that φ ( x ) = φ ^ ( ‖ x − c ‖ ) {\textstyle \varphi (\mathbf {x} )={\hat {\varphi }}(\left\|\mathbf {x} -\mathbf {c} \right\|)} . Any function φ {\textstyle \varphi } that satisfies the property φ ( x ) = φ ^ ( ‖ x ‖ ) {\textstyle \varphi (\mathbf {x} )={\hat {\varphi }}(\left\|\mathbf {x} \right\|)} is a radial function. The distance is usually Euclidean distance, although other metrics are sometimes used. They are often used as a collection { φ k } k {\displaystyle \{\varphi _{k}\}_{k}} which forms a basis for some function space of interest, hence the name. Sums of radial basis functions are typically used to approximate given functions. This approximation process can also be interpreted as a simple kind of neural network; this was the context in which they were originally applied to machine learning, in work by David Broomhead and David Lowe in 1988, which stemmed from Michael J. D. Powell's seminal research from 1977. RBFs are also used as a kernel in support vector classification. The technique has proven effective and flexible enough that radial basis functions are now applied in a variety of engineering applications. == Definition == A radial function is a function φ : [ 0 , ∞ ) → R {\textstyle \varphi :[0,\infty )\to \mathbb {R} } . When paired with a norm ‖ ⋅ ‖ : V → [ 0 , ∞ ) {\textstyle \|\cdot \|:V\to [0,\infty )} on a vector space, a function of the form φ c = φ ( ‖ x − c ‖ ) {\textstyle \varphi _{\mathbf {c} }=\varphi (\|\mathbf {x} -\mathbf {c} \|)} is said to be a radial kernel centered at c ∈ V {\textstyle \mathbf {c} \in V} . A radial function and the associated radial kernels are said to be radial basis functions if, for any finite set of nodes { x k } k = 1 n ⊆ V {\displaystyle \{\mathbf {x} _{k}\}_{k=1}^{n}\subseteq V} , all of the following conditions are true: === Examples === Commonly used types of radial basis functions include (writing r = ‖ x − x i ‖ {\textstyle r=\left\|\mathbf {x} -\mathbf {x} _{i}\right\|} and using ε {\textstyle \varepsilon } to indicate a shape parameter that can be used to scale the input of the radial kernel): == Approximation == Radial basis functions are typically used to build up function approximations of the form where the approximating function y ( x ) {\textstyle y(\mathbf {x} )} is represented as a sum of N {\displaystyle N} radial basis functions, each associated with a different center x i {\textstyle \mathbf {x} _{i}} , and weighted by an appropriate coefficient w i . {\textstyle w_{i}.} The weights w i {\textstyle w_{i}} can be estimated using the matrix methods of linear least squares, because the approximating function is linear in the weights w i {\textstyle w_{i}} . Approximation schemes of this kind have been particularly used in time series prediction and control of nonlinear systems exhibiting sufficiently simple chaotic behaviour and 3D reconstruction in computer graphics (for example, hierarchical RBF and Pose Space Deformation). == RBF Network == The sum can also be interpreted as a rather simple single-layer type of artificial neural network called a radial basis function network, with the radial basis functions taking on the role of the activation functions of the network. It can be shown that any continuous function on a compact interval can in principle be interpolated with arbitrary accuracy by a sum of this form, if a sufficiently large number N {\textstyle N} of radial basis functions is used. The approximant y ( x ) {\textstyle y(\mathbf {x} )} is differentiable with respect to the weights w i {\textstyle w_{i}} . The weights could thus be learned using any of the standard iterative methods for neural networks. Using radial basis functions in this manner yields a reasonable interpolation approach provided that the fitting set has been chosen such that it covers the entire range systematically (equidistant data points are ideal). However, without a polynomial term that is orthogonal to the radial basis functions, estimates outside the fitting set tend to perform poorly. == RBFs for PDEs == Radial basis functions are used to approximate functions and so can be used to discretize and numerically solve Partial Differential Equations (PDEs). This was first done in 1990 by E. J. Kansa who developed the first RBF based numerical method. It is called the Kansa method and was used to solve the elliptic Poisson equation and the linear advection-diffusion equation. The function values at points x {\displaystyle \mathbf {x} } in the domain are approximated by the linear combination of RBFs: The derivatives are approximated as such: where N {\displaystyle N} are the number of points in the discretized domain, d {\displaystyle d} the dimension of the domain and λ {\displaystyle \lambda } the scalar coefficients that are unchanged by the differential operator. Different numerical methods based on Radial Basis Functions were developed thereafter. Some methods are the RBF-FD method, the RBF-QR method and the RBF-PUM method.

    Read more →
  • Policy gradient method

    Policy gradient method

    Policy gradient methods are a class of reinforcement learning algorithms and a sub-class of policy optimization methods. Unlike value-based methods which learn a value function to derive a policy, policy optimization methods directly learn a policy function π {\displaystyle \pi } that selects actions without consulting a value function. For policy gradient to apply, the policy function π θ {\displaystyle \pi _{\theta }} is parameterized by a differentiable parameter θ {\displaystyle \theta } . == Overview == In policy-based RL, the actor is a parameterized policy function π θ {\displaystyle \pi _{\theta }} , where θ {\displaystyle \theta } are the parameters of the actor. The actor takes as argument the state of the environment s {\displaystyle s} and produces a probability distribution π θ ( ⋅ ∣ s ) {\displaystyle \pi _{\theta }(\cdot \mid s)} . If the action space is discrete, then ∑ a π θ ( a ∣ s ) = 1 {\displaystyle \sum _{a}\pi _{\theta }(a\mid s)=1} . If the action space is continuous, then ∫ a π θ ( a ∣ s ) d a = 1 {\displaystyle \int _{a}\pi _{\theta }(a\mid s)\mathrm {d} a=1} . The goal of policy optimization is to find some θ {\displaystyle \theta } that maximizes the expected episodic reward J ( θ ) {\displaystyle J(\theta )} : J ( θ ) = E π θ [ ∑ t = 0 T γ t R t | S 0 = s 0 ] {\displaystyle J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\gamma ^{t}R_{t}{\Big |}S_{0}=s_{0}\right]} where γ {\displaystyle \gamma } is the discount factor, R t {\displaystyle R_{t}} is the reward at step t {\displaystyle t} , s 0 {\displaystyle s_{0}} is the starting state, and T {\displaystyle T} is the time-horizon (which can be infinite). The policy gradient is defined as ∇ θ J ( θ ) {\displaystyle \nabla _{\theta }J(\theta )} . Different policy gradient methods stochastically estimate the policy gradient in different ways. The goal of any policy gradient method is to iteratively maximize J ( θ ) {\displaystyle J(\theta )} by gradient ascent. Since the key part of any policy gradient method is the stochastic estimation of the policy gradient, they are also studied under the title of "Monte Carlo gradient estimation". == REINFORCE == === Policy gradient === The REINFORCE algorithm, introduced by Ronald J. Williams in 1992, was the first policy gradient method. It is based on the identity for the policy gradient ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln ⁡ π θ ( A t ∣ S t ) ∑ t = 0 T ( γ t R t ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}\mid S_{t})\;\sum _{t=0}^{T}(\gamma ^{t}R_{t}){\Big |}S_{0}=s_{0}\right]} which can be improved via the "causality trick" ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln ⁡ π θ ( A t ∣ S t ) ∑ τ = t T ( γ τ R τ ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}\mid S_{t})\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau }){\Big |}S_{0}=s_{0}\right]} Thus, we have an unbiased estimator of the policy gradient: ∇ θ J ( θ ) ≈ 1 N ∑ n = 1 N [ ∑ t = 0 T ∇ θ ln ⁡ π θ ( A t , n ∣ S t , n ) ∑ τ = t T ( γ τ − t R τ , n ) ] {\displaystyle \nabla _{\theta }J(\theta )\approx {\frac {1}{N}}\sum _{n=1}^{N}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t,n}\mid S_{t,n})\sum _{\tau =t}^{T}(\gamma ^{\tau -t}R_{\tau ,n})\right]} where the index n {\displaystyle n} ranges over N {\displaystyle N} rollout trajectories using the policy π θ {\displaystyle \pi _{\theta }} . The score function ∇ θ ln ⁡ π θ ( A t ∣ S t ) {\displaystyle \nabla _{\theta }\ln \pi _{\theta }(A_{t}\mid S_{t})} can be interpreted as the direction in the parameter space that increases the probability of taking action A t {\displaystyle A_{t}} in state S t {\displaystyle S_{t}} . The policy gradient, then, is a weighted average of all possible directions to increase the probability of taking any action in any state, but weighted by reward signals, so that if taking a certain action in a certain state is associated with high reward, then that direction would be highly reinforced, and vice versa. === Algorithm === The REINFORCE algorithm is a loop: Rollout N {\displaystyle N} trajectories in the environment, using π θ t {\displaystyle \pi _{\theta _{t}}} as the policy function. Compute the policy gradient estimation: g i ← 1 N ∑ n = 1 N [ ∑ t = 0 T ∇ θ t ln ⁡ π θ ( A t , n ∣ S t , n ) ∑ τ = t T ( γ τ R τ , n ) ] {\displaystyle g_{i}\leftarrow {\frac {1}{N}}\sum _{n=1}^{N}\left[\sum _{t=0}^{T}\nabla _{\theta _{t}}\ln \pi _{\theta }(A_{t,n}\mid S_{t,n})\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau ,n})\right]} Update the policy by gradient ascent: θ i + 1 ← θ i + α i g i {\displaystyle \theta _{i+1}\leftarrow \theta _{i}+\alpha _{i}g_{i}} Here, α i {\displaystyle \alpha _{i}} is the learning rate at update step i {\displaystyle i} . == Variance reduction == REINFORCE is an on-policy algorithm, meaning that the trajectories used for the update must be sampled from the current policy π θ {\displaystyle \pi _{\theta }} . This can lead to high variance in the updates, as the returns R ( τ ) {\displaystyle R(\tau )} can vary significantly between trajectories. Many variants of REINFORCE have been introduced, under the title of variance reduction. === REINFORCE with baseline === A common way for reducing variance is the REINFORCE with baseline algorithm, based on the following identity: ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln ⁡ π θ ( A t | S t ) ( ∑ τ = t T ( γ τ R τ ) − b ( S t ) ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\left(\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau })-b(S_{t})\right){\Big |}S_{0}=s_{0}\right]} for any function b : States → R {\displaystyle b:{\text{States}}\to \mathbb {R} } . This can be proven by applying the previous lemma. The algorithm uses the modified gradient estimator g i ← 1 N ∑ n = 1 N [ ∑ t = 0 T ∇ θ t ln ⁡ π θ ( A t , n | S t , n ) ( ∑ τ = t T ( γ τ R τ , n ) − b i ( S t , n ) ) ] {\displaystyle g_{i}\leftarrow {\frac {1}{N}}\sum _{n=1}^{N}\left[\sum _{t=0}^{T}\nabla _{\theta _{t}}\ln \pi _{\theta }(A_{t,n}|S_{t,n})\left(\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau ,n})-b_{i}(S_{t,n})\right)\right]} and the original REINFORCE algorithm is the special case where b i ≡ 0 {\displaystyle b_{i}\equiv 0} . === Actor-critic methods === If b i {\textstyle b_{i}} is chosen well, such that b i ( S t ) ≈ ∑ τ = t T ( γ τ R τ ) = γ t V π θ i ( S t ) {\textstyle b_{i}(S_{t})\approx \sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau })=\gamma ^{t}V^{\pi _{\theta _{i}}}(S_{t})} , this could significantly decrease variance in the gradient estimation. That is, the baseline should be as close to the value function V π θ i ( S t ) {\displaystyle V^{\pi _{\theta _{i}}}(S_{t})} as possible, approaching the ideal of: ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln ⁡ π θ ( A t | S t ) ( ∑ τ = t T ( γ τ R τ ) − γ t V π θ ( S t ) ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\left(\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau })-\gamma ^{t}V^{\pi _{\theta }}(S_{t})\right){\Big |}S_{0}=s_{0}\right]} Note that, as the policy π θ t {\displaystyle \pi _{\theta _{t}}} updates, the value function V π θ i ( S t ) {\displaystyle V^{\pi _{\theta _{i}}}(S_{t})} updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the actor-critic methods, where the policy function is the actor and the value function is the critic. The Q-function Q π {\displaystyle Q^{\pi }} can also be used as the critic, since ∇ θ J ( θ ) = E π θ [ ∑ 0 ≤ t ≤ T γ t ∇ θ ln ⁡ π θ ( A t | S t ) ⋅ Q π θ ( S t , A t ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=E_{\pi _{\theta }}\left[\sum _{0\leq t\leq T}\gamma ^{t}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\cdot Q^{\pi _{\theta }}(S_{t},A_{t}){\Big |}S_{0}=s_{0}\right]} by a similar argument using the tower law. Subtracting the value function as a baseline, we find that the advantage function A π ( S , A ) = Q π ( S , A ) − V π ( S ) {\displaystyle A^{\pi }(S,A)=Q^{\pi }(S,A)-V^{\pi }(S)} can be used as the critic as well: ∇ θ J ( θ ) = E π θ [ ∑ 0 ≤ t ≤ T γ t ∇ θ ln ⁡ π θ ( A t | S t ) ⋅ A π θ ( S t , A t ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=E_{\pi _{\theta }}\left[\sum _{0\leq t\leq T}\gamma ^{t}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\cdot A^{\pi _{\theta }}(S_{t},A_{t}){\Big |}S_{0}=s_{0}\right]} In summary, there are many unbiased estimators for ∇ θ J θ {\textstyle \nabla _{\theta }J_{\theta }} , all in the form of: ∇ θ J ( θ ) = E π θ [ ∑ 0 ≤ t ≤ T ∇ θ ln ⁡ π θ ( A t | S t ) ⋅ Ψ t | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=E_{\pi _{\theta }}\left[\su

    Read more →
  • Non-negative matrix factorization

    Non-negative matrix factorization

    Non-negative matrix factorization (NMF or NNMF), also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically. NMF finds applications in such fields as astronomy, computer vision, document clustering, missing data imputation, chemometrics, audio signal processing, recommender systems, and bioinformatics. == History == In chemometrics non-negative matrix factorization has a long history under the name "self modeling curve resolution". In this framework the vectors in the right matrix are continuous curves rather than discrete vectors. Also early work on non-negative matrix factorizations was performed by a Finnish group of researchers in the 1990s under the name positive matrix factorization. It became more widely known as non-negative matrix factorization after Lee and Seung investigated the properties of the algorithm and published some simple and useful algorithms for two types of factorizations. == Background == Let matrix V be the product of the matrices W and H, V = W H . {\displaystyle \mathbf {V} =\mathbf {W} \mathbf {H} \,.} Matrix multiplication can be implemented as computing the column vectors of V as linear combinations of the column vectors in W using coefficients supplied by columns of H. That is, each column of V can be computed as follows: v i = W h i , {\displaystyle \mathbf {v} _{i}=\mathbf {W} \mathbf {h} _{i}\,,} where vi is the i-th column vector of the product matrix V and hi is the i-th column vector of the matrix H. When multiplying matrices, the dimensions of the factor matrices may be significantly lower than those of the product matrix and it is this property that forms the basis of NMF. NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if V is an m × n matrix, W is an m × p matrix, and H is a p × n matrix then p can be significantly less than both m and n. Here is an example based on a text-mining application: Let the input matrix (the matrix to be factored) be V with 10000 rows and 500 columns where words are in rows and documents are in columns. That is, we have 500 documents indexed by 10000 words. It follows that a column vector v in V represents a document. Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns. The product of W and H is a matrix with 10000 rows and 500 columns, the same shape as the input matrix V and, if the factorization worked, it is a reasonable approximation to the input matrix V. From the treatment of matrix multiplication above it follows that each column in the product matrix WH is a linear combination of the 10 column vectors in the features matrix W with coefficients supplied by the coefficients matrix H. This last point is the basis of NMF because we can consider each original document in our example as being built from a small set of hidden features. NMF generates these features. It is useful to think of each feature (column vector) in the features matrix W as a document archetype comprising a set of words where each word's cell value defines the word's rank in the feature: The higher a word's cell value the higher the word's rank in the feature. A column in the coefficients matrix H represents an original document with a cell value defining the document's rank for a feature. We can now reconstruct a document (column vector) from our input matrix by a linear combination of our features (column vectors in W) where each feature is weighted by the feature's cell value from the document's column in H. == Clustering property == NMF has an inherent clustering property, i.e., it automatically clusters the columns of input data V = ( v 1 , … , v n ) {\displaystyle \mathbf {V} =(v_{1},\dots ,v_{n})} . More specifically, the approximation of V {\displaystyle \mathbf {V} } by V ≃ W H {\displaystyle \mathbf {V} \simeq \mathbf {W} \mathbf {H} } is achieved by finding W {\displaystyle W} and H {\displaystyle H} that minimize the error function (using the Frobenius norm) ‖ V − W H ‖ F , {\displaystyle \left\|V-WH\right\|_{F},} subject to W ≥ 0 , H ≥ 0. {\displaystyle W\geq 0,H\geq 0.} , If we furthermore impose an orthogonality constraint on H {\displaystyle \mathbf {H} } , i.e. H H T = I {\displaystyle \mathbf {H} \mathbf {H} ^{T}=I} , then the above minimization is mathematically equivalent to the minimization of K-means clustering. Furthermore, the computed H {\displaystyle H} gives the cluster membership, i.e., if H k j > H i j {\displaystyle \mathbf {H} _{kj}>\mathbf {H} _{ij}} for all i ≠ k, this suggests that the input data v j {\displaystyle v_{j}} belongs to k {\displaystyle k} -th cluster. The computed W {\displaystyle W} gives the cluster centroids, i.e., the k {\displaystyle k} -th column gives the cluster centroid of k {\displaystyle k} -th cluster. This centroid's representation can be significantly enhanced by convex NMF. When the orthogonality constraint H H T = I {\displaystyle \mathbf {H} \mathbf {H} ^{T}=I} is not explicitly imposed, the orthogonality holds to a large extent, and the clustering property holds too. When the error function to be used is Kullback–Leibler divergence, NMF is identical to the probabilistic latent semantic analysis (PLSA), a popular document clustering method. == Types == === Approximate non-negative matrix factorization === Usually the number of columns of W and the number of rows of H in NMF are selected so the product WH will become an approximation to V. The full decomposition of V then amounts to the two non-negative matrices W and H as well as a residual U, such that: V = WH + U. The elements of the residual matrix can either be negative or positive. When W and H are smaller than V they become easier to store and manipulate. Another reason for factorizing V into smaller matrices W and H, is that if one's goal is to approximately represent the elements of V by significantly less data, then one has to infer some latent structure in the data. === Convex non-negative matrix factorization === In standard NMF, matrix factor W ∈ R+m × k, i.e., W can be anything in that space. Convex NMF restricts the columns of W to convex combinations of the input data vectors ( v 1 , … , v n ) {\displaystyle (v_{1},\dots ,v_{n})} . This greatly improves the quality of data representation of W. Furthermore, the resulting matrix factor H becomes more sparse and orthogonal. === Nonnegative rank factorization === In case the nonnegative rank of V is equal to its actual rank, V = WH is called a nonnegative rank factorization (NRF). The problem of finding the NRF of V, if it exists, is known to be NP-hard. === Different cost functions and regularizations === There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between V and WH and possibly by regularization of the W and/or H matrices. Two simple divergence functions studied by Lee and Seung are the squared error (or Frobenius norm) and an extension of the Kullback–Leibler divergence to positive matrices (the original Kullback–Leibler divergence is defined on probability distributions). Each divergence leads to a different NMF algorithm, usually minimizing the divergence using iterative update rules. The factorization problem in the squared error version of NMF may be stated as: Given a matrix V {\displaystyle \mathbf {V} } find nonnegative matrices W and H that minimize the function F ( W , H ) = ‖ V − W H ‖ F 2 {\displaystyle F(\mathbf {W} ,\mathbf {H} )=\left\|\mathbf {V} -\mathbf {WH} \right\|_{F}^{2}} Another type of NMF for images is based on the total variation norm. When L1 regularization (akin to Lasso) is added to NMF with the mean squared error cost function, the resulting problem may be called non-negative sparse coding due to the similarity to the sparse coding problem, although it may also still be referred to as NMF. === Online NMF === Many standard NMF algorithms analyze all the data together; i.e., the whole matrix is available from the start. This may be unsatisfactory in applications where there are too many data to fit into memory or where the data are provided in streaming fashion. One such use is for collaborative filtering in recommendation systems, where there may be many users and many items to recommend, and it would be inefficient to recalculate everything when one user or one item is added to the system. The cost function for o

    Read more →
  • Purged cross-validation

    Purged cross-validation

    Purged cross-validation is a variant of k-fold cross-validation designed to prevent look-ahead bias in time series and other structured data, developed in 2017 by Marcos López de Prado at Guggenheim Partners and Cornell University. It is primarily used in financial machine learning to ensure the independence of training and testing samples when labels depend on future events. It provides an alternative to conventional cross-validation and walk-forward backtesting methods, which often yield overly optimistic performance estimates due to information leakage and overfitting. == Motivation == Standard cross-validation assumes that observations are independently and identically distributed (IID), which often does not hold in time series or financial datasets. If the label of a test sample overlaps in time with the features or labels in the training set, the result may be data leakage and overfitting. Purged cross-validation addresses this issue by removing overlapping observations and, optionally, adding a temporal buffer ("embargo") around the test set to further reduce the risk of leakage. The figure below illustrates standard 5 Fold Cross-Validation == Purging == Purging removes from the training set any observation whose timestamp falls within the time range of formation of a label in the test set. This can be the case for train set observations before and after the test set. Their removal ensures that the algorithm cannot learn during train time information that will be used to assess the performance of the algorithm. See the figure below for an illustration of purging. == Embargoing == Embargoing addresses a more subtle form of leakage: even if an observation does not directly overlap the test set, it may still be affected by test events due to market reaction lag or downstream dependencies. To guard against this, a percentage-based embargo is imposed after each test fold. For example, with a 5% embargo and 1000 observations, the 50 observations following each test fold are excluded from training. Unlike purging, embargoing can only occur after the test set. The figure below illustrates the application of embargo: == Applications == Purged and embargoed cross-validation has been useful in: Backtesting of trading strategies Validation of classifiers on labeled event-driven returns Any machine learning task with overlapping label horizons == Example == To illustrate the effect of purging and embargoing, consider the figures below. Both diagrams show the structure of 5-fold cross-validation over a 20-day period. In each row, blue squares indicate training samples and red squares denote test samples. Each label is defined based on the value of the next two observations, hence creating an overlap. If this overlap is left untreated, test set information leaks into the train set. The second figure applies the Purged CV procedure. Notice how purging removes overlapping observations from the training set and the embargo widens the gap between test and training data. This approach ensures that the evaluation more closely resembles a true out-of-sample test and reduces the risk of backtest overfitting. == Combinatorial Purged Cross-Validation == Walk-forward backtesting analysis, another common cross-validation technique in finance, preserves temporal order but evaluates the model on a single sequence of test sets. This leads to high variance in performance estimation, as results are contingent on a specific historical path. Combinatorial Purged Cross-Validation (CPCV) addresses this limitation by systematically constructing multiple train-test splits, purging overlapping samples, and enforcing an embargo period to prevent information leakage. The result is a distribution of out-of-sample performance estimates, enabling robust statistical inference and more realistic assessment of a model's predictive power. === Methodology === CPCV divides a time-series dataset into N sequential, non-overlapping groups. These groups preserve the temporal order of observations. Then, all combinations of k groups (where k < N) are selected as test sets, with the remaining N − k groups used for training. For each combination, the model is trained and evaluated under strict controls to prevent leakage. To eliminate potential contamination between training and test sets, CPCV introduces two additional mechanisms: Purging: Any training observations whose label horizon overlaps with the test period are excluded. This ensures that future information does not influence model training. Embargoing: After the end of each test period, a fixed number of observations (typically a small percentage) are removed from the training set. This prevents leakage due to delayed market reactions or auto-correlated features. Each data point appears in multiple test sets across different combinations. Because test groups are drawn combinatorially, this process produces multiple backtest "paths," each of which simulates a plausible market scenario. From these paths, practitioners can compute a distribution of performance statistics such as the Sharpe ratio, drawdown, or classification accuracy. === Formal definition === Let N be the number of sequential groups into which the dataset is divided, and let k be the number of groups selected as the test set for each split. Then: The number of unique train-test combinations is given by the binomial coefficient: ( N k ) {\displaystyle {\binom {N}{k}}} Each observation is used in k {\displaystyle k} test sets and contributes to φ [ N , k ] {\displaystyle \varphi [N,k]} unique backtest paths: φ [ N , k ] = k N ( N k ) {\displaystyle \varphi [N,k]={\frac {k}{N}}{\binom {N}{k}}} This yields a distribution of performance metrics rather than a single point estimate, making it possible to apply Monte Carlo-based or probabilistic techniques to assess model robustness. === Illustrative example === Consider the case where N = 6 and k = 2. The number of possible test set combinations is ( 6 2 ) = 15 {\displaystyle {\binom {6}{2}}=15} . Each of the six groups appears in five test splits. Consequently, five distinct backtest paths can be constructed, each incorporating one appearance from every group. ==== Test group assignment matrix ==== This table shows the 15 test combinations. An "x" indicates that the corresponding group is included in the test set for that split. ==== Backtest path assignment ==== Each group contributes to five different backtest paths. The number in each cell indicates the path to which the group's result is assigned for that split. === Advantages === Combinatorial Purged Cross-Validation offers several key benefits over conventional methods: It produces a distribution of performance metrics, enabling more rigorous statistical inference. The method systematically eliminates lookahead bias through purging and embargoing. By simulating multiple historical scenarios, it reduces the dependence on any single market regime or realization. It supports high-confidence comparisons between competing models or strategies. CPCV is commonly used in quantitative strategy research, especially for evaluating predictive models such as classifiers, regressors, and portfolio optimizers. It has been applied to estimate realistic Sharpe ratios, assess the risk of overfitting, and support the use of statistical tools such as the Deflated Sharpe Ratio (DSR). === Limitations === The main limitation of CPCV stems from its high computational cost. However, this cost can be managed by sampling a finite number of splits from the space of all possible combinations.

    Read more →
  • (1+ε)-approximate nearest neighbor search

    (1+ε)-approximate nearest neighbor search

    (1+ε)-approximate nearest neighbor search is a variant of the nearest neighbor search problem. A solution to the (1+ε)-approximate nearest neighbor search is a point or multiple points within distance (1+ε) R from a query point, where R is the distance between the query point and its true nearest neighbor. Reasons to approximate nearest neighbor search include the space and time costs of exact solutions in high-dimensional spaces (see curse of dimensionality) and that in some domains, finding an approximate nearest neighbor is an acceptable solution. Approaches for solving (1+ε)-approximate nearest neighbor search include k-d trees, locality-sensitive hashing and brute-force search.

    Read more →
  • Elastic net regularization

    Elastic net regularization

    In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. Nevertheless, elastic net regularization is typically more accurate than both methods with regard to reconstruction. == Specification == The elastic net method overcomes the limitations of the LASSO (least absolute shrinkage and selection operator) method which uses a penalty function based on ‖ β ‖ 1 = ∑ j = 1 p | β j | . {\displaystyle \|\beta \|_{1}=\textstyle \sum _{j=1}^{p}|\beta _{j}|.} Use of this penalty function has several limitations. For example, in the "large p, small n" case (high-dimensional data with few examples), the LASSO selects at most n variables before it saturates. Also if there is a group of highly correlated variables, then the LASSO tends to select one variable from a group and ignore the others. To overcome these limitations, the elastic net adds a quadratic part ( ‖ β ‖ 2 {\displaystyle \|\beta \|^{2}} ) to the penalty, which when used alone is ridge regression (known also as Tikhonov regularization). The estimates from the elastic net method are defined by β ^ ≡ argmin β ( ‖ y − X β ‖ 2 + λ 2 ‖ β ‖ 2 + λ 1 ‖ β ‖ 1 ) . {\displaystyle {\hat {\beta }}\equiv {\underset {\beta }{\operatorname {argmin} }}(\|y-X\beta \|^{2}+\lambda _{2}\|\beta \|^{2}+\lambda _{1}\|\beta \|_{1}).} The quadratic penalty term makes the loss function strongly convex, and it therefore has a unique minimum. The elastic net method includes the LASSO and ridge regression: in other words, each of them is a special case where λ 1 = λ , λ 2 = 0 {\displaystyle \lambda _{1}=\lambda ,\lambda _{2}=0} or λ 1 = 0 , λ 2 = λ {\displaystyle \lambda _{1}=0,\lambda _{2}=\lambda } . Meanwhile, the naive version of elastic net method finds an estimator in a two-stage procedure : first for each fixed λ 2 {\displaystyle \lambda _{2}} it finds the ridge regression coefficients, and then does a LASSO type shrinkage. This kind of estimation incurs a double amount of shrinkage, which leads to increased bias and poor predictions. To improve the prediction performance, sometimes the coefficients of the naive version of elastic net is rescaled by multiplying the estimated coefficients by ( 1 + λ 2 ) {\displaystyle (1+\lambda _{2})} . Examples of where the elastic net method has been applied are: Support vector machine Metric learning Portfolio optimization Cancer prognosis == Reduction to support vector machine == It was proven in 2014 that the elastic net can be reduced to the linear support vector machine. A similar reduction was previously proven for the LASSO in 2014. The authors showed that for every instance of the elastic net, an artificial binary classification problem can be constructed such that the hyper-plane solution of a linear support vector machine (SVM) is identical to the solution β {\displaystyle \beta } (after re-scaling). The reduction immediately enables the use of highly optimized SVM solvers for elastic net problems. It also enables the use of GPU acceleration, which is often already used for large-scale SVM solvers. The reduction is a simple transformation of the original data and regularization constants X ∈ R n × p , y ∈ R n , λ 1 ≥ 0 , λ 2 ≥ 0 {\displaystyle X\in {\mathbb {R} }^{n\times p},y\in {\mathbb {R} }^{n},\lambda _{1}\geq 0,\lambda _{2}\geq 0} into new artificial data instances and a regularization constant that specify a binary classification problem and the SVM regularization constant X 2 ∈ R 2 p × n , y 2 ∈ { − 1 , 1 } 2 p , C ≥ 0. {\displaystyle X_{2}\in {\mathbb {R} }^{2p\times n},y_{2}\in \{-1,1\}^{2p},C\geq 0.} Here, y 2 {\displaystyle y_{2}} consists of binary labels − 1 , 1 {\displaystyle {-1,1}} . When 2 p > n {\displaystyle 2p>n} it is typically faster to solve the linear SVM in the primal, whereas otherwise the dual formulation is faster. Some authors have referred to the transformation as Support Vector Elastic Net (SVEN), and provided the following MATLAB pseudo-code: == Software == "Glmnet: Lasso and elastic-net regularized generalized linear models" is a software which is implemented as an R source package and as a MATLAB toolbox. This includes fast algorithms for estimation of generalized linear models with ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two penalties (the elastic net) using cyclical coordinate descent, computed along a regularization path. JMP Pro 11 includes elastic net regularization, using the Generalized Regression personality with Fit Model. "pensim: Simulation of high-dimensional data and parallelized repeated penalized regression" implements an alternate, parallelised "2D" tuning method of the ℓ parameters, a method claimed to result in improved prediction accuracy. scikit-learn includes linear regression and logistic regression with elastic net regularization. SVEN, a Matlab implementation of Support Vector Elastic Net. This solver reduces the Elastic Net problem to an instance of SVM binary classification and uses a Matlab SVM solver to find the solution. Because SVM is easily parallelizable, the code can be faster than Glmnet on modern hardware. SpaSM, a Matlab implementation of sparse regression, classification and principal component analysis, including elastic net regularized regression. Apache Spark provides support for Elastic Net Regression in its MLlib machine learning library. The method is available as a parameter of the more general LinearRegression class. SAS (software) The SAS procedure Glmselect and SAS Viya procedure Regselect support the use of elastic net regularization for model selection.

    Read more →
  • Sum of absolute transformed differences

    Sum of absolute transformed differences

    The sum of absolute transformed differences (SATD) is a block matching criterion widely used in fractional motion estimation for video compression. It works by taking a frequency transform, usually a Hadamard transform, of the differences between the pixels in the original block and the corresponding pixels in the block being used for comparison. The transform itself is often of a small block rather than the entire macroblock. For example, in x264, a series of 4×4 blocks are transformed rather than doing the more processor-intensive 16×16 transform. == Comparison to other metrics == SATD is slower than the sum of absolute differences (SAD), both due to its increased complexity and the fact that SAD-specific MMX and SSE2 instructions exist, while there are no such instructions for SATD. However, SATD can still be optimized considerably with SIMD instructions on most modern CPUs. The benefit of SATD is that it more accurately models the number of bits required to transmit the residual error signal. As such, it is often used in video compressors, either as a way to drive and estimate rate explicitly, such as in the Theora encoder (since 1.1 alpha2), as an optional metric used in wide motion searches, such as in the Microsoft VC-1 encoder, or as a metric used in sub-pixel refinement, such as in x264.

    Read more →
  • Seam carving

    Seam carving

    Seam carving (or liquid rescaling) is an algorithm for content-aware image resizing, developed by Shai Avidan, of Mitsubishi Electric Research Laboratories (MERL), and Ariel Shamir, of the Interdisciplinary Center and MERL. It functions by establishing a number of seams (paths of least importance) in an image and automatically removes seams to reduce image size or inserts seams to extend it. Seam carving also allows manually defining areas in which pixels may not be modified, and features the ability to remove whole objects from photographs. The purpose of the algorithm is image retargeting, which is the problem of displaying images without distortion on media of various sizes (cell phones, projection screens) using document standards, like HTML, that already support dynamic changes in page layout and text but not images. Image Retargeting was invented by Vidya Setlur, Saeko Takage, Ramesh Raskar, Michael Gleicher and Bruce Gooch in 2005. The work by Setlur et al. won the 10-year impact award in 2015. == Seams == Seams can be either vertical or horizontal. A vertical seam is a path of pixels connected from top to bottom in an image with one pixel in each row. A horizontal seam is similar with the exception of the connection being from left to right. The importance/energy function values a pixel by measuring its contrast with its neighbor pixels. == Process == The below example describes the process of seam carving: The seams to remove depends only on the dimension (height or width) one wants to shrink. It is also possible to invert step 4 so the algorithm enlarges in one dimension by copying a low energy seam and averaging its pixels with its neighbors. === Computing seams === Computing a seam consists of finding a path of minimum energy cost from one end of the image to another. This can be done via Dijkstra's algorithm, dynamic programming, greedy algorithm or graph cuts among others. ==== Dynamic programming ==== Dynamic programming is a programming method that stores the results of sub-calculations in order to simplify calculating a more complex result. Dynamic programming can be used to compute seams. If attempting to compute a vertical seam (path) of lowest energy, for each pixel in a row we compute the energy of the current pixel plus the energy of one of the three possible pixels above it. The images below depict a DP process to compute one optimal seam. Each square represents a pixel, with the top-left value in red representing the energy value of that pixel. The value in black represents the cumulative sum of energies leading up to and including that pixel. The energy calculation is trivially parallelized for simple functions. The calculation of the DP array can also be parallelized with some interprocess communication. However, the problem of making multiple seams at the same time is harder for two reasons: the energy needs to be regenerated for each removal for correctness and simply tracing back multiple seams can form overlaps. Avidan 2007 computes all seams by removing each seam iteratively and storing an "index map" to record all the seams generated. The map holds a "nth seam" number for each pixel on the image, and can be used later for size adjustment. If one ignores both issues however, a greedy approximation for parallel seam carving is possible. To do so, one starts with the minimum-energy pixel at one end, and keep choosing the minimum energy path to the other end. The used pixels are marked so that they are not picked again. Local seams can also be computed for smaller parts of the image in parallel for a good approximation. == Issues == The algorithm may need user-provided information to reduce errors. This can consist of painting the regions which are to be preserved. With human faces it is possible to use face detection. Sometimes the algorithm, by removing a low energy seam, may end up inadvertently creating a seam of higher energy. The solution to this is to simulate a removal of a seam, and then check the energy delta to see if the energy increases (forward energy). If it does, prefer other seams instead. == Implementations == Adobe Systems acquired a non-exclusive license to seam carving technology from MERL, and implemented it as a feature in Photoshop CS4, where it is called Content Aware Scaling. As the license is non-exclusive, other popular computer graphics applications (e. g. GIMP, digiKam, and ImageMagick) as well as some stand-alone programs (e. g. iResizer) also have implementations of this technique, some of which are released as free and open source software. There also exists an implementation for webpages. == Improvements and extensions == Better energy function and application to video by introducing 2D (time+1D) seams. Faster implementation on GPU. Application of this forward energy function to static images. Multi-operator: Combine with cropping and scaling. Much faster removal of multiple seams. Removing seams through neural deformation fields to extend to continuous domains like 3D scenes. A 2010 review of eight image retargeting methods found that seam carving produced output that was ranked among the worst of the tested algorithms. It was, however, a part of one of the highest-ranking algorithms: the multi-operator extension mentioned above (combined with cropping and scaling).

    Read more →
  • Reservoir computing

    Reservoir computing

    Reservoir computing is a framework for computation derived from recurrent neural network theory that maps input signals into higher dimensional computational spaces through the dynamics of a fixed, non-linear system called a reservoir. After the input signal is fed into the reservoir, which is treated as a "black box," a simple readout mechanism is trained to read the state of the reservoir and map it to the desired output. The first key benefit of this framework is that training is performed only at the readout stage, as the reservoir dynamics are fixed. The second is that the computational power of naturally available systems, both classical and quantum mechanical, can be used to reduce the effective computational cost. == History == The first examples of reservoir neural networks demonstrated that randomly connected recurrent neural networks could be used for sensorimotor sequence learning, and simple forms of interval and speech discrimination. In these early models the memory in the network took the form of both short-term synaptic plasticity and activity mediated by recurrent connections. In other early reservoir neural network models the memory of the recent stimulus history was provided solely by the recurrent activity. Overall, the general concept of reservoir computing stems from the use of recursive connections within neural networks to create a complex dynamical system. It is a generalisation of earlier neural network architectures such as recurrent neural networks, liquid-state machines and echo-state networks. Reservoir computing also extends to physical systems that are not networks in the classical sense, but rather continuous systems in space and/or time: e.g. a literal "bucket of water" can serve as a reservoir that performs computations on inputs given as perturbations of the surface. The resultant complexity of such recurrent neural networks was found to be useful in solving a variety of problems including language processing and dynamic system modeling. However, training of recurrent neural networks is challenging and computationally expensive. Reservoir computing reduces those training-related challenges by fixing the dynamics of the reservoir and only training the linear output layer. A large variety of nonlinear dynamical systems can serve as a reservoir that performs computations. In recent years semiconductor lasers have attracted considerable interest as computation can be fast and energy efficient compared to electrical components. Recent advances in both AI and quantum information theory have given rise to the concept of quantum neural networks. These hold promise in quantum information processing, which is challenging to classical networks, but can also find application in solving classical problems. In 2018, a physical realization of a quantum reservoir computing architecture was demonstrated in the form of nuclear spins within a molecular solid. However, the nuclear spin experiments in did not demonstrate quantum reservoir computing per se as they did not involve processing of sequential data. Rather the data were vector inputs, which makes this more accurately a demonstration of quantum implementation of a random kitchen sink algorithm (also going by the name of extreme learning machines in some communities). In 2019, another possible implementation of quantum reservoir processors was proposed in the form of two-dimensional fermionic lattices. In 2020, realization of reservoir computing on gate-based quantum computers was proposed and demonstrated on cloud-based IBM superconducting near-term quantum computers. Reservoir computers have been used for time-series analysis purposes. In particular, some of their usages involve chaotic time-series prediction, separation of chaotic signals, and link inference of networks from their dynamics. == Classical reservoir computing == === Reservoir === The 'reservoir' in reservoir computing is the internal structure of the computer, and must have two properties: it must be made up of individual, non-linear units, and it must be capable of storing information. The non-linearity describes the response of each unit to input, which is what allows reservoir computers to solve complex problems. Reservoirs are able to store information by connecting the units in recurrent loops, where the previous input affects the next response. The change in reaction due to the past allows the computers to be trained to complete specific tasks. Reservoirs can be virtual or physical. Virtual reservoirs are typically randomly generated and are designed like neural networks. Virtual reservoirs can be designed to have non-linearity and recurrent loops, but, unlike neural networks, the connections between units are randomized and remain unchanged throughout computation. Physical reservoirs are possible because of the inherent non-linearity of certain natural systems. The interaction between ripples on the surface of water contains the nonlinear dynamics required in reservoir creation, and a pattern recognition RC was developed by first inputting ripples with electric motors then recording and analyzing the ripples in the readout. === Readout === The readout is a neural network layer that performs a linear transformation on the output of the reservoir. The weights of the readout layer are trained by analyzing the spatiotemporal patterns of the reservoir after excitation by known inputs, and by utilizing a training method such as a linear regression or a Ridge regression. As its implementation depends on spatiotemporal reservoir patterns, the details of readout methods are tailored to each type of reservoir. For example, the readout for a reservoir computer using a container of liquid as its reservoir might entail observing spatiotemporal patterns on the surface of the liquid. === Types === ==== Context reverberation network ==== An early example of reservoir computing was the context reverberation network. In this architecture, an input layer feeds into a high dimensional dynamical system which is read out by a trainable single-layer perceptron. Two kinds of dynamical system were described: a recurrent neural network with fixed random weights, and a continuous reaction–diffusion system inspired by Alan Turing's model of morphogenesis. At the trainable layer, the perceptron associates current inputs with the signals that reverberate in the dynamical system; the latter were said to provide a dynamic "context" for the inputs. In the language of later work, the reaction–diffusion system served as the reservoir. ==== Echo state network ==== The tree echo state network (TreeESN) model represents a generalization of the reservoir computing framework to tree structured data. ==== Liquid-state machine ==== Chaotic liquid state machine The liquid (i.e. reservoir) of a chaotic liquid state machine (CLSM), or chaotic reservoir, is made from chaotic spiking neurons but which stabilize their activity by settling to a single hypothesis that describes the trained inputs of the machine. This is in contrast to general types of reservoirs that don't stabilize. The liquid stabilization occurs via synaptic plasticity and chaos control that govern neural connections inside the liquid. CLSM showed promising results in learning sensitive time series data. ==== Nonlinear transient computation ==== This type of information processing is most relevant when time-dependent input signals depart from the mechanism's internal dynamics. These departures cause transients or temporary altercations which are represented in the device's output. ==== Deep reservoir computing ==== The extension of the reservoir computing framework towards deep learning, with the introduction of deep reservoir computing and of the deep echo state network (DeepESN) model allows to develop efficiently trained models for hierarchical processing of temporal data, at the same time enabling the investigation on the inherent role of layered composition in recurrent neural networks. == Quantum reservoir computing == Quantum reservoir computing may use the nonlinear nature of quantum mechanical interactions or processes to form the characteristic nonlinear reservoirs but may also be done with linear reservoirs when the injection of the input to the reservoir creates the nonlinearity. The marriage of machine learning and quantum devices is leading to the emergence of quantum neuromorphic computing as a new research area. === Types === ==== Gaussian states of interacting quantum harmonic oscillators ==== Gaussian states are a paradigmatic class of states of continuous variable quantum systems. Although they can nowadays be created and manipulated in, e.g, state-of-the-art optical platforms, naturally robust to decoherence, it is well-known that they are not sufficient for, e.g., universal quantum computing because transformations that preserve the Gaussian nature of a state are linear. Normally, linear dynamics would not be sufficient for nontrivial reser

    Read more →
  • Shattered set

    Shattered set

    A class of sets is said to shatter another set if it is possible to "pick out" any element of that set using intersection. The concept of shattered sets plays an important role in Vapnik–Chervonenkis theory, also known as VC-theory. Shattering and VC-theory are used in the study of empirical processes as well as in statistical computational learning theory. == Definition == Suppose A is a set and C is a class of sets. The class C shatters the set A if for each subset a of A, there is some element c of C such that a = c ∩ A . {\displaystyle a=c\cap A.} Equivalently, C shatters A when their intersection is equal to A's power set: P(A) = { c ∩ A | c ∈ C }. We employ the letter C to refer to a "class" or "collection" of sets, as in a Vapnik–Chervonenkis class (VC-class). The set A is often assumed to be finite because, in empirical processes, we are interested in the shattering of finite sets of data points. == Example == We will show that the class of all discs in the plane (two-dimensional space) does not shatter every set of four points on the unit circle, yet the class of all convex sets in the plane does shatter every finite set of points on the unit circle. Let A be a set of four points on the unit circle and let C be the class of all discs. To test where C shatters A, we attempt to draw a disc around every subset of points in A. First, we draw a disc around the subsets of each isolated point. Next, we try to draw a disc around every subset of point pairs. This turns out to be doable for adjacent points, but impossible for points on opposite sides of the circle. Any attempt to include those points on the opposite side will necessarily include other points not in that pair. Hence, any pair of opposite points cannot be isolated out of A using intersections with class C and so C does not shatter A. As visualized below: Because there is some subset which can not be isolated by any disc in C, we conclude then that A is not shattered by C. And, with a bit of thought, we can prove that no set of four points is shattered by this C. However, if we redefine C to be the class of all elliptical discs, we find that we can still isolate all the subsets from above, as well as the points that were formerly problematic. Thus, this specific set of 4 points is shattered by the class of elliptical discs. Visualized below: With a bit of thought, we could generalize that any set of finite points on a unit circle could be shattered by the class of all convex sets (visualize connecting the dots). == Shatter coefficient == To quantify the richness of a collection C of sets, we use the concept of shattering coefficients (also known as the growth function). For a collection C of sets s ⊂ Ω {\displaystyle s\subset \Omega } , Ω {\displaystyle \Omega } being any space, often a sample space, we define the nth shattering coefficient of C as S C ( n ) = max ∀ x 1 , x 2 , … , x n ∈ Ω card ⁡ { { x 1 , x 2 , … , x n } ∩ s , s ∈ C } {\displaystyle S_{C}(n)=\max _{\forall x_{1},x_{2},\dots ,x_{n}\in \Omega }\operatorname {card} \{\,\{\,x_{1},x_{2},\dots ,x_{n}\}\cap s,s\in C\}} where card {\displaystyle \operatorname {card} } denotes the cardinality of the set and x 1 , x 2 , … , x n ∈ Ω {\displaystyle x_{1},x_{2},\dots ,x_{n}\in \Omega } is any set of n points,. S C ( n ) {\displaystyle S_{C}(n)} is the largest number of subsets of any set A of n points that can be formed by intersecting A with the sets in collection C. For example, if set A contains 3 points, its power set, P ( A ) {\displaystyle P(A)} , contains 2 3 = 8 {\displaystyle 2^{3}=8} elements. If C shatters A, its shattering coefficient(3) would be 8 and S C ( 2 ) {\displaystyle S_{C}(2)} would be 2 2 = 4 {\displaystyle 2^{2}=4} . However, if one of those sets in P ( A ) {\displaystyle P(A)} cannot be obtained through intersections in c, then S C ( 3 ) {\displaystyle S_{C}(3)} would only be 7. If none of those sets can be obtained, S C ( 3 ) {\displaystyle S_{C}(3)} would be 0. Additionally, if S C ( 2 ) = 3 {\displaystyle S_{C}(2)=3} , for example, then there is an element in the set of all 2-point sets from A that cannot be obtained from intersections with C. It follows from this that S C ( 3 ) {\displaystyle S_{C}(3)} would also be less than 8 (i.e. C would not shatter A) because we have already located a "missing" set in the smaller power set of 2-point sets. This example illustrates some properties of S C ( n ) {\displaystyle S_{C}(n)} : S C ( n ) ≤ 2 n {\displaystyle S_{C}(n)\leq 2^{n}} for all n because { s ∩ A | s ∈ C } ⊆ P ( A ) {\displaystyle \{s\cap A|s\in C\}\subseteq P(A)} for any A ⊆ Ω {\displaystyle A\subseteq \Omega } . If S C ( n ) = 2 n {\displaystyle S_{C}(n)=2^{n}} , that means there is a set of cardinality n, which can be shattered by C. If S C ( N ) < 2 N {\displaystyle S_{C}(N)<2^{N}} for some N > 1 {\displaystyle N>1} then S C ( n ) < 2 n {\displaystyle S_{C}(n)<2^{n}} for all n ≥ N {\displaystyle n\geq N} . The third property means that if C cannot shatter any set of cardinality N then it can not shatter sets of larger cardinalities. == Vapnik–Chervonenkis class == If A cannot be shattered by C, there will be a smallest value of n that makes the shatter coefficient(n) less than 2 n {\displaystyle 2^{n}} because as n gets larger, there are more sets that could be missed. Alternatively, there is also a largest value of n for which the S C ( n ) {\displaystyle S_{C}(n)} is still 2 n {\displaystyle 2^{n}} , because as n gets smaller, there are fewer sets that could be omitted. The extreme of this is S C ( 0 ) {\displaystyle S_{C}(0)} (the shattering coefficient of the empty set), which must always be 2 0 = 1 {\displaystyle 2^{0}=1} . These statements lends themselves to defining the VC dimension of a class C as: V C ( C ) = min n { n : S C ( n ) < 2 n } {\displaystyle VC(C)={\underset {n}{\min }}\{n:S_{C}(n)<2^{n}\}\,} or, alternatively, as V C 0 ( C ) = max n { n : S C ( n ) = 2 n } . {\displaystyle VC_{0}(C)={\underset {n}{\max }}\{n:S_{C}(n)=2^{n}\}.\,} Note that V C ( C ) = V C 0 ( C ) + 1. {\displaystyle VC(C)=VC_{0}(C)+1.} . The VC dimension is usually defined as V C 0 {\displaystyle VC_{0}} , the largest cardinality of points chosen that will still shatter A (i.e. n such that S C ( n ) = 2 n {\displaystyle S_{C}(n)=2^{n}} ). Altneratively, if for any n there is a set of cardinality n which can be shattered by C, then S C ( n ) = 2 n {\displaystyle S_{C}(n)=2^{n}} for all n and the VC dimension of this class C is infinite. A class with finite VC dimension is called a Vapnik–Chervonenkis class or VC class. A class C is uniformly Glivenko–Cantelli if and only if it is a VC class.

    Read more →