AI Headshot Generator For Linkedin

AI Headshot Generator For Linkedin — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Video browsing

    Video browsing

    Video browsing, also known as exploratory video search, is the interactive process of skimming through video content in order to satisfy some information need or to interactively check if the video content is relevant. While originally proposed to help users inspecting a single video through visual thumbnails, modern video browsing tools enable users to quickly find desired information in a video archive by iterative human–computer interaction through an exploratory search approach. Many of these tools presume a smart user that wants features to interactively inspect video content, as well as automatic content filtering features. For that purpose, several video interaction features are usually provided, such as sophisticated navigation in video or search by a content-based query. Video browsing tools often build on lower-level video content analysis, such as shot transition detection, keyframe extraction, semantic concept detection, and create a structured content overview of the video file or video archive. Furthermore, they usually provide sophisticated navigation features, such as advanced timelines, visual seeker bars or a list of selected thumbnails, as well as means for content querying. Examples of content queries are shot filtering through visual concepts (e.g., only shots showing cars), through some specific characteristics (e.g., color or motion filtering), through user-provided sketches (e.g., a visually drawn sketch), or through content-based similarity search. == History == Video browsing was originally proposed by Iranian engineer Farshid Arman, Taiwanese computer scientist Arding Hsu, and computer scientist Ming-Yee Chiu, while working at Siemens, and it was presented at the ACM International Conference in August 1993. They described a shot detection algorithm for compressed video that was originally encoded with discrete cosine transform (DCT) video coding standards such as JPEG, MPEG and H.26x. The basic idea was that, since the DCT coefficients are mathematically related to the spatial domain and represent the content of each frame, they can be used to detect the differences between video frames. In the algorithm, a subset of blocks in a frame and a subset of DCT coefficients for each block are used as motion vector representation for the frame. By operating on compressed DCT representations, the algorithm significantly reduces the computational requirements for decompression and enables effective video browsing. The algorithm represents separate shots of a video sequence by an r-frame, a thumbnail of the shot framed by a motion tracking region. A variation of this concept was later adopted for QBIC video content mosaics, where each r-frame is a salient still from the shot it represents. === Video Notebook === Modern video browsing solutions include Video Notebook, a Menlo Park startup founded in 2021 by Mike Lanza, which uses computer vision to extract slides and optical character recognition and speech recognition to facilitate video search. The software can be either used on the client side (using a browser extension), where the slides and text are extracted while the video is watched (e.g. on a video platform like YouTube or Udemy), or on the server side. Processed videos, which can be viewed in the Video Notebook web app, feature a video browsing user interface with extracted timestamped slides, a search bar for querying the video (or a collection of videos), and text chapters. Video Notebook customers include organisations like Ernst & Young. === Video Browser Showdown === The Video Browser Showdown (VBS) is an annual live evaluation competition for exploratory video search tools, where international researchers use video browsing tools to solve ad-hoc video search tasks on a moderately large data set as fast as possible. The main goal of the VBS, which started in 2012 at the International Conference on MultiMedia Modeling (MMM), is to advance the performance of video browsing tools. Since 2016, the VBS also collaborates with TRECVID. The aim of the VBS is to evaluate video browsing tools for efficiency at known-item search (KIS) tasks with a well-defined data set in direct comparison to other tools.

    Read more →
  • Boosting (machine learning)

    Boosting (machine learning)

    In machine learning (ML), boosting is an ensemble learning method that combines a set of less accurate models (called "weak learners") to create a single, highly accurate model (a "strong learner"). Unlike other ensemble methods that build models in parallel (such as bagging), boosting algorithms build models sequentially. Each new model in the sequence is trained to correct the errors made by its predecessors. This iterative process allows the overall model to improve its accuracy, particularly by reducing bias. Boosting is a popular and effective technique used in supervised learning for both classification and regression tasks. The theoretical foundation for boosting came from a question posed by Kearns and Valiant (1988, 1989): "Can a set of weak learners create a single strong learner?" A weak learner is defined as a classifier that performs only slightly better than random guessing, whereas a strong learner is a classifier that is highly correlated with the true classification. Robert Schapire's affirmative answer to this question in a 1990 paper led to the development of practical boosting algorithms. The first such algorithm was developed by Schapire, with Freund and Schapire later developing AdaBoost, which remains a foundational example of boosting. == Algorithms == While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are weighted in a way that is related to the weak learners' accuracy. After a weak learner is added, the data weights are readjusted, known as "re-weighting". Misclassified input data gain a higher weight and examples that are classified correctly lose weight. Thus, future weak learners focus more on the examples that previous weak learners misclassified. There are many boosting algorithms. The original ones, proposed by Robert Schapire (a recursive majority gate formulation), and Yoav Freund (boost by majority), were not adaptive and could not take full advantage of the weak learners. Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel Prize. Only algorithms that are provable boosting algorithms in the probably approximately correct learning formulation can accurately be called boosting algorithms. Other algorithms that are similar in spirit to boosting algorithms are sometimes called "leveraging algorithms", although they are also sometimes incorrectly called boosting algorithms. The main variation between many boosting algorithms is their method of weighting training data points and hypotheses. AdaBoost is very popular and the most significant historically as it was the first algorithm that could adapt to the weak learners. It is often the basis of introductory coverage of boosting in university machine learning courses. There are many more recent algorithms such as LPBoost, TotalBoost, BrownBoost, xgboost, MadaBoost, LogitBoost, CatBoost and others. Many boosting algorithms fit into the AnyBoost framework, which shows that boosting performs gradient descent in a function space using a convex cost function. == Object categorization in computer vision == Given images containing various known objects in the world, a classifier can be learned from them to automatically classify the objects in future images. Simple classifiers built based on some image feature of the object tend to be weak in categorization performance. Using boosting methods for object categorization is a way to unify the weak classifiers in a special way to boost the overall ability of categorization. === Problem of object categorization === Object categorization is a typical task of computer vision that involves determining whether or not an image contains some specific category of object. The idea is closely related with recognition, identification, and detection. Appearance based object categorization typically contains feature extraction, learning a classifier, and applying the classifier to new examples. There are many ways to represent a category of objects, e.g. from shape analysis, bag of words models, or local descriptors such as SIFT, etc. Examples of supervised classifiers are Naive Bayes classifiers, support vector machines, mixtures of Gaussians, and neural networks. However, research has shown that object categories and their locations in images can be discovered in an unsupervised manner as well. === Status quo for object categorization === The recognition of object categories in images is a challenging problem in computer vision, especially when the number of categories is large. This is due to high intra class variability and the need for generalization across variations of objects within the same category. Objects within one category may look quite different. Even the same object may appear unalike under different viewpoint, scale, and illumination. Background clutter and partial occlusion add difficulties to recognition as well. Humans are able to recognize thousands of object types, whereas most of the existing object recognition systems are trained to recognize only a few, e.g. human faces, cars, simple objects, etc. Research has been very active on dealing with more categories and enabling incremental additions of new categories, and although the general problem remains unsolved, several multi-category objects detectors (for up to hundreds or thousands of categories) have been developed. One means is by feature sharing and boosting. === Boosting for binary categorization === AdaBoost can be used for face detection as an example of binary categorization. The two categories are faces versus background. The general algorithm is as follows: Form a large set of simple features Initialize weights for training images For T rounds Normalize the weights For available features from the set, train a classifier using a single feature and evaluate the training error Choose the classifier with the lowest error Update the weights of the training images: increase if classified wrongly by this classifier, decrease if correctly Form the final strong classifier as the linear combination of the T classifiers (coefficient larger if training error is small) After boosting, a classifier constructed from 200 features could yield a 95% detection rate under a 10 − 5 {\displaystyle 10^{-5}} false positive rate. Another application of boosting for binary categorization is a system that detects pedestrians using patterns of motion and appearance. This work is the first to combine both motion information and appearance information as features to detect a walking person. It takes a similar approach to the Viola-Jones object detection framework. === Boosting for multi-class categorization === Compared with binary categorization, multi-class categorization looks for common features that can be shared across the categories at the same time. They turn to be more generic edge like features. During learning, the detectors for each category can be trained jointly. Compared with training separately, it generalizes better, needs less training data, and requires fewer features to achieve the same performance. The main flow of the algorithm is similar to the binary case. What is different is that a measure of the joint training error shall be defined in advance. During each iteration the algorithm chooses a classifier of a single feature (features that can be shared by more categories shall be encouraged). This can be done via converting multi-class classification into a binary one (a set of categories versus the rest), or by introducing a penalty error from the categories that do not have the feature of the classifier. In the paper "Sharing visual features for multiclass and multiview object detection", A. Torralba et al. used GentleBoost for boosting and showed that when training data is limited, learning via sharing features does a much better job than no sharing, given same boosting rounds. Also, for a given performance level, the total number of features required (and therefore the run time cost of the classifier) for the feature sharing detectors, is observed to scale approximately logarithmically with the number of class, i.e., slower than linear growth in the non-sharing case. Similar results are shown in the paper "Incremental learning of object detectors using a visual shape alphabet", yet the authors used AdaBoost for boosting. == Convex vs. non-convex boosting algorithms == Boosting algorithms can be based on convex or non-convex optimization algorithms. Convex algorithms, such as AdaBoost and LogitBoost, can be "defeated" by random noise such that they can't learn basic and learnable combinations of weak hypotheses. This limitation was pointed out by Long & Servedio in 2008. However, by 2009, multiple authors demonstrated that boosting algorithms based on non-convex optimization, such as BrownBoost, can learn from nois

    Read more →
  • SqueezeNet

    SqueezeNet

    SqueezeNet is a deep neural network for image classification released in 2016. SqueezeNet was developed by researchers at DeepScale, University of California, Berkeley, and Stanford University. In designing SqueezeNet, the authors' goal was to create a smaller neural network with fewer parameters while achieving competitive accuracy. Their best-performing model achieved the same accuracy as AlexNet on ImageNet classification, but has a size 510x less than it. == Version history == SqueezeNet was originally released on February 22, 2016. This original version of SqueezeNet was implemented on top of the Caffe deep learning software framework. Shortly thereafter, the open-source research community ported SqueezeNet to a number of other deep learning frameworks. On February 26, 2016, Eddie Bell released a port of SqueezeNet for the Chainer deep learning framework. On March 2, 2016, Guo Haria released a port of SqueezeNet for the Apache MXNet framework. On June 3, 2016, Tammy Yang released a port of SqueezeNet for the Keras framework. In 2017, companies including Baidu, Xilinx, Imagination Technologies, and Synopsys demonstrated SqueezeNet running on low-power processing platforms such as smartphones, FPGAs, and custom processors. As of 2018, SqueezeNet ships "natively" as part of the source code of a number of deep learning frameworks such as PyTorch, Apache MXNet, and Apple CoreML. In addition, third party developers have created implementations of SqueezeNet that are compatible with frameworks such as TensorFlow. Below is a summary of frameworks that support SqueezeNet. == Relationship to other networks == === AlexNet === SqueezeNet was originally described in SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. AlexNet is a deep neural network that has 240 MB of parameters, and SqueezeNet has just 5 MB of parameters. This small model size can more easily fit into computer memory and can more easily be transmitted over a computer network. However, it's important to note that SqueezeNet is not a "squeezed version of AlexNet." Rather, SqueezeNet is an entirely different DNN architecture than AlexNet. What SqueezeNet and AlexNet have in common is that both of them achieve approximately the same level of accuracy when evaluated on the ImageNet image classification validation dataset. === Model compression === Model compression (e.g. quantization and pruning of model parameters) can be applied to a deep neural network after it has been trained. In the SqueezeNet paper, the authors demonstrated that a model compression technique called Deep Compression can be applied to SqueezeNet to further reduce the size of the parameter file from 5 MB to 500 KB. Deep Compression has also been applied to other DNNs, such as AlexNet and VGG. == Variants == Some of the members of the original SqueezeNet team have continued to develop resource-efficient deep neural networks for a variety of applications. A few of these works are noted in the following table. As with the original SqueezeNet model, the open-source research community has ported and adapted these newer "squeeze"-family models for compatibility with multiple deep learning frameworks. In addition, the open-source research community has extended SqueezeNet to other applications, including semantic segmentation of images and style transfer.

    Read more →
  • Markov blanket

    Markov blanket

    In statistics and machine learning, a Markov blanket of a random variable is a set of variables that renders the variable conditionally independent of all other variables in the system. This concept is central in probabilistic graphical models and feature selection. If a Markov blanket is minimal—meaning that no variable in it can be removed without losing this conditional independence—it is called a Markov boundary. Identifying a Markov blanket or boundary allows for efficient inference and helps isolate relevant variables for prediction or causal reasoning. The terms Markov blanket and Markov boundary were coined by Judea Pearl in 1988. A Markov blanket may be derived from the structure of a probabilistic graphical model such as a Bayesian network or Markov random field. == Definition == A Markov blanket of a random variable Y {\displaystyle Y} in a random variable set S = { X 1 , … , X n } {\displaystyle {\mathcal {S}}=\{X_{1},\ldots ,X_{n}\}} is any subset S 1 {\displaystyle {\mathcal {S}}_{1}} of S {\displaystyle {\mathcal {S}}} , conditioned on which other variables are independent with Y {\displaystyle Y} : Y ⊥ ⊥ S ∖ S 1 ∣ S 1 {\displaystyle Y\perp \!\!\!\perp {\mathcal {S}}\smallsetminus {\mathcal {S}}_{1}\mid {\mathcal {S}}_{1}} It means that S 1 {\displaystyle {\mathcal {S}}_{1}} contains at least all the information one needs to infer Y {\displaystyle Y} , where the variables in S ∖ S 1 {\displaystyle {\mathcal {S}}\smallsetminus {\mathcal {S}}_{1}} are redundant. In general, a given Markov blanket is not unique. Any set in S {\displaystyle {\mathcal {S}}} that contains a Markov blanket is also a Markov blanket itself. Specifically, S {\displaystyle {\mathcal {S}}} is a Markov blanket of Y {\displaystyle Y} in S {\displaystyle {\mathcal {S}}} . === Example === In a Bayesian network, the Markov blanket of a node consists of its parents, its children, and its children's other parents (i.e., co-parents). Knowing the values of these nodes makes the target node conditionally independent of the rest of the network. In a Markov random field, the Markov blanket of a node is simply its immediate neighbors. == Markov condition == The concept of a Markov blanket is rooted in the Markov condition, which states that in a probabilistic graphical model, each variable is conditionally independent of its non-descendants given its parents. This condition implies the existence of a minimal separating set — the Markov blanket — that shields a variable from the rest of the network. For instance, when a person holds an object stationary against gravity, the object’s acceleration is fully determined by its direct causes—namely, the upward force from the hand and the downward gravitational pull. Other variables such as air pressure or temperature are causally irrelevant. == Markov boundary == A Markov boundary of Y {\displaystyle Y} in S {\displaystyle {\mathcal {S}}} is a subset S 2 {\displaystyle {\mathcal {S}}_{2}} of S {\displaystyle {\mathcal {S}}} , such that S 2 {\displaystyle {\mathcal {S}}_{2}} itself is a Markov blanket of Y {\displaystyle Y} , but any proper subset of S 2 {\displaystyle {\mathcal {S}}_{2}} is not a Markov blanket of Y {\displaystyle Y} . In other words, a Markov boundary is a minimal Markov blanket. The Markov boundary of a node A {\displaystyle A} in a Bayesian network is the set of nodes composed of A {\displaystyle A} 's parents, A {\displaystyle A} 's children, and A {\displaystyle A} 's children's other parents. In a Markov random field, the Markov boundary for a node is the set of its neighboring nodes. In a dependency network, the Markov boundary for a node is the set of its parents. === Uniqueness of Markov boundary === The Markov boundary always exists. Under some mild conditions, the Markov boundary is unique. However, for most practical and theoretical scenarios multiple Markov boundaries may provide alternative solutions. When there are multiple Markov boundaries, quantities measuring causal effect could fail. == In cognitive science == In the study of consciousness, brain function, and complex adaptive systems, Markov blankets are proposed as a mathematical mechanism which delimits the extent of cognitive entities, whether it be physical or causal.

    Read more →
  • AIOps

    AIOps

    AIOps (Artificial Intelligence for IT Operations) refers to the use of artificial intelligence, machine learning, and big data analytics to automate and enhance data center management. It helps organizations manage complex IT environments by detecting, diagnosing, and resolving issues more efficiently than traditional methods. == History == AIOps was first defined by Gartner in 2016, combining "artificial intelligence" and "IT operations" to describe the application of AI and machine learning to enhance IT operations. This concept was introduced to address the increasing complexity and data volume in IT environments, aiming to automate processes such as event correlation, anomaly detection, and causality determination. == Definition == AIOps refers to multi-layered, complex technology platforms that enhance and automate IT operations by using machine learning and analytics to analyze the large amounts of data collected from various DevOps devices and tools, automatically identifying and responding to issues in real-time. AIOps represents a shift from isolated IT data to aggregated observational data (e.g., job logs and monitoring systems) and interaction data (such as ticketing, events, or incident records) within a big data platform. AIOps applies machine learning and analytics to this data, resulting in continuous visibility that, when combined with automation, can lead to ongoing improvements. AIOps connects three IT disciplines (automation, service management, and performance management) to achieve continuous visibility and improvement. This new approach in modern, accelerated, and hyper-scaled IT environments leverages advances in machine learning and big data to overcome previous limitations. == Components == AIOps includes, but is not limited to, the following processes and techniques: Anomaly Detection Log Analysis Root Cause Analysis Cohort Analysis Event Correlation Predictive Analytics Hardware Failure Prediction Automated Remediation Performance Prediction Incident Management Causality Determination Queue Management Resource Scheduling and Optimization Predictive Capacity Management Resource Allocation Service Quality Monitoring Deployment and Integration Testing System Configuration Auto-diagnosis and Problem Localization Efficient ML Training and Inferencing Using LLMs for Cloud Ops Auto Service Healing Data Center Management Customer Support Security and Privacy in Cloud Operations == Comparison with DevOps == AIOps is increasingly compared with DevOps in terms of impact on operational efficiency. While DevOps focuses on collaboration between development and operations teams to accelerate software delivery, AIOps integrates artificial intelligence to enhance monitoring, automation, and predictive capabilities. Various industry analyses have explored the similarities and differences between the two approaches, including discussions on how organizations can combine them to improve incident management and resource optimization. == Results == AI optimizes IT operations in five ways: First, intelligent monitoring powered by AI helps identify potential issues before they cause outages, improving metrics like Mean Time to Detect (MTTD) by 15-20%. Second, performance data analysis and insights enable quick decision-making by ingesting and analyzing large data sets in real time. Third, AI-driven automated infrastructure optimization efficiently allocates resources and thereby reducing cloud costs. Fourth, enhanced IT service management reduces critical incidents by over 50% through AI-driven end-to-end service management. Lastly, intelligent task automation accelerates problem resolution and automates remedial actions with minimal human intervention. In 2025, Atera Networks was identified as a leader in AIOps by the software review platform G2. == AIOps vs. MLOps == AIOps tools use big data analytics, machine learning algorithms, and predictive analytics to detect anomalies, correlate events, and provide proactive insights. This automation reduces the burden on IT teams, allowing them to focus on strategic tasks rather than routine operational issues. AIOps is widely used by IT operations teams, DevOps, network administrators, and IT service management (ITSM) teams to enhance visibility and enable quicker incident resolution in hybrid cloud environments, data centers, and other IT infrastructures. In contrast to MLOps (Machine Learning Operations), which focuses on the lifecycle management and operational aspects of machine learning models, AIOps focuses on optimizing IT operations using a variety of analytics and AI-driven techniques. While both disciplines rely on AI and data-driven methods, AIOps primarily targets IT operations, whereas MLOps is concerned with the deployment, monitoring, and maintenance of ML models. == Conferences == There are several conferences that are specific to AIOps: AIOps Summit AI Dev Summit IBM Think conference

    Read more →
  • Sliced inverse regression

    Sliced inverse regression

    Sliced inverse regression (SIR) is a tool for dimensionality reduction in the field of multivariate statistics. In statistics, regression analysis is a method of studying the relationship between a response variable y and its input variable x _ {\displaystyle {\underline {x}}} , which is a p-dimensional vector. There are several approaches in the category of regression. For example, parametric methods include multiple linear regression, and non-parametric methods include local smoothing. As the number of observations needed to use local smoothing methods scales exponentially with high-dimensional data (as p grows), reducing the number of dimensions can make the operation computable. Dimensionality reduction aims to achieve this by showing only the most important dimension of the data. SIR uses the inverse regression curve, E ( x _ | y ) {\displaystyle E({\underline {x}}\,|\,y)} , to perform a weighted principal component analysis. == Model == Given a response variable Y {\displaystyle \,Y} and a (random) vector X ∈ R p {\displaystyle X\in \mathbb {R} ^{p}} of explanatory variables, SIR is based on the model Y = f ( β 1 ⊤ X , … , β k ⊤ X , ε ) ( 1 ) {\displaystyle Y=f(\beta _{1}^{\top }X,\ldots ,\beta _{k}^{\top }X,\varepsilon )\quad \quad \quad \quad \quad (1)} where β 1 , … , β k {\displaystyle \beta _{1},\ldots ,\beta _{k}} are unknown projection vectors, k {\displaystyle \,k} is an unknown number smaller than p {\displaystyle \,p} , f {\displaystyle \;f} is an unknown function on R k + 1 {\displaystyle \mathbb {R} ^{k+1}} as it only depends on k {\displaystyle \,k} arguments, and ε {\displaystyle \varepsilon } is a random variable representing error with E [ ε | X ] = 0 {\displaystyle E[\varepsilon |X]=0} and a finite variance of σ 2 {\displaystyle \sigma ^{2}} . The model describes an ideal solution, where Y {\displaystyle \,Y} depends on X ∈ R p {\displaystyle X\in \mathbb {R} ^{p}} only through a k {\displaystyle \,k} dimensional subspace; i.e., one can reduce the dimension of the explanatory variables from p {\displaystyle \,p} to a smaller number k {\displaystyle \,k} without losing any information. An equivalent version of ( 1 ) {\displaystyle \,(1)} is: the conditional distribution of Y {\displaystyle \,Y} given X {\displaystyle \,X} depends on X {\displaystyle \,X} only through the k {\displaystyle \,k} dimensional random vector ( β 1 ⊤ X , … , β k ⊤ X ) {\displaystyle (\beta _{1}^{\top }X,\ldots ,\beta _{k}^{\top }X)} . It is assumed that this reduced vector is as informative as the original X {\displaystyle \,X} in explaining Y {\displaystyle \,Y} . The unknown β i ′ s {\displaystyle \,\beta _{i}'s} are called the effective dimension reducing directions (EDR-directions). The space that is spanned by these vectors is denoted by the effective dimension reducing space (EDR-space). == Relevant linear algebra background == Given a _ 1 , … , a _ r ∈ R n {\displaystyle {\underline {a}}_{1},\ldots ,{\underline {a}}_{r}\in \mathbb {R} ^{n}} , then V := L ( a _ 1 , … , a _ r ) {\displaystyle V:=L({\underline {a}}_{1},\ldots ,{\underline {a}}_{r})} , the set of all linear combinations of these vectors is called a linear subspace and is therefore a vector space. The equation says that vectors a _ 1 , … , a _ r {\displaystyle {\underline {a}}_{1},\ldots ,{\underline {a}}_{r}} span V {\displaystyle \,V} , but the vectors that span space V {\displaystyle \,V} are not unique. The dimension of V ( ∈ R n ) {\displaystyle \,V(\in \mathbb {R} ^{n})} is equal to the maximum number of linearly independent vectors in V {\displaystyle \,V} . A set of n {\displaystyle \,n} linear independent vectors of R n {\displaystyle \mathbb {R} ^{n}} makes up a basis of R n {\displaystyle \mathbb {R} ^{n}} . The dimension of a vector space is unique, but the basis itself is not. Several bases can span the same space. Dependent vectors can still span a space, but the linear combinations of the latter are only suitable to a set of vectors lying on a straight line. == Inverse regression == Computing the inverse regression curve (IR) means instead of looking for E [ Y | X = x ] {\displaystyle \,E[Y|X=x]} , which is a curve in R p {\displaystyle \mathbb {R} ^{p}} it is actually E [ X | Y = y ] {\displaystyle \,E[X|Y=y]} , which is also a curve in R p {\displaystyle \mathbb {R} ^{p}} , but consisting of p {\displaystyle \,p} one-dimensional regressions. The center of the inverse regression curve is located at E [ E [ X | Y ] ] = E [ X ] {\displaystyle \,E[E[X|Y]]=E[X]} . Therefore, the centered inverse regression curve is E [ X | Y = y ] − E [ X ] {\displaystyle \,E[X|Y=y]-E[X]} which is a p {\displaystyle \,p} dimensional curve in R p {\displaystyle \mathbb {R} ^{p}} . == Inverse regression versus dimension reduction == The centered inverse regression curve lies on a k {\displaystyle \,k} -dimensional subspace spanned by Σ x x β i ′ s {\displaystyle \,\Sigma _{xx}\beta _{i}\,'s} . This is a connection between the model and inverse regression. Given this condition and ( 1 ) {\displaystyle \,(1)} , the centered inverse regression curve E [ X | Y = y ] − E [ X ] {\displaystyle \,E[X|Y=y]-E[X]} is contained in the linear subspace spanned by Σ x x β k ( k = 1 , … , K ) {\displaystyle \,\Sigma _{xx}\beta _{k}(k=1,\ldots ,K)} , where Σ x x = C o v ( X ) {\displaystyle \,\Sigma _{xx}=Cov(X)} . == Estimation of the EDR-directions == After having had a look at all the theoretical properties, the aim now is to estimate the EDR-directions. For that purpose, weighted principal component analyses are needed. If the sample means m ^ h ′ s {\displaystyle \,{\hat {m}}_{h}\,'s} , X {\displaystyle \,X} would have been standardized to Z = Σ x x − 1 / 2 { X − E ( X ) } {\displaystyle \,Z=\Sigma _{xx}^{-1/2}\{X-E(X)\}} . Corresponding to the theorem above, the IR-curve m 1 ( y ) = E [ Z | Y = y ] {\displaystyle \,m_{1}(y)=E[Z|Y=y]} lies in the space spanned by ( η 1 , … , η k ) {\displaystyle \,(\eta _{1},\ldots ,\eta _{k})} , where η i = Σ x x 1 / 2 β i {\displaystyle \,\eta _{i}=\Sigma _{xx}^{1/2}\beta _{i}} . As a consequence, the covariance matrix c o v [ E [ Z | Y ] ] {\displaystyle \,cov[E[Z|Y]]} is degenerate in any direction orthogonal to the η i ′ s {\displaystyle \,\eta _{i}\,'s} . Therefore, the eigenvectors η k ( k = 1 , … , K ) {\displaystyle \,\eta _{k}(k=1,\ldots ,K)} associated with the largest K {\displaystyle \,K} eigenvalues are the standardized EDR-directions. == Algorithm == === SIR algorithm === The algorithm from Li, K-C. (1991) to estimate the EDR-directions via SIR is as follows. 1. Let Σ x x {\displaystyle \,\Sigma _{xx}} be the covariance matrix of X {\displaystyle \,X} . Standardize X {\displaystyle \,X} to Z = Σ x x − 1 / 2 { X − E ( X ) } {\displaystyle \,Z=\Sigma _{xx}^{-1/2}\{X-E(X)\}} ( 1 ) {\displaystyle \,(1)} can also be rewritten as Y = f ( η 1 ⊤ Z , … , η k ⊤ Z , ε ) {\displaystyle Y=f(\eta _{1}^{\top }Z,\ldots ,\eta _{k}^{\top }Z,\varepsilon )} where η k = β k Σ x x 1 / 2 ∀ k {\displaystyle \,\eta _{k}=\beta _{k}\Sigma _{xx}^{1/2}\quad \forall \;k} .) 2. Divide the range of y i {\displaystyle \,y_{i}} into S {\displaystyle \,S} non-overlapping slices H s ( s = 1 , … , S ) . n s {\displaystyle \,H_{s}(s=1,\ldots ,S).\;n_{s}} is the number of observations within each slice and I H s {\displaystyle \,I_{H_{s}}} is the indicator function for the slice: n s = ∑ i = 1 n I H s ( y i ) {\displaystyle n_{s}=\sum _{i=1}^{n}I_{H_{s}}(y_{i})} 3. Compute the mean of z i {\displaystyle \,z_{i}} over all slices, which is a crude estimate m ^ 1 {\displaystyle \,{\hat {m}}_{1}} of the inverse regression curve m 1 {\displaystyle \,m_{1}} : z ¯ s = n s − 1 ∑ i = 1 n z i I H s ( y i ) {\displaystyle \,{\bar {z}}_{s}=n_{s}^{-1}\sum _{i=1}^{n}z_{i}I_{H_{s}}(y_{i})} 4. Calculate the estimate for C o v { m 1 ( y ) } {\displaystyle \,Cov\{m_{1}(y)\}} : V ^ = n − 1 ∑ i = 1 S n s z ¯ s z ¯ s ⊤ {\displaystyle \,{\hat {V}}=n^{-1}\sum _{i=1}^{S}n_{s}{\bar {z}}_{s}{\bar {z}}_{s}^{\top }} 5. Identify the eigenvalues λ ^ i {\displaystyle \,{\hat {\lambda }}_{i}} and the eigenvectors η ^ i {\displaystyle \,{\hat {\eta }}_{i}} of V ^ {\displaystyle \,{\hat {V}}} , which are the standardized EDR-directions. 6. Transform the standardized EDR-directions back to the original scale. The estimates for the EDR-directions are given by: β ^ i = Σ ^ x x − 1 / 2 η ^ i {\displaystyle \,{\hat {\beta }}_{i}={\hat {\Sigma }}_{xx}^{-1/2}{\hat {\eta }}_{i}} (which are not necessarily orthogonal.)

    Read more →
  • Random neural network

    Random neural network

    The Random Neural Network (RNN) is a mathematical representation of an interconnected network of neurons or cells which exchange spiking signals. It was invented by Erol Gelenbe and is linked to the G-network model of queueing networks which Erol Gelenbe also invented, and with his Gene Regulatory Network models. In this model, each neuronal cell state is represented by an integer whose value rises when the cell receives an excitatory spike and drops when it receives an inhibitory spike. The spikes can originate outside the network itself, or they can come from other cells in the networks. Cells whose internal excitatory state has a positive value are allowed to send out spikes of either kind to other cells in the network according to specific cell-dependent spiking rates. The model has a mathematical solution in steady-state which provides the joint probability distribution of the network in terms of the individual probabilities that each cell is excited and able to send out spikes. Computing this solution is based on solving a set of non-linear algebraic equations whose parameters are related to the spiking rates of individual cells and their connectivity to other cells, as well as the arrival rates of spikes from outside the network. The RNN is a recurrent model, i.e. a neural network that is allowed to have complex feedback loops. A highly energy-efficient implementation of random neural networks was demonstrated by Krishna Palem et al. using the Probabilistic CMOS or PCMOS technology and was shown to be c. 226–300 times more efficient in terms of Energy-Performance-Product. RNNs are also related to artificial neural networks, which (like the random neural network) have gradient-based learning algorithms. The learning algorithm for an n-node random neural network that includes feedback loops (it is also a recurrent neural network) is of computational complexity O(n^3) (the number of computations is proportional to the cube of n, the number of neurons). The random neural network can also be used with other learning algorithms such as reinforcement learning. The RNN has been shown to be a universal approximator for bounded and continuous functions.

    Read more →
  • Policy gradient method

    Policy gradient method

    Policy gradient methods are a class of reinforcement learning algorithms and a sub-class of policy optimization methods. Unlike value-based methods which learn a value function to derive a policy, policy optimization methods directly learn a policy function π {\displaystyle \pi } that selects actions without consulting a value function. For policy gradient to apply, the policy function π θ {\displaystyle \pi _{\theta }} is parameterized by a differentiable parameter θ {\displaystyle \theta } . == Overview == In policy-based RL, the actor is a parameterized policy function π θ {\displaystyle \pi _{\theta }} , where θ {\displaystyle \theta } are the parameters of the actor. The actor takes as argument the state of the environment s {\displaystyle s} and produces a probability distribution π θ ( ⋅ ∣ s ) {\displaystyle \pi _{\theta }(\cdot \mid s)} . If the action space is discrete, then ∑ a π θ ( a ∣ s ) = 1 {\displaystyle \sum _{a}\pi _{\theta }(a\mid s)=1} . If the action space is continuous, then ∫ a π θ ( a ∣ s ) d a = 1 {\displaystyle \int _{a}\pi _{\theta }(a\mid s)\mathrm {d} a=1} . The goal of policy optimization is to find some θ {\displaystyle \theta } that maximizes the expected episodic reward J ( θ ) {\displaystyle J(\theta )} : J ( θ ) = E π θ [ ∑ t = 0 T γ t R t | S 0 = s 0 ] {\displaystyle J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\gamma ^{t}R_{t}{\Big |}S_{0}=s_{0}\right]} where γ {\displaystyle \gamma } is the discount factor, R t {\displaystyle R_{t}} is the reward at step t {\displaystyle t} , s 0 {\displaystyle s_{0}} is the starting state, and T {\displaystyle T} is the time-horizon (which can be infinite). The policy gradient is defined as ∇ θ J ( θ ) {\displaystyle \nabla _{\theta }J(\theta )} . Different policy gradient methods stochastically estimate the policy gradient in different ways. The goal of any policy gradient method is to iteratively maximize J ( θ ) {\displaystyle J(\theta )} by gradient ascent. Since the key part of any policy gradient method is the stochastic estimation of the policy gradient, they are also studied under the title of "Monte Carlo gradient estimation". == REINFORCE == === Policy gradient === The REINFORCE algorithm, introduced by Ronald J. Williams in 1992, was the first policy gradient method. It is based on the identity for the policy gradient ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln ⁡ π θ ( A t ∣ S t ) ∑ t = 0 T ( γ t R t ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}\mid S_{t})\;\sum _{t=0}^{T}(\gamma ^{t}R_{t}){\Big |}S_{0}=s_{0}\right]} which can be improved via the "causality trick" ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln ⁡ π θ ( A t ∣ S t ) ∑ τ = t T ( γ τ R τ ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}\mid S_{t})\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau }){\Big |}S_{0}=s_{0}\right]} Thus, we have an unbiased estimator of the policy gradient: ∇ θ J ( θ ) ≈ 1 N ∑ n = 1 N [ ∑ t = 0 T ∇ θ ln ⁡ π θ ( A t , n ∣ S t , n ) ∑ τ = t T ( γ τ − t R τ , n ) ] {\displaystyle \nabla _{\theta }J(\theta )\approx {\frac {1}{N}}\sum _{n=1}^{N}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t,n}\mid S_{t,n})\sum _{\tau =t}^{T}(\gamma ^{\tau -t}R_{\tau ,n})\right]} where the index n {\displaystyle n} ranges over N {\displaystyle N} rollout trajectories using the policy π θ {\displaystyle \pi _{\theta }} . The score function ∇ θ ln ⁡ π θ ( A t ∣ S t ) {\displaystyle \nabla _{\theta }\ln \pi _{\theta }(A_{t}\mid S_{t})} can be interpreted as the direction in the parameter space that increases the probability of taking action A t {\displaystyle A_{t}} in state S t {\displaystyle S_{t}} . The policy gradient, then, is a weighted average of all possible directions to increase the probability of taking any action in any state, but weighted by reward signals, so that if taking a certain action in a certain state is associated with high reward, then that direction would be highly reinforced, and vice versa. === Algorithm === The REINFORCE algorithm is a loop: Rollout N {\displaystyle N} trajectories in the environment, using π θ t {\displaystyle \pi _{\theta _{t}}} as the policy function. Compute the policy gradient estimation: g i ← 1 N ∑ n = 1 N [ ∑ t = 0 T ∇ θ t ln ⁡ π θ ( A t , n ∣ S t , n ) ∑ τ = t T ( γ τ R τ , n ) ] {\displaystyle g_{i}\leftarrow {\frac {1}{N}}\sum _{n=1}^{N}\left[\sum _{t=0}^{T}\nabla _{\theta _{t}}\ln \pi _{\theta }(A_{t,n}\mid S_{t,n})\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau ,n})\right]} Update the policy by gradient ascent: θ i + 1 ← θ i + α i g i {\displaystyle \theta _{i+1}\leftarrow \theta _{i}+\alpha _{i}g_{i}} Here, α i {\displaystyle \alpha _{i}} is the learning rate at update step i {\displaystyle i} . == Variance reduction == REINFORCE is an on-policy algorithm, meaning that the trajectories used for the update must be sampled from the current policy π θ {\displaystyle \pi _{\theta }} . This can lead to high variance in the updates, as the returns R ( τ ) {\displaystyle R(\tau )} can vary significantly between trajectories. Many variants of REINFORCE have been introduced, under the title of variance reduction. === REINFORCE with baseline === A common way for reducing variance is the REINFORCE with baseline algorithm, based on the following identity: ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln ⁡ π θ ( A t | S t ) ( ∑ τ = t T ( γ τ R τ ) − b ( S t ) ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\left(\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau })-b(S_{t})\right){\Big |}S_{0}=s_{0}\right]} for any function b : States → R {\displaystyle b:{\text{States}}\to \mathbb {R} } . This can be proven by applying the previous lemma. The algorithm uses the modified gradient estimator g i ← 1 N ∑ n = 1 N [ ∑ t = 0 T ∇ θ t ln ⁡ π θ ( A t , n | S t , n ) ( ∑ τ = t T ( γ τ R τ , n ) − b i ( S t , n ) ) ] {\displaystyle g_{i}\leftarrow {\frac {1}{N}}\sum _{n=1}^{N}\left[\sum _{t=0}^{T}\nabla _{\theta _{t}}\ln \pi _{\theta }(A_{t,n}|S_{t,n})\left(\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau ,n})-b_{i}(S_{t,n})\right)\right]} and the original REINFORCE algorithm is the special case where b i ≡ 0 {\displaystyle b_{i}\equiv 0} . === Actor-critic methods === If b i {\textstyle b_{i}} is chosen well, such that b i ( S t ) ≈ ∑ τ = t T ( γ τ R τ ) = γ t V π θ i ( S t ) {\textstyle b_{i}(S_{t})\approx \sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau })=\gamma ^{t}V^{\pi _{\theta _{i}}}(S_{t})} , this could significantly decrease variance in the gradient estimation. That is, the baseline should be as close to the value function V π θ i ( S t ) {\displaystyle V^{\pi _{\theta _{i}}}(S_{t})} as possible, approaching the ideal of: ∇ θ J ( θ ) = E π θ [ ∑ t = 0 T ∇ θ ln ⁡ π θ ( A t | S t ) ( ∑ τ = t T ( γ τ R τ ) − γ t V π θ ( S t ) ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=\mathbb {E} _{\pi _{\theta }}\left[\sum _{t=0}^{T}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\left(\sum _{\tau =t}^{T}(\gamma ^{\tau }R_{\tau })-\gamma ^{t}V^{\pi _{\theta }}(S_{t})\right){\Big |}S_{0}=s_{0}\right]} Note that, as the policy π θ t {\displaystyle \pi _{\theta _{t}}} updates, the value function V π θ i ( S t ) {\displaystyle V^{\pi _{\theta _{i}}}(S_{t})} updates as well, so the baseline should also be updated. One common approach is to train a separate function that estimates the value function, and use that as the baseline. This is one of the actor-critic methods, where the policy function is the actor and the value function is the critic. The Q-function Q π {\displaystyle Q^{\pi }} can also be used as the critic, since ∇ θ J ( θ ) = E π θ [ ∑ 0 ≤ t ≤ T γ t ∇ θ ln ⁡ π θ ( A t | S t ) ⋅ Q π θ ( S t , A t ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=E_{\pi _{\theta }}\left[\sum _{0\leq t\leq T}\gamma ^{t}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\cdot Q^{\pi _{\theta }}(S_{t},A_{t}){\Big |}S_{0}=s_{0}\right]} by a similar argument using the tower law. Subtracting the value function as a baseline, we find that the advantage function A π ( S , A ) = Q π ( S , A ) − V π ( S ) {\displaystyle A^{\pi }(S,A)=Q^{\pi }(S,A)-V^{\pi }(S)} can be used as the critic as well: ∇ θ J ( θ ) = E π θ [ ∑ 0 ≤ t ≤ T γ t ∇ θ ln ⁡ π θ ( A t | S t ) ⋅ A π θ ( S t , A t ) | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=E_{\pi _{\theta }}\left[\sum _{0\leq t\leq T}\gamma ^{t}\nabla _{\theta }\ln \pi _{\theta }(A_{t}|S_{t})\cdot A^{\pi _{\theta }}(S_{t},A_{t}){\Big |}S_{0}=s_{0}\right]} In summary, there are many unbiased estimators for ∇ θ J θ {\textstyle \nabla _{\theta }J_{\theta }} , all in the form of: ∇ θ J ( θ ) = E π θ [ ∑ 0 ≤ t ≤ T ∇ θ ln ⁡ π θ ( A t | S t ) ⋅ Ψ t | S 0 = s 0 ] {\displaystyle \nabla _{\theta }J(\theta )=E_{\pi _{\theta }}\left[\su

    Read more →
  • CinePlayer

    CinePlayer

    CinePlayer is a software based media player used to review Digital Cinema Packages (DCP) without the need for a digital cinema server by Doremi Labs. CinePlayer can play back any DCP, not just those created by Doremi Mastering products. In addition to playing DCPs, CinePlayer can also playback JPEG2000 image sequences and many popular multimedia file types. There are two versions of CinePlayer available, standard and Pro. The standard version supports playback of non-encrypted, 2D DCP's up to 2K resolution. The Pro version supports playback of encrypted, 2D or 3D DCP's with subtitles up to 4K resolution. == Supported formats == === Containers === AVI MOV MXF MPG TS WMV M2TS MTS MP4 MKV === Video codecs === JPEG2000 ProRes 422 DNxHD YUV Uncompressed 8-10 bits DIVX XVID MPEG4 AVC / H-264 VC-1 MPEG2 === Supported image sequences === BMP TIFF TGA DPX JPG J2C === Supported audio files === WAV MP3 WMA MP2

    Read more →
  • Semantic mapping (statistics)

    Semantic mapping (statistics)

    Semantic mapping (SM) is a statistical method for dimensionality reduction (the transformation of data from a high-dimensional space into a low-dimensional space). SM can be used in a set of multidimensional vectors of features to extract a few new features that preserves the main data characteristics. SM performs dimensionality reduction by clustering the original features in semantic clusters and combining features mapped in the same cluster to generate an extracted feature. Given a data set, this method constructs a projection matrix that can be used to map a data element from a high-dimensional space into a reduced dimensional space. SM can be applied in construction of text mining and information retrieval systems, as well as systems managing vectors of high dimensionality. SM is an alternative to random mapping, principal components analysis and latent semantic indexing methods.

    Read more →
  • Frequent pattern discovery

    Frequent pattern discovery

    Frequent pattern discovery (or FP discovery, FP mining, or Frequent itemset mining) is part of knowledge discovery in databases, Massive Online Analysis, and data mining; it describes the task of finding the most frequent and relevant patterns in large datasets. The concept was first introduced for mining transaction databases. Frequent patterns are defined as subsets (itemsets, subsequences, or substructures) that appear in a data set with frequency no less than a user-specified or auto-determined threshold. == Techniques == Techniques for FP mining include: market basket analysis cross-marketing catalog design clustering classification recommendation systems For the most part, FP discovery can be done using association rule learning with particular algorithms Eclat, FP-growth and the Apriori algorithm. Other strategies include: Frequent subtree mining Structure mining Sequential pattern mining and respective specific techniques. Implementations exist for various machine learning systems or modules like MLlib for Apache Spark.

    Read more →
  • Elastic net regularization

    Elastic net regularization

    In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. Nevertheless, elastic net regularization is typically more accurate than both methods with regard to reconstruction. == Specification == The elastic net method overcomes the limitations of the LASSO (least absolute shrinkage and selection operator) method which uses a penalty function based on ‖ β ‖ 1 = ∑ j = 1 p | β j | . {\displaystyle \|\beta \|_{1}=\textstyle \sum _{j=1}^{p}|\beta _{j}|.} Use of this penalty function has several limitations. For example, in the "large p, small n" case (high-dimensional data with few examples), the LASSO selects at most n variables before it saturates. Also if there is a group of highly correlated variables, then the LASSO tends to select one variable from a group and ignore the others. To overcome these limitations, the elastic net adds a quadratic part ( ‖ β ‖ 2 {\displaystyle \|\beta \|^{2}} ) to the penalty, which when used alone is ridge regression (known also as Tikhonov regularization). The estimates from the elastic net method are defined by β ^ ≡ argmin β ( ‖ y − X β ‖ 2 + λ 2 ‖ β ‖ 2 + λ 1 ‖ β ‖ 1 ) . {\displaystyle {\hat {\beta }}\equiv {\underset {\beta }{\operatorname {argmin} }}(\|y-X\beta \|^{2}+\lambda _{2}\|\beta \|^{2}+\lambda _{1}\|\beta \|_{1}).} The quadratic penalty term makes the loss function strongly convex, and it therefore has a unique minimum. The elastic net method includes the LASSO and ridge regression: in other words, each of them is a special case where λ 1 = λ , λ 2 = 0 {\displaystyle \lambda _{1}=\lambda ,\lambda _{2}=0} or λ 1 = 0 , λ 2 = λ {\displaystyle \lambda _{1}=0,\lambda _{2}=\lambda } . Meanwhile, the naive version of elastic net method finds an estimator in a two-stage procedure : first for each fixed λ 2 {\displaystyle \lambda _{2}} it finds the ridge regression coefficients, and then does a LASSO type shrinkage. This kind of estimation incurs a double amount of shrinkage, which leads to increased bias and poor predictions. To improve the prediction performance, sometimes the coefficients of the naive version of elastic net is rescaled by multiplying the estimated coefficients by ( 1 + λ 2 ) {\displaystyle (1+\lambda _{2})} . Examples of where the elastic net method has been applied are: Support vector machine Metric learning Portfolio optimization Cancer prognosis == Reduction to support vector machine == It was proven in 2014 that the elastic net can be reduced to the linear support vector machine. A similar reduction was previously proven for the LASSO in 2014. The authors showed that for every instance of the elastic net, an artificial binary classification problem can be constructed such that the hyper-plane solution of a linear support vector machine (SVM) is identical to the solution β {\displaystyle \beta } (after re-scaling). The reduction immediately enables the use of highly optimized SVM solvers for elastic net problems. It also enables the use of GPU acceleration, which is often already used for large-scale SVM solvers. The reduction is a simple transformation of the original data and regularization constants X ∈ R n × p , y ∈ R n , λ 1 ≥ 0 , λ 2 ≥ 0 {\displaystyle X\in {\mathbb {R} }^{n\times p},y\in {\mathbb {R} }^{n},\lambda _{1}\geq 0,\lambda _{2}\geq 0} into new artificial data instances and a regularization constant that specify a binary classification problem and the SVM regularization constant X 2 ∈ R 2 p × n , y 2 ∈ { − 1 , 1 } 2 p , C ≥ 0. {\displaystyle X_{2}\in {\mathbb {R} }^{2p\times n},y_{2}\in \{-1,1\}^{2p},C\geq 0.} Here, y 2 {\displaystyle y_{2}} consists of binary labels − 1 , 1 {\displaystyle {-1,1}} . When 2 p > n {\displaystyle 2p>n} it is typically faster to solve the linear SVM in the primal, whereas otherwise the dual formulation is faster. Some authors have referred to the transformation as Support Vector Elastic Net (SVEN), and provided the following MATLAB pseudo-code: == Software == "Glmnet: Lasso and elastic-net regularized generalized linear models" is a software which is implemented as an R source package and as a MATLAB toolbox. This includes fast algorithms for estimation of generalized linear models with ℓ1 (the lasso), ℓ2 (ridge regression) and mixtures of the two penalties (the elastic net) using cyclical coordinate descent, computed along a regularization path. JMP Pro 11 includes elastic net regularization, using the Generalized Regression personality with Fit Model. "pensim: Simulation of high-dimensional data and parallelized repeated penalized regression" implements an alternate, parallelised "2D" tuning method of the ℓ parameters, a method claimed to result in improved prediction accuracy. scikit-learn includes linear regression and logistic regression with elastic net regularization. SVEN, a Matlab implementation of Support Vector Elastic Net. This solver reduces the Elastic Net problem to an instance of SVM binary classification and uses a Matlab SVM solver to find the solution. Because SVM is easily parallelizable, the code can be faster than Glmnet on modern hardware. SpaSM, a Matlab implementation of sparse regression, classification and principal component analysis, including elastic net regularized regression. Apache Spark provides support for Elastic Net Regression in its MLlib machine learning library. The method is available as a parameter of the more general LinearRegression class. SAS (software) The SAS procedure Glmselect and SAS Viya procedure Regselect support the use of elastic net regularization for model selection.

    Read more →
  • Availability zone

    Availability zone

    In cloud computing, an availability region is a group of data centres that are located in the same geographical region. Availability regions comprise multiple availability zones, which are groups of data centres that are located far enough from each other to prevent large-scale outages in the event of failure of a single zone, whilst still being close enough to each other to enable low-latency connections. Distributed systems spanning multiple availability zones allow for high availability, even in the event of catastrophic failure, such as natural disasters. Services offering distinct availability zones include Amazon Web Services, Microsoft Azure and Google Cloud.

    Read more →
  • Induction of regular languages

    Induction of regular languages

    In computational learning theory, induction of regular languages refers to the task of learning a formal description (e.g. grammar) of a regular language from a given set of example strings. Although E. Mark Gold has shown that not every regular language can be learned this way (see language identification in the limit), approaches have been investigated for a variety of subclasses. They are sketched in this article. For learning of more general grammars, see Grammar induction. == Definitions == A regular language is defined as a (finite or infinite) set of strings that can be described by one of the mathematical formalisms called "finite automaton", "regular grammar", or "regular expression", all of which have the same expressive power. Since the latter formalism leads to shortest notations, it shall be introduced and used here. Given a set Σ of symbols (a.k.a. alphabet), a regular expression can be any of ∅ (denoting the empty set of strings), ε (denoting the singleton set containing just the empty string), a (where a is any character in Σ; denoting the singleton set just containing the single-character string a), r + s (where r and s are, in turn, simpler regular expressions; denoting their set's union) r ⋅ s (denoting the set of all possible concatenations of strings from r's and s's set), r + (denoting the set of n-fold repetitions of strings from r's set, for any n ≥ 1), or r (similarly denoting the set of n-fold repetitions, but also including the empty string, seen as 0-fold repetition). For example, using Σ = {0,1}, the regular expression (0+1+ε)⋅(0+1) denotes the set of all binary numbers with one or two digits (leading zero allowed), while 1⋅(0+1)⋅0 denotes the (infinite) set of all even binary numbers (no leading zeroes). Given a set of strings (also called "positive examples"), the task of regular language induction is to come up with a regular expression that denotes a set containing all of them. As an example, given {1, 10, 100}, a "natural" description could be the regular expression 1⋅0, corresponding to the informal characterization "a 1 followed by arbitrarily many (maybe even none) 0's". However, (0+1) and 1+(1⋅0)+(1⋅0⋅0) is another regular expression, denoting the largest (assuming Σ = {0,1}) and the smallest set containing the given strings, and called the trivial overgeneralization and undergeneralization, respectively. Some approaches work in an extended setting where also a set of "negative example" strings is given; then, a regular expression is to be found that generates all of the positive, but none of the negative examples. == Lattice of automata == Dupont et al. have shown that the set of all structurally complete finite automata generating a given input set of example strings forms a lattice, with the trivial undergeneralized and the trivial overgeneralized automaton as bottom and top element, respectively. Each member of this lattice can be obtained by factoring the undergeneralized automaton by an appropriate equivalence relation. For the above example string set {1, 10, 100}, the picture shows at its bottom the undergeneralized automaton Aa,b,c,d in grey, consisting of states a, b, c, and d. On the state set {a,b,c,d}, a total of 15 equivalence relations exist, forming a lattice. Mapping each equivalence E to the corresponding quotient automaton language L(Aa,b,c,d / E) obtains the partially ordered set shown in the picture. Each node's language is denoted by a regular expression. The language may be recognized by quotient automata w.r.t. different equivalence relations, all of which are shown below the node. An arrow between two nodes indicates that the lower node's language is a proper subset of the higher node's. If both positive and negative example strings are given, Dupont et al. build the lattice from the positive examples, and then investigate the separation border between automata that generate some negative example and such that do not. Most interesting are those automata immediately below the border. In the picture, separation borders are shown for the negative example strings 11 (green), 1001 (blue), 101 (cyan), and 0 (red). Coste and Nicolas present an own search method within the lattice, which they relate to Mitchell's version space paradigm. To find the separation border, they use a graph coloring algorithm on the state inequality relation induced by the negative examples. Later, they investigate several ordering relations on the set of all possible state fusions. Kudo and Shimbo use the representation by automaton factorizations to give a unique framework for the following approaches (sketched below): k-reversible languages and the "tail clustering" follow-up approach, Successor automata and the predecessor-successor method, and pumping-based approaches (framework-integration challenged by Luzeaux, however). Each of these approaches is shown to correspond to a particular kind of equivalence relations used for factorization. == Approaches == === k-reversible languages === Angluin considers so-called "k-reversible" regular automata, that is, deterministic automata in which each state can be reached from at most one state by following a transition chain of length k. Formally, if Σ, Q, and δ denote the input alphabet, the state set, and the transition function of an automaton A, respectively, then A is called k-reversible if: ∀a0, ..., ak ∈ Σ ∀s1, s2 ∈ Q: δ(s1, a0...ak) = δ(s2, a0...ak) ⇒ s1 = s2, where δ means the homomorphic extension of δ to arbitrary words. Angluin gives a cubic algorithm for learning of the smallest k-reversible language from a given set of input words; for k = 0, the algorithm has even almost linear complexity. The required state uniqueness after k + 1 given symbols forces unifying automaton states, thus leading to a proper generalization different from the trivial undergeneralized automaton. This algorithm has been used to learn simple parts of English syntax; later, an incremental version has been provided. Another approach based on k-reversible automata is the tail clustering method. === Successor automata === From a given set of input strings, Vernadat and Richetin build a so-called successor automaton, consisting of one state for each distinct character and a transition between each two adjacent characters' states. For example, the singleton input set {aabbaabb} leads to an automaton corresponding to the regular expression (a+⋅b+). An extension of this approach is the predecessor-successor method which generalizes each character repetition immediately to a Kleene + and then includes for each character the set of its possible predecessors in its state. Successor automata can learn exactly the class of local languages. Since each regular language is the homomorphic image of a local language, grammars from the former class can be learned by lifting, if an appropriate (depending on the intended application) homomorphism is provided. In particular, there is such a homomorphism for the class of languages learnable by the predecessor-successor method. The learnability of local languages can be reduced to that of k-reversible languages. === Early approaches === Chomsky and Miller (1957) used the pumping lemma: they guess a part v of an input string uvw and try to build a corresponding cycle into the automaton to be learned; using membership queries they ask, for appropriate k, which of the strings uw, uvvw, uvvvw, ..., uvkw also belongs to the language to be learned, thereby refining the structure of their automaton. In 1959, Solomonoff generalized this approach to context-free languages, which also obey a pumping lemma. === Cover automata === Câmpeanu et al. learn a finite automaton as a compact representation of a large finite language. Given such a language F, they search a so-called cover automaton A such that its language L(A) covers F in the following sense: L(A) ∩ Σ≤ l = F, where l is the length of the longest string in F, and Σ≤ l denotes the set of all strings not longer than l. If such a cover automaton exists, F is uniquely determined by A and l. For example, F = {ad, read, reread } has l = 6 and a cover automaton corresponding to the regular expression (r⋅e)⋅a⋅d. For two strings x and y, Câmpeanu et al. define x ~ y if xz ∈ F ⇔ yz ∈ F for all strings z of a length such that both xz and yz are not longer than l. Based on this relation, whose lack of transitivity causes considerable technical problems, they give an O(n4) algorithm to construct from F a cover automaton A of minimal state count. Moreover, for union, intersection, and difference of two finite languages they provide corresponding operations on their cover automata. Păun et al. improve the time complexity to O(n2). === Residual automata === For a set S of strings and a string u, the Brzozowski derivative u−1S is defined as the set of all rest-strings obtainable from a string in S by cutting off its prefix u (if possible), formally: u−1S = {v ∈ Σ: uv ∈ S}, cf. picture. Denis et al. define a

    Read more →
  • Memtransistor

    Memtransistor

    The memtransistor (a blend word from Memory Transfer Resistor) is an experimental multi-terminal passive electronic component that might be used in the construction of artificial neural networks. It is a combination of the memristor and transistor technology. This technology is different from the 1T-1R approach since the devices are merged into one single entity. Multiple memristors can be embedded with a single transistor, enabling it to more accurately model a neuron with its multiple synaptic connections. A neural network produced from these would provide hardware-based artificial intelligence with a good foundation. == Applications == These types of devices would allow for a synapse model that could realise a learning rule, by which the synaptic efficacy is altered by voltages applied to the terminals of the device. An example of such a learning rule is spike-timing-dependant-plasticty by which the weight of the synapse, in this case the conductivity, could be modulated based on the timing of pre and post synaptic spikes arriving at each terminal. The advantage of this approach over two terminal memristive devices is that read and write protocols have the possibility to occur simultaneously and distinctly.

    Read more →