Statistical learning theory

Statistical learning theory

Statistical learning theory is a framework for machine learning drawing from the fields of statistics and functional analysis. Statistical learning theory deals with the statistical inference problem of finding a predictive function based on data. Statistical learning theory has led to successful applications in fields such as computer vision, speech recognition, and bioinformatics. == Introduction == The goals of learning are understanding and prediction. Learning falls into many categories, including supervised learning, unsupervised learning, online learning, and reinforcement learning. From the perspective of statistical learning theory, supervised learning is best understood. Supervised learning involves learning from a training set of data. Every point in the training is an input–output pair, where the input maps to an output. The learning problem consists of inferring the function that maps between the input and the output, such that the learned function can be used to predict the output from future input. Depending on the type of output, supervised learning problems are either problems of regression or problems of classification. If the output takes a continuous range of values, it is a regression problem. Using Ohm's law as an example, a regression could be performed with voltage as input and current as an output. The regression would find the functional relationship between voltage and current to be R {\displaystyle R} , such that V = I R {\displaystyle V=IR} Classification problems are those for which the output will be an element from a discrete set of labels. Classification is very common for machine learning applications. In facial recognition, for instance, a picture of a person's face would be the input, and the output label would be that person's name. The input would be represented by a large multidimensional vector whose elements represent pixels in the picture. After learning a function based on the training set data, that function is validated on a test set of data, data that did not appear in the training set. == Formal description == Take X {\displaystyle X} to be the vector space of all possible inputs, and Y {\displaystyle Y} to be the vector space of all possible outputs. Statistical learning theory takes the perspective that there is some unknown probability distribution over the product space Z = X × Y {\displaystyle Z=X\times Y} , i.e. there exists some unknown p ( z ) = p ( x , y ) {\displaystyle p(z)=p(\mathbf {x} ,y)} . The training set is made up of n {\displaystyle n} samples from this probability distribution, and is notated S = { ( x 1 , y 1 ) , … , ( x n , y n ) } = { z 1 , … , z n } {\displaystyle S=\{(\mathbf {x} _{1},y_{1}),\dots ,(\mathbf {x} _{n},y_{n})\}=\{\mathbf {z} _{1},\dots ,\mathbf {z} _{n}\}} Every x i {\displaystyle \mathbf {x} _{i}} is an input vector from the training data, and y i {\displaystyle y_{i}} is the output that corresponds to it. In this formalism, the inference problem consists of finding a function f : X → Y {\displaystyle f:X\to Y} such that f ( x ) ∼ y {\displaystyle f(\mathbf {x} )\sim y} . Let H {\displaystyle {\mathcal {H}}} be a space of functions f : X → Y {\displaystyle f:X\to Y} called the hypothesis space. The hypothesis space is the space of functions the algorithm will search through. Let V ( f ( x ) , y ) {\displaystyle V(f(\mathbf {x} ),y)} be the loss function, a metric for the difference between the predicted value f ( x ) {\displaystyle f(\mathbf {x} )} and the actual value y {\displaystyle y} . The expected risk is defined to be I [ f ] = ∫ X × Y V ( f ( x ) , y ) p ( x , y ) d x d y {\displaystyle I[f]=\int _{X\times Y}V(f(\mathbf {x} ),y)\,p(\mathbf {x} ,y)\,d\mathbf {x} \,dy} The target function, the best possible function f {\displaystyle f} that can be chosen, is given by the f {\displaystyle f} that satisfies f = argmin h ∈ H ⁡ I [ h ] {\displaystyle f=\mathop {\operatorname {argmin} } _{h\in {\mathcal {H}}}I[h]} Because the probability distribution p ( x , y ) {\displaystyle p(\mathbf {x} ,y)} is unknown, a proxy measure for the expected risk must be used. This measure is based on the training set, a sample from this unknown probability distribution. It is called the empirical risk I S [ f ] = 1 n ∑ i = 1 n V ( f ( x i ) , y i ) {\displaystyle I_{S}[f]={\frac {1}{n}}\sum _{i=1}^{n}V(f(\mathbf {x} _{i}),y_{i})} A learning algorithm that chooses the function f S {\displaystyle f_{S}} that minimizes the empirical risk is called empirical risk minimization. == Loss functions == The choice of loss function is a determining factor on the function f S {\displaystyle f_{S}} that will be chosen by the learning algorithm. The loss function also affects the convergence rate for an algorithm. It is important for the loss function to be convex. Different loss functions are used depending on whether the problem is one of regression or one of classification. === Regression === The most common loss function for regression is the square loss function (also known as the L2-norm). This familiar loss function is used in Ordinary Least Squares regression. The form is: V ( f ( x ) , y ) = ( y − f ( x ) ) 2 {\displaystyle V(f(\mathbf {x} ),y)=(y-f(\mathbf {x} ))^{2}} The absolute value loss (also known as the L1-norm) is also sometimes used: V ( f ( x ) , y ) = | y − f ( x ) | {\displaystyle V(f(\mathbf {x} ),y)=|y-f(\mathbf {x} )|} === Classification === In some sense the 0-1 indicator function is the most natural loss function for classification. It takes the value 0 if the predicted output is the same as the actual output, and it takes the value 1 if the predicted output is different from the actual output. For binary classification with Y = { − 1 , 1 } {\displaystyle Y=\{-1,1\}} , this is: V ( f ( x ) , y ) = θ ( − y f ( x ) ) {\displaystyle V(f(\mathbf {x} ),y)=\theta (-yf(\mathbf {x} ))} where θ {\displaystyle \theta } is the Heaviside step function. == Regularization == In machine learning problems, a major problem that arises is that of overfitting. Because learning is a prediction problem, the goal is not to find a function that most closely fits the (previously observed) data, but to find one that will most accurately predict output from future input. Empirical risk minimization runs this risk of overfitting: finding a function that matches the data exactly but does not predict future output well. Overfitting is symptomatic of unstable solutions; a small perturbation in the training set data would cause a large variation in the learned function. It can be shown that if the stability for the solution can be guaranteed, generalization and consistency are guaranteed as well. Regularization can solve the overfitting problem and give the problem stability. Regularization can be accomplished by restricting the hypothesis space H {\displaystyle {\mathcal {H}}} . A common example would be restricting H {\displaystyle {\mathcal {H}}} to linear functions: this can be seen as a reduction to the standard problem of linear regression. H {\displaystyle {\mathcal {H}}} could also be restricted to polynomial of degree p {\displaystyle p} , exponentials, or bounded functions on L1. Restriction of the hypothesis space avoids overfitting because the form of the potential functions are limited, and so does not allow for the choice of a function that gives empirical risk arbitrarily close to zero. One example of regularization is Tikhonov regularization. This consists of minimizing 1 n ∑ i = 1 n V ( f ( x i ) , y i ) + γ ‖ f ‖ H 2 {\displaystyle {\frac {1}{n}}\sum _{i=1}^{n}V(f(\mathbf {x} _{i}),y_{i})+\gamma \left\|f\right\|_{\mathcal {H}}^{2}} where γ {\displaystyle \gamma } is a fixed and positive parameter, the regularization parameter. Tikhonov regularization ensures existence, uniqueness, and stability of the solution. == Bounding empirical risk == Consider a binary classifier f : X → { 0 , 1 } {\displaystyle f:{\mathcal {X}}\to \{0,1\}} . We can apply Hoeffding's inequality to bound the probability that the empirical risk deviates from the true risk to be a Sub-Gaussian distribution. P ( | R ^ ( f ) − R ( f ) | ≥ ϵ ) ≤ 2 e − 2 n ϵ 2 {\displaystyle \mathbb {P} (|{\hat {R}}(f)-R(f)|\geq \epsilon )\leq 2e^{-2n\epsilon ^{2}}} But generally, when we do empirical risk minimization, we are not given a classifier; we must choose it. Therefore, a more useful result is to bound the probability of the supremum of the difference over the whole class. P ( sup f ∈ F | R ^ ( f ) − R ( f ) | ≥ ϵ ) ≤ 2 S ( F , n ) e − n ϵ 2 / 8 ≈ n d e − n ϵ 2 / 8 {\displaystyle \mathbb {P} {\bigg (}\sup _{f\in {\mathcal {F}}}|{\hat {R}}(f)-R(f)|\geq \epsilon {\bigg )}\leq 2S({\mathcal {F}},n)e^{-n\epsilon ^{2}/8}\approx n^{d}e^{-n\epsilon ^{2}/8}} where S ( F , n ) {\displaystyle S({\mathcal {F}},n)} is the shattering number and n {\displaystyle n} is the number of samples in your dataset. The exponential term comes from Hoeffding but there is an extra cost of taking the supremum over the whole cla

Plug computer

A plug computer is a small-form-factor computer whose chassis contains the AC power plug, and thus plugs directly into the wall. Alternatively, the computer may resemble an AC adapter or a similarly small device. Plug computers are often configured for use in the home or office as compact computer. == Description == Plug computers consist of a high-performance, low-power system-on-a-chip processor, with several I/O hardware ports (USB ports, Ethernet connectors, etc.). Most versions do not have provisions for connecting a display and are best suited to running media servers, back-up services, or file sharing and remote access functions; thus acting as a bridge between in-home protocols (such as Digital Living Network Alliance (DLNA) and Server Message Block (SMB)) and cloud-based services. There are, however, plug computer offerings that have analog VGA monitor and/or HDMI connectors, which, along with multiple USB ports, permit the use of a display, keyboard, and mouse, thus making them full-fledged, low-power alternatives to desktop and laptop computers. They typically run any of a number of Linux distributions. Plug computers typically consume little power and are inexpensive. == History == A number of other devices of this type began to appear at the 2009 Consumer Electronics Show. On January 6, 2009 CTERA Networks launched a device called CloudPlug that provides online backup at local disk speeds and overlays a file sharing service. The device also transforms any external USB hard drive into a network-attached storage device. On January 7, 2009, Cloud Engines unveiled the Pogoplug network access server. On January 8, 2009, Axentra announced availability of their HipServ platform. On February 23, 2009, Marvell Technology Group announced its plans to build a mini-industry around plug computers. On August 19, 2009, CodeLathe announced availability of their TonidoPlug network access server. On November 13, 2009 QuadAxis launched its plug computing device product line and development platform, featuring the QuadPlug and QuadPC and running QuadMix, a modified Linux. On January 5, 2010, Iomega announced their iConnect network access server. On January 7, 2010 Pbxnsip launched its plug computing device the sipJack running pbxnsip: an IP Communications platform.

Stochastic gradient descent

Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). Especially in high-dimensional optimization problems this reduces the very high computational burden, achieving faster iterations in exchange for a lower convergence rate. The basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm of the 1950s. Today, stochastic gradient descent has become an important optimization method in machine learning. == Background == Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum: Q ( w ) = 1 n ∑ i = 1 n Q i ( w ) , {\displaystyle Q(w)={\frac {1}{n}}\sum _{i=1}^{n}Q_{i}(w),} where the parameter w {\displaystyle w} that minimizes Q ( w ) {\displaystyle Q(w)} is to be estimated. Each summand function Q i {\displaystyle Q_{i}} is typically associated with the i {\displaystyle i} -th observation in the data set (used for training). In classical statistics, sum-minimization problems arise in least squares and in maximum-likelihood estimation (for independent observations). The general class of estimators that arise as minimizers of sums are called M-estimators. However, in statistics, it has been long recognized that requiring even local minimization is too restrictive for some problems of maximum-likelihood estimation. Therefore, contemporary statistical theorists often consider stationary points of the likelihood function (or zeros of its derivative, the score function, and other estimating equations). The sum-minimization problem also arises for empirical risk minimization. There, Q i ( w ) {\displaystyle Q_{i}(w)} is the value of the loss function at i {\displaystyle i} -th example, and Q ( w ) {\displaystyle Q(w)} is the empirical risk. When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations: w := w − η ∇ Q ( w ) = w − η n ∑ i = 1 n ∇ Q i ( w ) . {\displaystyle w:=w-\eta \,\nabla Q(w)=w-{\frac {\eta }{n}}\sum _{i=1}^{n}\nabla Q_{i}(w).} The step size is denoted by η {\displaystyle \eta } (sometimes called the learning rate in machine learning) and here " := {\displaystyle :=} " denotes the update of a variable in the algorithm. In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-function and the sum gradient. For example, in statistics, one-parameter exponential families allow economical function-evaluations and gradient-evaluations. However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems. == Iterative method == In stochastic (or "on-line") gradient descent, the true gradient of Q ( w ) {\displaystyle Q(w)} is approximated by a gradient at a single sample: w := w − η ∇ Q i ( w ) . {\displaystyle w:=w-\eta \,\nabla Q_{i}(w).} As the algorithm sweeps through the training set, it performs the above update for each training sample. Several passes can be made over the training set until the algorithm converges. If this is done, the data can be shuffled for each pass to prevent cycles. Typical implementations may use an adaptive learning rate so that the algorithm converges. In pseudocode, stochastic gradient descent can be presented as : A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a "mini-batch") at each step. This can perform significantly better than "true" stochastic gradient descent described, because the code can make use of vectorization libraries rather than computing each step separately as was first shown in where it was called "the bunch-mode back-propagation algorithm". It may also result in smoother convergence, as the gradient computed at each step is averaged over more training samples. The convergence of stochastic gradient descent has been analyzed using the theories of convex minimization and of stochastic approximation. Briefly, when the learning rates η {\displaystyle \eta } decrease with an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise converges almost surely to a local minimum. This is in fact a consequence of the Robbins–Siegmund theorem. == Linear regression == Suppose we want to fit a straight line y ^ = w 1 + w 2 x {\displaystyle {\hat {y}}=w_{1}+w_{2}x} to a training set with observations ( ( x 1 , y 1 ) , ( x 2 , y 2 ) … , ( x n , y n ) ) {\displaystyle ((x_{1},y_{1}),(x_{2},y_{2})\ldots ,(x_{n},y_{n}))} and corresponding estimated responses ( y ^ 1 , y ^ 2 , … , y ^ n ) {\displaystyle ({\hat {y}}_{1},{\hat {y}}_{2},\ldots ,{\hat {y}}_{n})} using least squares. The objective function to be minimized is Q ( w ) = ∑ i = 1 n Q i ( w ) = ∑ i = 1 n ( y ^ i − y i ) 2 = ∑ i = 1 n ( w 1 + w 2 x i − y i ) 2 . {\displaystyle Q(w)=\sum _{i=1}^{n}Q_{i}(w)=\sum _{i=1}^{n}\left({\hat {y}}_{i}-y_{i}\right)^{2}=\sum _{i=1}^{n}\left(w_{1}+w_{2}x_{i}-y_{i}\right)^{2}.} The last line in the above pseudocode for this specific problem will become: [ w 1 w 2 ] ← [ w 1 w 2 ] − η [ ∂ ∂ w 1 ( w 1 + w 2 x i − y i ) 2 ∂ ∂ w 2 ( w 1 + w 2 x i − y i ) 2 ] = [ w 1 w 2 ] − η [ 2 ( w 1 + w 2 x i − y i ) 2 x i ( w 1 + w 2 x i − y i ) ] . {\displaystyle {\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}\leftarrow {\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}-\eta {\begin{bmatrix}{\frac {\partial }{\partial w_{1}}}(w_{1}+w_{2}x_{i}-y_{i})^{2}\\{\frac {\partial }{\partial w_{2}}}(w_{1}+w_{2}x_{i}-y_{i})^{2}\end{bmatrix}}={\begin{bmatrix}w_{1}\\w_{2}\end{bmatrix}}-\eta {\begin{bmatrix}2(w_{1}+w_{2}x_{i}-y_{i})\\2x_{i}(w_{1}+w_{2}x_{i}-y_{i})\end{bmatrix}}.} Note that in each iteration or update step, the gradient is only evaluated at a single x i {\displaystyle x_{i}} . This is the key difference between stochastic gradient descent and batched gradient descent. In general, given a linear regression y ^ = ∑ k ∈ 1 : m w k x k {\displaystyle {\hat {y}}=\sum _{k\in 1:m}w_{k}x_{k}} problem, stochastic gradient descent behaves differently when m < n {\displaystyle m

Concept class

In computational learning theory in mathematics, a concept over a domain X is a total Boolean function over X. A concept class is a class of concepts. Concept classes are a subject of computational learning theory. Concept class terminology frequently appears in model theory associated with probably approximately correct (PAC) learning. In this setting, if one takes a set Y as a set of (classifier output) labels, and X is a set of examples, the map c : X → Y {\displaystyle c:X\to Y} , i.e. from examples to classifier labels (where Y = { 0 , 1 } {\displaystyle Y=\{0,1\}} and where c is a subset of X), c is then said to be a concept. A concept class C {\displaystyle C} is then a collection of such concepts. Given a class of concepts C, a subclass D is reachable if there exists a sample s such that D contains exactly those concepts in C that are extensions to s. Not every subclass is reachable. == Background == A sample s {\displaystyle s} is a partial function from X {\displaystyle X} to { 0 , 1 } {\displaystyle \{0,1\}} . Identifying a concept with its characteristic function mapping X {\displaystyle X} to { 0 , 1 } {\displaystyle \{0,1\}} , it is a special case of a sample. Two samples are consistent if they agree on the intersection of their domains. A sample s ′ {\displaystyle s'} extends another sample s {\displaystyle s} if the two are consistent and the domain of s {\displaystyle s} is contained in the domain of s ′ {\displaystyle s'} . == Examples == Suppose that C = S + ( X ) {\displaystyle C=S^{+}(X)} . Then: the subclass { { x } } {\displaystyle \{\{x\}\}} is reachable with the sample s = { ( x , 1 ) } {\displaystyle s=\{(x,1)\}} ; the subclass S + ( Y ) {\displaystyle S^{+}(Y)} for Y ⊆ X {\displaystyle Y\subseteq X} are reachable with a sample that maps the elements of X − Y {\displaystyle X-Y} to zero; the subclass S ( X ) {\displaystyle S(X)} , which consists of the singleton sets, is not reachable. == Applications == Let C {\displaystyle C} be some concept class. For any concept c ∈ C {\displaystyle c\in C} , we call this concept 1 / d {\displaystyle 1/d} -good for a positive integer d {\displaystyle d} if, for all x ∈ X {\displaystyle x\in X} , at least 1 / d {\displaystyle 1/d} of the concepts in C {\displaystyle C} agree with c {\displaystyle c} on the classification of x {\displaystyle x} . The fingerprint dimension F D ( C ) {\displaystyle FD(C)} of the entire concept class C {\displaystyle C} is the least positive integer d {\displaystyle d} such that every reachable subclass C ′ ⊆ C {\displaystyle C'\subseteq C} contains a concept that is 1 / d {\displaystyle 1/d} -good for it. This quantity can be used to bound the minimum number of equivalence queries needed to learn a class of concepts according to the following inequality: F D ( C ) − 1 ≤ # E Q ( C ) ≤ ⌈ F D ( C ) ln ⁡ ( | C | ) ⌉ {\textstyle FD(C)-1\leq \#EQ(C)\leq \lceil FD(C)\ln(|C|)\rceil } .

Modern Hopfield network

Modern Hopfield networks (also known as Dense Associative Memories) are generalizations of the classical Hopfield networks that break the linear scaling relationship between the number of input features and the number of stored memories. This is achieved by introducing stronger non-linearities (either in the energy function or neurons’ activation functions) leading to super-linear (even an exponential) memory storage capacity as a function of the number of feature neurons. The network still requires a sufficient number of hidden neurons. The key theoretical idea behind the modern Hopfield networks is to use an energy function and an update rule that is more sharply peaked around the stored memories in the space of neuron’s configurations compared to the classical Hopfield network. == Classical Hopfield networks == Hopfield networks are recurrent neural networks with dynamical trajectories converging to fixed point attractor states and described by an energy function. The state of each model neuron i {\textstyle i} is defined by a time-dependent variable V i {\displaystyle V_{i}} , which can be chosen to be either discrete or continuous. A complete model describes the mathematics of how the future state of activity of each neuron depends on the known present or previous activity of all the neurons. In the original Hopfield model of associative memory, the variables were binary, and the dynamics were described by a one-at-a-time update of the state of the neurons. An energy function quadratic in the V i {\displaystyle V_{i}} was defined, and the dynamics consisted of changing the activity of each single neuron i {\displaystyle i} only if doing so would lower the total energy of the system. This same idea was extended to the case of V i {\displaystyle V_{i}} being a continuous variable representing the output of neuron i {\displaystyle i} , and V i {\displaystyle V_{i}} being a monotonic function of an input current. The dynamics became expressed as a set of first-order differential equations for which the "energy" of the system always decreased. The energy in the continuous case has one term which is quadratic in the V i {\displaystyle V_{i}} (as in the binary model), and a second term which depends on the gain function (neuron's activation function). While having many desirable properties of associative memory, both of these classical systems suffer from a small memory storage capacity, which scales linearly with the number of input features. == Discrete variables == A simple example of the Modern Hopfield network can be written in terms of binary variables V i {\displaystyle V_{i}} that represent the active V i = + 1 {\displaystyle V_{i}=+1} and inactive V i = − 1 {\displaystyle V_{i}=-1} state of the model neuron i {\displaystyle i} . E = − ∑ μ = 1 N mem F ( ∑ i = 1 N f ξ μ i V i ) {\displaystyle E=-\sum \limits _{\mu =1}^{N_{\text{mem}}}F{\Big (}\sum \limits _{i=1}^{N_{f}}\xi _{\mu i}V_{i}{\Big )}} In this formula the weights ξ μ i {\textstyle \xi _{\mu i}} represent the matrix of memory vectors (index μ = 1... N mem {\displaystyle \mu =1...N_{\text{mem}}} enumerates different memories, and index i = 1... N f {\displaystyle i=1...N_{f}} enumerates the content of each memory corresponding to the i {\displaystyle i} -th feature neuron), and the function F ( x ) {\displaystyle F(x)} is a rapidly growing non-linear function. The update rule for individual neurons (in the asynchronous case) can be written in the following form V i ( t + 1 ) = sign ⁡ [ ∑ μ = 1 N mem ( F ( ξ μ i + ∑ j ≠ i ξ μ j V j ( t ) ) − F ( − ξ μ i + ∑ j ≠ i ξ μ j V j ( t ) ) ) ] {\displaystyle V_{i}^{(t+1)}=\operatorname {sign} {\bigg [}\sum \limits _{\mu =1}^{N_{\text{mem}}}{\bigg (}F{\Big (}\xi _{\mu i}+\sum \limits _{j\neq i}\xi _{\mu j}V_{j}^{(t)}{\Big )}-F{\Big (}-\xi _{\mu i}+\sum \limits _{j\neq i}\xi _{\mu j}V_{j}^{(t)}{\Big )}{\bigg )}{\bigg ]}} which states that in order to calculate the updated state of the i {\textstyle i} -th neuron the network compares two energies: the energy of the network with the i {\displaystyle i} -th neuron in the ON state and the energy of the network with the i {\displaystyle i} -th neuron in the OFF state, given the states of the remaining neuron. The updated state of the i {\displaystyle i} -th neuron selects the state that has the lowest of the two energies. In the limiting case when the non-linear energy function is quadratic F ( x ) = x 2 {\displaystyle F(x)=x^{2}} these equations reduce to the familiar energy function and the update rule for the classical binary Hopfield network. The memory storage capacity of these networks can be calculated for random binary patterns. For the power energy function F ( x ) = x n {\displaystyle F(x)=x^{n}} the maximal number of memories that can be stored and retrieved from this network without errors is given by N mem max ≈ 1 2 ( 2 n − 3 ) ! ! N f n − 1 ln ⁡ ( N f ) {\displaystyle N_{\text{mem}}^{\max }\approx {\frac {1}{2(2n-3)!!}}{\frac {N_{f}^{n-1}}{\ln(N_{f})}}} For an exponential energy function F ( x ) = e x {\textstyle F(x)=e^{x}} the memory storage capacity is exponential in the number of feature neurons N mem max ≈ 2 N f / 2 {\displaystyle N_{\text{mem}}^{\max }\approx 2^{N_{f}/2}} == Continuous variables == Modern Hopfield networks or Dense Associative Memories can be best understood in continuous variables and continuous time. Consider the network architecture, shown in Fig.1, and the equations for the neurons' state evolutionwhere the currents of the feature neurons are denoted by x i {\textstyle x_{i}} , and the currents of the memory neurons are denoted by h μ {\displaystyle h_{\mu }} ( h {\displaystyle h} stands for hidden neurons). There are no synaptic connections among the feature neurons or the memory neurons. A matrix ξ μ i {\displaystyle \xi _{\mu i}} denotes the strength of synapses from a feature neuron i {\displaystyle i} to the memory neuron μ {\displaystyle \mu } . The synapses are assumed to be symmetric, so that the same value characterizes a different physical synapse from the memory neuron μ {\displaystyle \mu } to the feature neuron i {\displaystyle i} . The outputs of the memory neurons and the feature neurons are denoted by f μ {\displaystyle f_{\mu }} and g i {\displaystyle g_{i}} , which are non-linear functions of the corresponding currents. In general these outputs can depend on the currents of all the neurons in that layer so that f μ = f ( { h μ } ) {\displaystyle f_{\mu }=f(\{h_{\mu }\})} and g i = g ( { x i } ) {\textstyle g_{i}=g(\{x_{i}\})} . It is convenient to define these activation function as derivatives of the Lagrangian functions for the two groups of neuronsThis way the specific form of the equations for neuron's states is completely defined once the Lagrangian functions are specified. Finally, the time constants for the two groups of neurons are denoted by τ f {\displaystyle \tau _{f}} and τ h {\displaystyle \tau _{h}} , I i {\displaystyle I_{i}} is the input current to the network that can be driven by the presented data. General systems of non-linear differential equations can have many complicated behaviors that can depend on the choice of the non-linearities and the initial conditions. For Hopfield networks, however, this is not the case - the dynamical trajectories always converge to a fixed point attractor state. This property is achieved because these equations are specifically engineered so that they have an underlying energy function The terms grouped into square brackets represent a Legendre transform of the Lagrangian function with respect to the states of the neurons. If the Hessian matrices of the Lagrangian functions are positive semi-definite, the energy function is guaranteed to decrease on the dynamical trajectory This property makes it possible to prove that the system of dynamical equations describing temporal evolution of neurons' activities will eventually reach a fixed point attractor state. In certain situations one can assume that the dynamics of hidden neurons equilibrates at a much faster time scale compared to the feature neurons, τ h ≪ τ f {\textstyle \tau _{h}\ll \tau _{f}} . In this case the steady state solution of the second equation in the system (1) can be used to express the currents of the hidden units through the outputs of the feature neurons. This makes it possible to reduce the general theory (1) to an effective theory for feature neurons only. The resulting effective update rules and the energies for various common choices of the Lagrangian functions are shown in Fig.2. In the case of log-sum-exponential Lagrangian function the update rule (if applied once) for the states of the feature neurons is the attention mechanism commonly used in many modern AI systems (see Ref. for the derivation of this result from the continuous time formulation). == Relationship to classical Hopfield network with continuous variables == Classical formulation of continuous Hopfield networks can be understood as a

Free boundary condition

In image processing, the free boundary condition is the convention used when applying a convolution kernel to a digital image in which pixel locations that lie outside the image boundaries are interpreted as having a value of zero.[1] The question of what value to assign out-of-bounds pixels may arise, for instance, when applying a 3×3 kernel to the corner pixel in an image.

One-shot learning (computer vision)

One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning-based object categorization algorithms require training on hundreds or thousands of examples, one-shot learning aims to classify objects from one, or only a few, examples. The term few-shot learning is also used for these problems, especially when more than one example is needed. == Motivation == The ability to learn object categories from few examples, and at a rapid pace, has been demonstrated in humans. It is estimated that a child learns almost all of the 10 ~ 30 thousand object categories in the world by age six. This is due not only to the human mind's computational power, but also to its ability to synthesize and learn new object categories from existing information about different, previously learned categories. Given two examples from two object categories: one, an unknown object composed of familiar shapes, the second, an unknown, amorphous shape; it is much easier for humans to recognize the former than the latter, suggesting that humans make use of previously learned categories when learning new ones. The key motivation for solving one-shot learning is that systems, like humans, can use knowledge about object categories to classify new objects. == Background == As with most classification schemes, one-shot learning involves three main challenges: Representation: How should objects and categories be described? Learning: How can such descriptions be created? Recognition: How can a known object be filtered from enveloping clutter, irrespective of occlusion, viewpoint, and lighting? One-shot learning differs from single object recognition and standard category recognition algorithms in its emphasis on knowledge transfer, which makes use of previously learned categories. Model parameters: Reuses model parameters, based on the similarity between old and new categories. Categories are first learned on numerous training examples, then new categories are learned using transformations of model parameters from those initial categories or selecting relevant parameters for a classifier. Feature sharing: Shares parts or features of objects across categories. One algorithm extracts "diagnostic information" in patches from already learned categories by maximizing the patches' mutual information, and then applies these features to the learning of a new category. A dog category, for example, may be learned in one shot from previous knowledge of horse and cow categories, because dog objects may contain similar distinguishing patches. Contextual information: Appeals to global knowledge of the scene in which the object appears. Such global information can be used as frequency distributions in a conditional random field framework to recognize objects. Alternatively context can consider camera height and scene geometry. Algorithms of this type have two advantages. First, they learn object categories that are relatively dissimilar; and second, they perform well in ad hoc situations where an image has not been hand-cropped and aligned. == Theory == The Bayesian one-shot learning algorithm represents the foreground and background of images as parametrized by a mixture of constellation models. During the learning phase, the parameters of these models are learned using a conjugate density parameter posterior and variational Bayesian expectation–maximization (VBEM). In this stage the previously learned object categories inform the choice of model parameters via transfer by contextual information. For object recognition on new images, the posterior obtained during the learning phase is used in a Bayesian decision framework to estimate the ratio of p(object | test, train) to p(background clutter | test, train) where p is the probability of the outcome. === Bayesian framework === Given the task of finding a particular object in a query image, the overall objective of the Bayesian one-shot learning algorithm is to compare the probability that object is present vs the probability that only background clutter is present. If the former probability is higher, the algorithm reports the object's presence, otherwise the algorithm reports its absence. To compute these probabilities, the object class must be modeled from a set of (1 ~ 5) training images containing examples. To formalize these ideas, let I {\displaystyle I} be the query image, which contains either an example of the foreground category O f g {\displaystyle O_{fg}} or only background clutter of a generic background category O b g {\displaystyle O_{bg}} . Also let I t {\displaystyle I_{t}} be the set of training images used as the foreground category. The decision of whether I {\displaystyle I} contains an object from the foreground category, or only clutter from the background category is: R = p ( O f g | I , I t ) p ( O b g | I , I t ) = p ( I | I t , O f g ) p ( O f g ) p ( I | I t , O b g ) p ( O b g ) , {\displaystyle R={\frac {p(O_{fg}|I,I_{t})}{p(O_{bg}|I,I_{t})}}={\frac {p(I|I_{t},O_{fg})p(O_{fg})}{p(I|I_{t},O_{bg})p(O_{bg})}},} where the class posteriors p ( O f g | I , I t ) {\displaystyle p(O_{fg}|I,I_{t})} and p ( O b g | I , I t ) {\displaystyle p(O_{bg}|I,I_{t})} have been expanded by Bayes' theorem, yielding a ratio of likelihoods and a ratio of object category priors. We decide that the image I {\displaystyle I} contains an object from the foreground class if R {\displaystyle R} exceeds a certain threshold T {\displaystyle T} . We next introduce parametric models for the foreground and background categories with parameters θ {\displaystyle \theta } and θ b g {\displaystyle \theta _{bg}} respectively. This foreground parametric model is learned during the learning stage from I t {\displaystyle I_{t}} , as well as prior information of learned categories. The background model we assume to be uniform across images. Omitting the constant ratio of category priors, p ( O f g ) p ( O b g ) {\displaystyle {\frac {p(O_{fg})}{p(O_{bg})}}} , and parametrizing over θ {\displaystyle \theta } and θ b g {\displaystyle \theta _{bg}} yields R ∝ ∫ p ( I | θ , O f g ) p ( θ | I t , O f g ) d θ ∫ p ( I | θ b g , O b g ) p ( θ b g | I t , O b g ) d θ b g = ∫ p ( I | θ ) p ( θ | I t , O f g ) d θ ∫ p ( I | θ b g ) p ( θ b g | I t , O b g ) d θ b g {\displaystyle R\propto {\frac {\int {p(I|\theta ,O_{fg})p(\theta |I_{t},O_{fg})}d\theta }{\int {p(I|\theta _{bg},O_{bg})p(\theta _{bg}|I_{t},O_{bg})}d\theta _{bg}}}={\frac {\int {p(I|\theta )p(\theta |I_{t},O_{fg})}d\theta }{\int {p(I|\theta _{bg})p(\theta _{bg}|I_{t},O_{bg})}d\theta _{bg}}}} , having simplified p ( I | θ , O f g ) {\displaystyle p(I|\theta ,O_{fg})} and p ( I | θ , O b g ) {\displaystyle p(I|\theta ,O_{bg})} to p ( I | θ f g ) {\displaystyle p(I|\theta _{fg})} and p ( I | θ b g ) . {\displaystyle p(I|\theta _{bg}).} The posterior distribution of model parameters given the training images, p ( θ | I t , O f g ) {\displaystyle p(\theta |I_{t},O_{fg})} is estimated in the learning phase. In this estimation, one-shot learning differs sharply from more traditional Bayesian estimation models that approximate the integral as δ ( θ M L ) {\displaystyle \delta (\theta ^{ML})} . Instead, it uses a variational approach using prior information from previously learned categories. However, the traditional maximum likelihood estimation of the model parameters is used for the background model and the categories learned in advance through training. === Object category model === For each query image I {\displaystyle I} and training images I t {\displaystyle I_{t}} , a constellation model is used for representation. To obtain this model for a given image I {\displaystyle I} , first a set of N interesting regions is detected in the image using the Kadir–Brady saliency detector. Each region selected is represented by a location in the image, X i {\displaystyle X_{i}} and a description of its appearance, A i {\displaystyle A_{i}} . Letting X = ∑ i = 1 N X i , A = ∑ i = 1 N A i {\displaystyle X=\sum _{i=1}^{N}X_{i},A=\sum _{i=1}^{N}A_{i}} and X t {\displaystyle X_{t}} and A t {\displaystyle A_{t}} the analogous representations for training images, the expression for R becomes: R ∝ ∫ p ( X , A | θ , O f g ) p ( θ | X t , A t , O f g ) d θ ∫ p ( X , A | θ b g , O b g ) p ( θ b g | X t , A t , O b g ) d θ b g = ∫ p ( X , A | θ ) p ( θ | X t , A t , O f g ) d θ ∫ p ( X , A | θ b g ) p ( θ b g | X t , A t , O b g ) d θ b g {\displaystyle R\propto {\frac {\int {p(X,A|\theta ,O_{fg})p(\theta |X_{t},A_{t},O_{fg})}d\theta }{\int {p(X,A|\theta _{bg},O_{bg})p(\theta _{bg}|X_{t},A_{t},O_{bg})}d\theta _{bg}}}={\frac {\int {p(X,A|\theta )p(\theta |X_{t},A_{t},O_{fg})}d\theta }{\int {p(X,A|\theta _{bg})p(\theta _{bg}|X_{t},A_{t},O_{bg})}\,d\theta _{bg}}}} The likelihoods p ( X , A | θ ) {\displaystyle p(X,A|\theta )} and p ( X , A | θ b g ) {\displaystyle p(X,A|\theta _{bg})} are represented as mixtures of constellation models. A typical constellation model has