AI Data Farms

AI Data Farms — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Kernel Assisted Superuser

    Kernel Assisted Superuser

    Kernel Assisted Superuser (short: KernelSU) is an alternative method for obtaining root privileges on Android devices. KernelSU implementations are developed as free and open-source software under the terms of the GPLv3 license. == Technical differences == KernelSU differs from other methods in that root access is implemented directly in the kernel. Compared to other root methods that run in userspace, such as Magisk, this has the advantage that commands with su can be executed like normal commands, but still have root privileges. This is not prevented by SELinux or detected by the PlayIntegrity API check, so applications that use it will continue to function. Unlike Magisk, /system/bin/su is a virtual file implemented by hooking system calls with kprobes, and overlayfs is used for systemless modifications to the system partition instead of magic mount. == History == The planning of KernelSU was started in 2018 by developer Jason Donenfeld, also known as XDA user zx2c4. The lack of a root manager app and the difficulty of creating boot images meant that KernelSU was not suitable for productive use, and for a long time this method remained theoretical and could only be used by developers. In 2021, Google launched Generic Kernel Images (GKI for short), which facilitates the creation of a set of device-independent rooted boot images. In response, the developer known on XDA as weishu, who had also worked on projects such as VirtualXposed, adapted KernelSU for GKI-compatible kernels. The adaptation, which was released in January 2023, ensures that any device booting with Linux kernel version 5.10 or higher should be compatible. In addition, the developer also offers a special manager app that, in addition to managing root privileges, also offers overlay-based modding similar to Magisk modules. As of November 2025, 310 developers have contributed to the development of the KernelSU implementation. == Distribution == KernelSU can be installed on all devices that use GKI, as well as on individually supported devices without GKI. Some custom ROMs already have it integrated by default, including ROMs such as CrDroid, Bliss OS, and Evolution X.

    Read more →
  • Stochastic variance reduction

    Stochastic variance reduction

    (Stochastic) variance reduction is an algorithmic approach to minimizing functions that can be decomposed into finite sums. By exploiting the finite sum structure, variance reduction techniques are able to achieve convergence rates that are impossible to achieve with methods that treat the objective as an infinite sum, as in the classical Stochastic approximation setting. Variance reduction approaches are widely used for training machine learning models such as logistic regression and support vector machines as these problems have finite-sum structure and uniform conditioning that make them ideal candidates for variance reduction. == Finite sum objectives == A function f {\displaystyle f} is considered to have finite sum structure if it can be decomposed into a summation or average: f ( x ) = 1 n ∑ i = 1 n f i ( x ) , {\displaystyle f(x)={\frac {1}{n}}\sum _{i=1}^{n}f_{i}(x),} where the function value and derivative of each f i {\displaystyle f_{i}} can be queried independently. Although variance reduction methods can be applied for any positive n {\displaystyle n} and any f i {\displaystyle f_{i}} structure, their favorable theoretical and practical properties arise when n {\displaystyle n} is large compared to the condition number of each f i {\displaystyle f_{i}} , and when the f i {\displaystyle f_{i}} have similar (but not necessarily identical) Lipschitz smoothness and strong convexity constants. The finite sum structure should be contrasted with the stochastic approximation setting which deals with functions of the form f ( θ ) = E ξ ⁡ [ F ( θ , ξ ) ] {\textstyle f(\theta )=\operatorname {E} _{\xi }[F(\theta ,\xi )]} which is the expected value of a function depending on a random variable ξ {\textstyle \xi } . Any finite sum problem can be optimized using a stochastic approximation algorithm by using F ( ⋅ , ξ ) = f ξ {\displaystyle F(\cdot ,\xi )=f_{\xi }} . == Rapid Convergence == Stochastic variance reduced methods without acceleration are able to find a minima of f {\displaystyle f} within accuracy ϵ > {\displaystyle \epsilon >} , i.e. f ( x ) − f ( x ∗ ) ≤ ϵ {\displaystyle f(x)-f(x_{})\leq \epsilon } in a number of steps of the order: O ( ( L μ + n ) log ⁡ ( 1 ϵ ) ) . {\displaystyle O\left(\left({\frac {L}{\mu }}+n\right)\log \left({\frac {1}{\epsilon }}\right)\right).} The number of steps depends only logarithmically on the level of accuracy required, in contrast to the stochastic approximation framework, where the number of steps O ( L / ( μ ϵ ) ) {\displaystyle O{\bigl (}L/(\mu \epsilon ){\bigr )}} required grows proportionally to the accuracy required. Stochastic variance reduction methods converge almost as fast as the gradient descent method's O ( ( L / μ ) log ⁡ ( 1 / ϵ ) ) {\displaystyle O{\bigl (}(L/\mu )\log(1/\epsilon ){\bigr )}} rate, despite using only a stochastic gradient, at a 1 / n {\displaystyle 1/n} lower cost than gradient descent. Accelerated methods in the stochastic variance reduction framework achieve even faster convergence rates, requiring only O ( ( n L μ + n ) log ⁡ ( 1 ϵ ) ) {\displaystyle O\left(\left({\sqrt {\frac {nL}{\mu }}}+n\right)\log \left({\frac {1}{\epsilon }}\right)\right)} steps to reach ϵ {\displaystyle \epsilon } accuracy, potentially n {\displaystyle {\sqrt {n}}} faster than non-accelerated methods. Lower complexity bounds. for the finite sum class establish that this rate is the fastest possible for smooth strongly convex problems. == Approaches == Variance reduction approaches fall within four main categories: table averaging methods, full-gradient snapshot methods, recursive estimator methods (e.g., SARAH), and dual methods. Each category contains methods designed for dealing with convex, non-smooth, and non-convex problems, each differing in hyper-parameter settings and other algorithmic details. === SAGA === In the SAGA method, the prototypical table averaging approach, a table of size n {\displaystyle n} is maintained that contains the last gradient witnessed for each f i {\displaystyle f_{i}} term, which we denote g i {\displaystyle g_{i}} . At each step, an index i {\displaystyle i} is sampled, and a new gradient ∇ f i ( x k ) {\displaystyle \nabla f_{i}(x_{k})} is computed. The iterate x k {\displaystyle x_{k}} is updated with: x k + 1 = x k − γ [ ∇ f i ( x k ) − g i + 1 n ∑ i = 1 n g i ] , {\displaystyle x_{k+1}=x_{k}-\gamma \left[\nabla f_{i}(x_{k})-g_{i}+{\frac {1}{n}}\sum _{i=1}^{n}g_{i}\right],} and afterwards table entry i {\displaystyle i} is updated with g i = ∇ f i ( x k ) {\displaystyle g_{i}=\nabla f_{i}(x_{k})} . SAGA is among the most popular of the variance reduction methods due to its simplicity, easily adaptable theory, and excellent performance. It is the successor of the SAG method, improving on its flexibility and performance. === SVRG === The stochastic variance reduced gradient method (SVRG), the prototypical snapshot method, uses a similar update except instead of using the average of a table it instead uses a full-gradient that is reevaluated at a snapshot point x ~ {\displaystyle {\tilde {x}}} at regular intervals of m ≥ n {\displaystyle m\geq n} iterations. The update becomes: x k + 1 = x k − γ [ ∇ f i ( x k ) − ∇ f i ( x ~ ) + ∇ f ( x ~ ) ] , {\displaystyle x_{k+1}=x_{k}-\gamma [\nabla f_{i}(x_{k})-\nabla f_{i}({\tilde {x}})+\nabla f({\tilde {x}})],} This approach requires two stochastic gradient evaluations per step, one to compute ∇ f i ( x k ) {\displaystyle \nabla f_{i}(x_{k})} and one to compute ∇ f i ( x ~ ) , {\displaystyle \nabla f_{i}({\tilde {x}}),} where-as table averaging approaches need only one. Despite the high computational cost, SVRG is popular as its simple convergence theory is highly adaptable to new optimization settings. It also has lower storage requirements than tabular averaging approaches, which make it applicable in many settings where tabular methods can not be used. === SARAH === The SARAH (stochastic recursive gradient) method maintains a recursive estimator of the gradient rather than storing a table of past gradients (as in SAGA) or computing periodic full-gradient snapshots (as in SVRG). At the start of an inner loop, a full gradient is computed at a reference point x ~ {\displaystyle {\tilde {x}}} : v 0 = ∇ f ( x ~ ) {\displaystyle v_{0}=\nabla f({\tilde {x}})} . For inner iterations, with a sampled index i k {\displaystyle i_{k}} , the gradient estimator and iterate are updated by: v k = ∇ f i k ( x k ) − ∇ f i k ( x k − 1 ) + v k − 1 , x k + 1 = x k − γ v k . {\displaystyle v_{k}=\nabla f_{i_{k}}(x_{k})-\nabla f_{i_{k}}(x_{k-1})+v_{k-1},\qquad x_{k+1}=x_{k}-\gamma v_{k}.} This recursion requires two component-gradient evaluations per step ∇ f i k ( x k ) {\displaystyle \nabla f_{i_{k}}(x_{k})} and ∇ f i k ( x k − 1 ) {\displaystyle \nabla f_{i_{k}}(x_{k-1})} but does not need to store per-sample gradients, resulting in lower memory cost than table-averaging methods. SARAH admits linear convergence for strongly convex functions and has been extended to more general nonconvex and composite problems. === SDCA === Exploiting the dual representation of the objective leads to another variance reduction approach that is particularly suited to finite-sums where each term has a structure that makes computing the convex conjugate f i ∗ , {\displaystyle f_{i}^{},} or its proximal operator tractable. The standard SDCA method considers finite sums that have additional structure compared to generic finite sum setting: f ( x ) = 1 n ∑ i = 1 n f i ( x T v i ) + λ 2 ‖ x ‖ 2 , {\displaystyle f(x)={\frac {1}{n}}\sum _{i=1}^{n}f_{i}(x^{T}v_{i})+{\frac {\lambda }{2}}\|x\|^{2},} where each f i {\displaystyle f_{i}} is 1 dimensional and each v i {\displaystyle v_{i}} is a data point associated with f i {\displaystyle f_{i}} . SDCA solves the dual problem: max α ∈ R n − 1 n ∑ i = 1 n f i ∗ ( − α i ) − λ 2 ‖ 1 λ n ∑ i = 1 n α i v i ‖ 2 , {\displaystyle \max _{\alpha \in \mathbb {R} ^{n}}-{\frac {1}{n}}\sum _{i=1}^{n}f_{i}^{}(-\alpha _{i})-{\frac {\lambda }{2}}\left\|{\frac {1}{\lambda n}}\sum _{i=1}^{n}\alpha _{i}v_{i}\right\|^{2},} by a stochastic coordinate ascent procedure, where at each step the objective is optimized with respect to a randomly chosen coordinate α i {\displaystyle \alpha _{i}} , leaving all other coordinates the same. An approximate primal solution x {\displaystyle x} can be recovered from the α {\displaystyle \alpha } values: x = 1 λ n ∑ i = 1 n α i v i {\displaystyle x={\frac {1}{\lambda n}}\sum _{i=1}^{n}\alpha _{i}v_{i}} . This method obtains similar theoretical rates of convergence to other stochastic variance reduced methods, while avoiding the need to specify a step-size parameter. It is fast in practice when λ {\displaystyle \lambda } is large, but significantly slower than the other approaches when λ {\displaystyle \lambda } is small. == Accelerated approaches == Accelerated variance reduction methods are built upon the standard methods above. The earliest approaches make use of proximal operators t

    Read more →
  • Q-learning

    Q-learning

    Q-learning is a reinforcement learning algorithm that trains an agent to assign values to its possible actions based on its current state, without requiring a model of the environment (model-free). It can handle problems with stochastic transitions and rewards without requiring adaptations. For example, in a grid maze, an agent learns to reach an exit worth 10 points. At a junction, Q-learning might assign a higher value to moving right than left if right gets to the exit faster, improving this choice by trying both directions over time. For any finite Markov decision process, Q-learning finds an optimal policy in the sense of maximizing the expected value of the total reward over any and all successive steps, starting from the current state. Q-learning can identify an optimal action-selection policy for any given finite Markov decision process, given infinite exploration time and a partly random policy. "Q" refers to the function that the algorithm computes: the expected reward—that is, the quality—of an action taken in a given state. == Reinforcement learning == Reinforcement learning involves an agent, a set of states S {\displaystyle {\mathcal {S}}} , and a set A {\displaystyle {\mathcal {A}}} of actions per state. By performing an action a ∈ A {\displaystyle a\in {\mathcal {A}}} , the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The goal of the agent is to maximize its total reward. It does this by adding the maximum reward attainable from future states to the reward for achieving its current state, effectively influencing the current action by the potential future reward. This potential reward is a weighted sum of expected values of the rewards of all future steps starting from the current state. As an example, consider the process of boarding a train, in which the reward is measured by the negative of the total time spent boarding (alternatively, the cost of boarding the train is equal to the boarding time). One strategy is to enter the train door as soon as they open, minimizing the initial wait time for yourself. If the train is crowded, however, then you will have a slow entry after the initial action of entering the door as people are fighting you to depart the train as you attempt to board. The total boarding time, or cost, is then: 0 seconds wait time + 15 seconds fight time On the next day, by random chance (exploration), you decide to wait and let other people depart first. This initially results in a longer wait time. However, less time is spent fighting the departing passengers. Overall, this path has a higher reward than that of the previous day, since the total boarding time is now: 5 second wait time + 0 second fight time Through exploration, despite the initial (patient) action resulting in a larger cost (or negative reward) than in the forceful strategy, the overall cost is lower, thus revealing a more rewarding strategy. == Algorithm == After Δ t {\displaystyle \Delta t} steps into the future the agent will decide some next step. The weight for this step is calculated as γ Δ t {\displaystyle \gamma ^{\Delta t}} , where γ {\displaystyle \gamma } (the discount factor) is a number between 0 and 1 ( 0 ≤ γ ≤ 1 {\displaystyle 0\leq \gamma \leq 1} ). Assuming γ < 1 {\displaystyle \gamma <1} , it has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). γ {\displaystyle \gamma } may also be interpreted as the probability to succeed (or survive) at every step Δ t {\displaystyle \Delta t} . The algorithm, therefore, has a function that calculates the quality of a state–action combination: Q : S × A → R {\displaystyle Q:{\mathcal {S}}\times {\mathcal {A}}\to \mathbb {R} } . Before learning begins, ⁠ Q {\displaystyle Q} ⁠ is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t {\displaystyle t} the agent selects an action A t {\displaystyle A_{t}} , observes a reward R t + 1 {\displaystyle R_{t+1}} , enters a new state S t + 1 {\displaystyle S_{t+1}} (that may depend on both the previous state S t {\displaystyle S_{t}} and the selected action), and Q {\displaystyle Q} is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the current value and the new information: Q n e w ( S t , A t ) ← ( 1 − α ⏟ learning rate ) ⋅ Q ( S t , A t ) ⏟ current value + α ⏟ learning rate ⋅ ( R t + 1 ⏟ reward + γ ⏟ discount factor ⋅ max a Q ( S t + 1 , a ) ⏟ estimate of optimal future value ⏟ new value (temporal difference target) ) {\displaystyle Q^{new}(S_{t},A_{t})\leftarrow (1-\underbrace {\alpha } _{\text{learning rate}})\cdot \underbrace {Q(S_{t},A_{t})} _{\text{current value}}+\underbrace {\alpha } _{\text{learning rate}}\cdot {\bigg (}\underbrace {\underbrace {R_{t+1}} _{\text{reward}}+\underbrace {\gamma } _{\text{discount factor}}\cdot \underbrace {\max _{a}Q(S_{t+1},a)} _{\text{estimate of optimal future value}}} _{\text{new value (temporal difference target)}}{\bigg )}} where R t + 1 {\displaystyle R_{t+1}} is the reward received when moving from the state S t {\displaystyle S_{t}} to the state S t + 1 {\displaystyle S_{t+1}} , and α {\displaystyle \alpha } is the learning rate ( 0 < α ≤ 1 ) {\displaystyle (0<\alpha \leq 1)} . Note that Q n e w ( S t , A t ) {\displaystyle Q^{new}(S_{t},A_{t})} is the sum of three terms: ( 1 − α ) Q ( S t , A t ) {\displaystyle (1-\alpha )Q(S_{t},A_{t})} : the current value (weighted by one minus the learning rate) α R t + 1 {\displaystyle \alpha \,R_{t+1}} : the reward R t + 1 {\displaystyle R_{t+1}} to obtain if action A t {\displaystyle A_{t}} is taken when in state S t {\displaystyle S_{t}} (weighted by learning rate) α γ max a Q ( S t + 1 , a ) {\displaystyle \alpha \gamma \max _{a}Q(S_{t+1},a)} : the maximum reward that can be obtained from state S t + 1 {\displaystyle S_{t+1}} (weighted by learning rate and discount factor) An episode of the algorithm ends when state S t + 1 {\displaystyle S_{t+1}} is a final or terminal state. However, Q-learning can also learn in non-episodic tasks (as a result of the property of convergent infinite series). If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops or paths. For all final states s f {\displaystyle s_{f}} , Q ( s f , a ) {\displaystyle Q(s_{f},a)} is never updated, but is set to the reward value r {\displaystyle r} observed for state s f {\displaystyle s_{f}} . In most cases, Q ( s f , a ) {\displaystyle Q(s_{f},a)} can be taken to equal zero. == Influence of variables == === Learning rate === The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing (exclusively exploiting prior knowledge), while a factor of 1 makes the agent consider only the most recent information (ignoring prior knowledge to explore possibilities). In fully deterministic environments, a learning rate of α t = 1 {\displaystyle \alpha _{t}=1} is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, often a constant learning rate is used, such as α t = 0.1 {\displaystyle \alpha _{t}=0.1} for all t {\displaystyle t} . === Discount factor === The discount factor ⁠ γ {\displaystyle \gamma } ⁠ determines the importance of future rewards. A factor of 0 will make the agent "myopic" (or short-sighted) by only considering current rewards, i.e. r t {\displaystyle r_{t}} (in the update rule above), while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For ⁠ γ = 1 {\displaystyle \gamma =1} ⁠, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards generally become infinite. Even with a discount factor only slightly lower than 1, Q-function learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value accelerates learning. === Initial conditions (Q0) === Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", can encourage exploration: no matter what action is selected, the update rule will cause it to have lower values than the other alternative, thus increasing their choice probability. The first reward r {\displaystyle r} can be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value

    Read more →
  • Ho–Kashyap algorithm

    Ho–Kashyap algorithm

    The Ho–Kashyap algorithm is an iterative method in machine learning for finding a linear decision boundary that separates two linearly separable classes. It was developed by Yu-Chi Ho and Rangasami L. Kashyap in 1965, and usually presented as a problem in linear programming. == Setup == Given a training set consisting of samples from two classes, the Ho–Kashyap algorithm seeks to find a weight vector w {\displaystyle \mathbf {w} } and a margin vector b {\displaystyle \mathbf {b} } such that: Y w = b {\displaystyle \mathbf {Yw} =\mathbf {b} } where Y {\displaystyle \mathbf {Y} } is the augmented data matrix with samples from both classes (with appropriate sign conventions, e.g., samples from class 2 are negated), w {\displaystyle \mathbf {w} } is the weight vector to be determined, and b {\displaystyle \mathbf {b} } is a positive margin vector. The algorithm minimizes the criterion function: J ( w , b ) = | | Y w − b | | 2 {\displaystyle J(\mathbf {w} ,\mathbf {b} )=||\mathbf {Yw} -\mathbf {b} ||^{2}} subject to the constraint that b > 0 {\displaystyle \mathbf {b} >\mathbf {0} } (element-wise). Given a problem of linearly separating two classes, we consider a dataset of elements { ( x i , y i ) } i ∈ 1 : N {\displaystyle \{(\mathbf {x_{i}} ,y_{i})\}_{i\in 1:N}} where y i ∈ { − 1 , + 1 } {\displaystyle y_{i}\in \{-1,+1\}} . Linearly separating them by a perceptron is equivalent to finding weight and bias w , b {\displaystyle \mathbf {w} ,b} for a perceptron, such that: [ y 1 x 1 1 ⋮ ⋮ y N x N 1 ] [ w b ] > 0 {\displaystyle {\begin{bmatrix}y_{1}\mathbf {x} _{1}&1\\\vdots &\vdots \\y_{N}\mathbf {x} _{N}&1\\\end{bmatrix}}{\begin{bmatrix}\mathbf {w} \\b\end{bmatrix}}>0} == Algorithm == The idea of the Ho–Kashyap algorithm is as follows: Given any b {\displaystyle \mathbf {b} } , the corresponding w {\displaystyle \mathbf {w} } is known: It is simply w = Y + b {\displaystyle \mathbf {w} =\mathbf {Y} ^{+}\mathbf {b} } , where Y + {\displaystyle \mathbf {Y} ^{+}} denotes the Moore–Penrose pseudoinverse of Y {\displaystyle \mathbf {Y} } . Therefore, it only remains to find b {\displaystyle \mathbf {b} } by gradient descent. However, the gradient descent may sometimes decrease some of the coordinates of b {\displaystyle \mathbf {b} } , which may cause some coordinates of b {\displaystyle \mathbf {b} } to become negative, which is undesirable. Therefore, whenever some coordinates of b {\displaystyle \mathbf {b} } would have decreased, those coordinates are unchanged instead. As for the coordinates of b {\displaystyle \mathbf {b} } that would increase, those would increase without issue. Formally, the algorithm is as follows: Initialization: Set b ( 0 ) {\displaystyle \mathbf {b} (0)} to an arbitrary positive vector, typically b ( 0 ) = 1 {\displaystyle \mathbf {b} (0)=\mathbf {1} } (a vector of ones). Set the iteration counter k = 0 {\displaystyle k=0} . Set w ( 0 ) = Y + b ( 0 ) {\displaystyle \mathbf {w} (0)=\mathbf {Y} ^{+}\mathbf {b} (0)} Loop until convergence, or until iteration counter exceeds some k m a x {\displaystyle k_{max}} . Error calculation: Compute the error vector: e ( k ) = Y w ( k ) − b ( k ) {\displaystyle \mathbf {e} (k)=\mathbf {Yw} (k)-\mathbf {b} (k)} . Margin update: Update the margin vector: b ( k + 1 ) = b ( k ) + 2 η k ( e ( k ) + | e ( k ) | ) {\displaystyle \mathbf {b} (k+1)=\mathbf {b} (k)+2\eta _{k}(\mathbf {e} (k)+|\mathbf {e} (k)|)} where η k {\displaystyle \eta _{k}} is a positive learning rate parameter, and | e ( k ) | {\displaystyle |\mathbf {e} (k)|} denotes the element-wise absolute value. Weight calculation: Compute the weight vector using the pseudoinverse: w ( k + 1 ) = Y + b ( k + 1 ) {\displaystyle \mathbf {w} (k+1)=\mathbf {Y} ^{+}\mathbf {b} (k+1)} . Convergence check: If | | e ( k ) | | ≤ θ {\displaystyle ||\mathbf {e} (k)||\leq \theta } for some predetermined threshold θ {\displaystyle \theta } (close to zero), then return b ( k + 1 ) , w ( k + 1 ) {\displaystyle \mathbf {b} (k+1),\mathbf {w} (k+1)} . if e ( k ) ≤ 0 {\displaystyle \mathbf {e} (k)\leq \mathbf {0} } (all components non-positive), return "Samples not separable.". Return "Algorithm failed to converge in time.". == Properties == If the training data is linearly separable, the algorithm converges to a solution (where e ( k ) = 0 {\displaystyle \mathbf {e} (k)=\mathbf {0} } ) in a finite number of iterations. If the data is not linearly separable, the algorithm may or may not ever reach the point where e ( k ) = 0 {\displaystyle \mathbf {e} (k)=\mathbf {0} } . However, if it does happen that e ( k ) ≤ 0 {\displaystyle \mathbf {e} (k)\leq \mathbf {0} } at some iteration, this proves non-separability. The convergence rate depends on the choice of the learning rate parameter ρ {\displaystyle \rho } and the degree of linear separability of the data. == Relationship to other algorithms == Perceptron algorithm: Both seek linear separators. The perceptron updates weights incrementally based on individual misclassified samples, while Ho–Kashyap is a batch method that processes all samples to compute the pseudoinverse and updates based on an overall error vector. Linear discriminant analysis (LDA): LDA assumes underlying Gaussian distributions with equal covariances for the classes and derives the decision boundary from these statistical assumptions. Ho–Kashyap makes no explicit distributional assumptions and instead tries to solve a system of linear inequalities directly. Support vector machines (SVM): For linearly separable data, SVMs aim to find the maximum-margin hyperplane. The Ho–Kashyap algorithm finds a separating hyperplane but not necessarily the one with the maximum margin. If the data is not separable, soft-margin SVMs allow for some misclassifications by optimizing a trade-off between margin size and misclassification penalty, while Ho–Kashyap provides a least-squares solution. == Variants == Modified Ho–Kashyap algorithm changes weight calculation step w ( k + 1 ) = Y + b ( k + 1 ) {\displaystyle \mathbf {w} (k+1)=\mathbf {Y} ^{+}\mathbf {b} (k+1)} to w ( k + 1 ) = w ( k ) + η k Y + | e ( k ) | {\displaystyle \mathbf {w} (k+1)=\mathbf {w} (k)+\eta _{k}\mathbf {Y} ^{+}|\mathbf {e} (k)|} . Kernel Ho–Kashyap algorithm: Applies kernel methods (the "kernel trick") to the Ho–Kashyap framework to enable non-linear classification by implicitly mapping data to a higher-dimensional feature space.

    Read more →
  • Seed (programming)

    Seed (programming)

    Seed is a JavaScript interpreter and a library of the GNOME project to create standalone applications in JavaScript. It uses the JavaScript engine JavaScriptCore of the WebKit project. It is possible to easily create modules in C. Seed is integrated in GNOME since the 2.28 version and is used by two games in the GNOME Games package. It is also used by the Web web browser for the design of its extensions. The module is also officially supported by the GTK+ project. == Hello world in Seed == This example uses the standard output to output the string "Hello, World". == A program using GTK+ == This code shows an empty window named "Example". == Modules == To use a module, just instantiate a class having for name imports. followed by the name of the module respecting the case sensitivity. The modules using GObject Introspection, who starts by imports.gi. : Gtk Gst GObject Gio Clutter GLib Gdk WebKit GdkPixbuf, GdkPixbuf Libxml Cairo DBus MPFR Os (system library) Canvas (using Cairo) multiprocessing readline Archived 2009-11-09 at the Wayback Machine ffi sqlite sandbox Archived 2009-11-09 at the Wayback Machine == List of the Seed versions == The names of the versions of Seed are albums of famous rock bands.

    Read more →
  • Minimum Population Search

    Minimum Population Search

    In evolutionary computation, Minimum Population Search (MPS) is a computational method that optimizes a problem by iteratively trying to improve a set of candidate solutions with regard to a given measure of quality. It solves a problem by evolving a small population of candidate solutions by means of relatively simple arithmetical operations. MPS is a metaheuristic as it makes few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. For problems where finding the precise global optimum is less important than finding an acceptable local optimum in a fixed amount of time, using a metaheuristic such as MPS may be preferable to alternatives such as brute-force search or gradient descent. MPS is used for multidimensional real-valued functions but does not use the gradient of the problem being optimized, which means MPS does not require for the optimization problem to be differentiable as is required by classic optimization methods such as gradient descent and quasi-newton methods. MPS can therefore also be used on optimization problems that are not even continuous, are noisy, change over time, etc. == Background == In a similar way to Differential evolution, MPS uses difference vectors between the members of the population in order to generate new solutions. It attempts to provide an efficient use of function evaluations by maintaining a small population size. If the population size is smaller than the dimensionality of the search space, then the solutions generated through difference vectors will be constrained to the n − 1 {\displaystyle n-1} dimensional hyperplane. A smaller population size will lead to a more restricted subspace. With a population size equal to the dimensionality of the problem ( n = d ) {\displaystyle (n=d)} , the “line/hyperplane points” in MPS will be generated within a d − 1 {\displaystyle d-1} dimensional hyperplane. Taking a step orthogonal to this hyperplane will allow the search process to cover all the dimensions of the search space. Population size is a fundamental parameter in the performance of population-based heuristics. Larger populations promote exploration, but they also allow fewer generations, and this can reduce the chance of convergence. Searching with a small population can increase the chances of convergence and the efficient use of function evaluations, but it can also induce the risk of premature convergence. If the risk of premature convergence can be avoided, then a population-based heuristic could benefit from the efficiency and faster convergence rate of a smaller population. To avoid premature convergence, it is important to have a diversified population. By including techniques for explicitly increasing diversity and exploration, it is possible to have smaller populations with less risk of premature convergence. === Thresheld Convergence === Thresheld Convergence (TC) is a diversification technique which attempts to separate the processes of exploration and exploitation. TC uses a “threshold” function to establish a minimum search step, and managing this step makes it possible to influence the transition from exploration to exploitation, convergence is thus “held” back until the last stages of the search process. The goal of a controlled transition is to avoid an early concentration of the population around a few search regions and avoid the loss of diversity which can cause premature convergence. Thresheld Convergence has been successfully applied to several population-based metaheuristics such as Particle Swarm Optimization, Differential evolution, Evolution strategies, Simulated annealing and Estimation of Distribution Algorithms. The ideal case for Thresheld Convergence is to have one sample solution from each attraction basin, and for each sample solution to have the same relative fitness with respect to its local optimum. Enforcing a minimum step aims to achieve this ideal case. In MPS Thresheld Convergence is specifically used to preserve diversity and avoid premature convergence by establishing a minimum search step. By disallowing new solutions which are too close to members of the current population, TC forces a strong exploration during the early stages of the search while preserving the diversity of the (small) population. == Algorithm == A basic variant of the MPS algorithm works by having a population of size equal to the dimension of the problem. New solutions are generated by exploring the hyperplane defined by the current solutions (by means of difference vectors) and performing an additional orthogonal step in order to avoid getting caught in this hyperplane. The step sizes are controlled by the Thresheld Convergence technique, which gradually reduces step sizes as the search process advances. An outline for the algorithm is given below: Generate the first initial population. Allowing these solutions to lie near the bounds of the search space generally gives good results: s k = ( r s 1 ∗ b o u n d 1 / 2 , r s 2 ∗ b o u n d 2 / 2 , . . . , r s n ∗ b o u n d n / 2 ) {\displaystyle s_{k}=(rs_{1}bound_{1}/2,rs_{2}bound_{2}/2,...,rs_{n}bound_{n}/2)} where s k {\displaystyle s_{k}} is the k {\displaystyle k} -th population member, r s i {\displaystyle rs_{i}} are random numbers which can be −1 or 1, and the b o u n d i {\displaystyle bound_{i}} are the lower and upper bounds on each dimension. While a stop condition is not reached: Update threshold convergence values ( m i n _ s t e p {\displaystyle min\_step} and m a x _ s t e p {\displaystyle max\_step} ) Calculate the centroid of the current population ( x c {\displaystyle x_{c}} ) For each member of the population ( x i {\displaystyle x_{i}} ), generate a new offspring as follows: Uniformly generate a scaling factor ( F i {\displaystyle F_{i}} ) between − m a x _ s t e p {\displaystyle -max\_step} and m a x _ s t e p {\displaystyle max\_step} Generate a vector ( x o {\displaystyle x_{o}} ) orthogonal to the difference vector between x i {\displaystyle x_{i}} and x c {\displaystyle x_{c}} Calculate a scaling factor for the orthogonal vector: m i n _ o r t h = s q r t ( m a x ( m i n _ s t e p 2 − F i 2 , 0 ) ) {\displaystyle min\_orth=sqrt(max(min\_step^{2}-F_{i}^{2},0))} m a x _ o r t h = s q r t ( m a x ( m a x _ s t e p 2 − F i 2 , 0 ) ) {\displaystyle max\_orth=sqrt(max(max\_step^{2}-F_{i}^{2},0))} o r t h _ s t e p = u n i f o r m ( m i n _ o r t h , m a x _ o r t h ) {\displaystyle orth\_step=uniform(min\_orth,max\_orth)} Generate the new solution by adding the difference and the orthogonal vectors to the original solution n e w _ s o l u t i o n = x i + F i ∗ ( x i − x c ) ∗ o r t h _ s t e p ∗ x o {\displaystyle new\_solution=x_{i}+F_{i}(x_{i}-x_{c})orth\_stepx_{o}} Pick the best members between the old population and the new one by discarding the least fit members. Return the single best solution or the best population found as the final result.

    Read more →
  • Multimodal learning

    Multimodal learning

    Multimodal learning is a type of deep learning that integrates and processes multiple types of data, referred to as modalities, such as text, audio, images, or video. This integration allows for a more holistic understanding of complex data, improving model performance in tasks like visual question answering, cross-modal retrieval, text-to-image generation, aesthetic ranking, and image captioning. Multimodal learning was proposed in 2011 at the beginning of the deep learning period. Large multimodal models, such as Google Gemini and GPT-4o, have become increasingly popular since 2023, enabling increased versatility and a broader understanding of real-world phenomena. == Motivation == Data usually comes with different modalities which carry different information. For example, it is very common to caption an image to convey the information not presented in the image itself. Similarly, sometimes it is more straightforward to use an image to describe information which may not be obvious from text. As a result, if different words appear in similar images, then these words likely describe the same thing. Conversely, if a word is used to describe seemingly dissimilar images, then these images may represent the same object. Thus, in cases dealing with multi-modal data, it is important to use a model which is able to jointly represent the information such that the model can capture the combined information from different modalities. == Multimodal transformers == Models such as CLIP (Contrastive Language–Image Pretraining) learn joint representations of images and text by optimizing contrastive objectives, allowing the model to match images with their corresponding textual descriptions. == Multimodal deep Boltzmann machines == A Boltzmann machine is a type of stochastic neural network invented by Geoffrey Hinton and Terry Sejnowski in 1985. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They are named after the Boltzmann distribution in statistical mechanics. The units in Boltzmann machines are divided into two groups: visible units and hidden units. Each unit is like a neuron with a binary output that represents whether it is activated or not. General Boltzmann machines allow connection between any units. However, learning is impractical using general Boltzmann Machines because the computational time is exponential to the size of the machine. A more efficient architecture is called restricted Boltzmann machine where connection is only allowed between hidden unit and visible unit, which is described in the next section. Multimodal deep Boltzmann machines can process and learn from different types of information, such as images and text, simultaneously. This can notably be done by having a separate deep Boltzmann machine for each modality, for example one for images and one for text, joined at an additional top hidden layer. == Applications == Multimodal machine learning has numerous applications across various domains: Cross-modal retrieval: cross-modal retrieval allows users to search for data across different modalities (e.g., retrieving images based on text descriptions), improving multimedia search engines and content recommendation systems. Classification and missing data retrieval: multimodal Deep Boltzmann Machines outperform traditional models like support vector machines and latent Dirichlet allocation in classification tasks and can predict missing data in multimodal datasets, such as images and text. Healthcare diagnostics: multimodal models integrate medical imaging, genomic data, and patient records to improve diagnostic accuracy and early disease detection, especially in cancer screening. Content generation: models like DALL·E generate images from textual descriptions, benefiting creative industries, while cross-modal retrieval enables dynamic multimedia searches. Robotics and human-computer interaction: multimodal learning improves interaction in robotics and AI by integrating sensory inputs like speech, vision, and touch, aiding autonomous systems and human-computer interaction. Emotion recognition: combining visual, audio, and text data, multimodal systems enhance sentiment analysis and emotion recognition, applied in customer service, social media, and marketing.

    Read more →
  • Evolutionary attractor

    Evolutionary attractor

    An evolutionary attractor is a point in an evolutionary space where a selection process will always drive trait values towards that point from the region around it. Because of the importance of evolution through natural selection, often such an evolutionary space will be defined by genetic or phenotypic traits, or possibly both. In this case the selection process will be a form of natural selection. The existence of an evolutionary attractor in a biological evolutionary space does not always imply that it can be reached from all points in that evolutionary space, nor does it identify what will happen when the evolutionary attractor is reached. While an evolutionary attractor may represent a point in evolutionary space that is resistant to further selection, such as an evolutionarily stable strategy, other possibilities are available. Because identification of an evolutionary attractor on its own does not describe everything about the evolutionary space in which it lies, this has led to interest in the evolutionary dynamics surrounding evolutionary attractors and in evolutionary spaces in general. (Theoretical biologists and mathematicians working in the area may prefer the terms adaptive dynamics or evolutionary invasion analysis to evolutionary dynamics.) These fields use differential equations which allows a more complete understanding of the dynamics in evolutionary spaces including the existence or otherwise of evolutionary attractors. Advances in the study of molecular evolution have also led to the identification of evolutionary attractors at a molecular level. Because biological evolutionary processes have been studied using evolutionary game theory, a technique inspired by game theory originally derived to address economic problems, not only can evolutionary attractors be found in biology but economists studying evolutionary economic models have also identified evolutionary attractors. Evolution in biology has also inspired evolutionary computation in computer science. Many algorithms in this field use a form of selection inspired by natural selection to generate results through evolutionary algorithms. This is therefore another area in which evolutionary attractors have been identified. == Evolutionary attractors in biology == It is not probably not surprising that biology is the field where most examples of evolutionary attractors have been identified, given the importance of evolution through natural selection. === Evolutionary attractors in adaptive landscapes === An evolutionary attractor is a point in genetic and/or phenotypic trait space, that evolution will always drive trait values towards via a selection process. The concept of an evolutionary attractor arose in population genetics following the origin of the adaptive landscape originally proposed by Sewall Wright in 1932. The height of a point in an adaptive landscape is a measure of evolutionary fitness. If a point in an adaptive landscape is a peak, then selection will always drive traits towards it and it will be an evolutionary attractor. While population genetics deals with discrete genetic traits, quantitative genetics extended such concepts to deal with continuous genetic traits, where the concept of evolutionary attractor is also valid. === Evolutionary attractors in evolutionary game models === Evolutionary game theory introduced into evolutionary biology concepts originally used in economics, with the advantage that evolution could be studied in relation to strategic choices made in animal conflicts. This is of particular interest because of the concept of the evolutionarily stable strategy or ESS, a strategy that once established is resistant to invasion by other strategies. ESSs will not always be evolutionary attractors, but if they are they will persist over evolutionary time. === Dynamics around evolutionary attractors in biology === Evolutionary attractors in biology do not exist in isolation. By definition they must exist in an evolutionary trait space where selection drives all traits towards them from a region immediately around them. That is, they must be convergence stable. Eshel (1983) modified the definition of an ESS by considering individually advantageous reduction from a majority deviation: he created the term continuous stability. A continuously stable ESS can be shown to be convergence stable, therefore it will act as an evolutionary attractor. But the nature of evolutionary trait spaces in biology means that it is not possible to guarantee that the region of convergence to the evolutionary attractor covers the whole of the trait space, nor that there is only one evolutionary attractor in a particular trait space. These issues have led to the emergence of the related fields of evolutionary dynamics, adaptive dynamics and evolutionary invasion analysis, all of which use differential equations to understand the dynamics in evolutionary trait spaces. Hence, if one or more evolutionary attractor exists in an evolutionary trait space, they provide techniques to understand the dynamics in that trait space around the evolutionary attractor. === Evolutionary attractors in an ecological context === Evolution in biology does not take place in single species in isolation. Ecological interaction of species leads to coevolution. Important examples of this are host-parasite or host-pathogen interaction, which can make both the dynamics around evolutionary attractors more complex, and the occurrence and number of evolutionary attractors more diverse. Evolutionary attractors have been identified in the analysis of evolutionary epidemiology of plant pathogens. In the above study working on plant populations the authors were able to identify evolutionary attractors using methods from adaptive dynamics. A model applied to the analysis of a maize (Zea mays L.) virus identified convergence stable equilibria through simulation modelling. A related model identified evolutionary attractors in the interaction of plants with fungal pathogens. === Evolutionary attractors in molecular genetics === As mentioned above much of the consideration of evolutionary attractors in biology has been through investigation of selection at a genetic or phenotypic level or both, in a single species or in coevolving species. Advances in the study of molecular genetics now allow the study of evolutionary attractors to be taken to a molecular genetic level. Wilson et. al (2019) studied the evolution of gene regulatory networks and identified the emergence of evolutionary attractors. == Evolutionary attractors in economics == Evolutionary game theory as applied in biology was inspired by game theory originally devised for applications in economics. Game theory remains an active field of research outside of biology, and thus it is not surprising that researchers in evolutionary economics use evolutionary game theory. Evolutionary attractors have been demonstrated by economists studying the evolutionary dynamics of market entry with market dynamics based on the replicator dynamics of biological evolutionary games. == Evolutionary attractors in computing == Evolutionary computation is a branch of computer science inspired by biological evolution. Many algorithms in evolutionary computation use a form of selection. Thus evolutionary attractors have been identified in computer science as well as in biology and economics. Evolutionary algorithms have generated evolutionary attractors, probably because of the similarity between adaptive hill-climbing in evolutionary heuristics and the adaptive landscape originated to explain evolution through natural selection.

    Read more →
  • Semantic folding

    Semantic folding

    Semantic folding theory describes a procedure for encoding the semantics of natural language text in a semantically grounded binary representation. This approach provides a framework for modelling how language data is processed by the neocortex. == Theory == Semantic folding theory draws inspiration from Douglas R. Hofstadter's Analogy as the Core of Cognition which suggests that the brain makes sense of the world by identifying and applying analogies. The theory hypothesises that semantic data must therefore be introduced to the neocortex in such a form as to allow the application of a similarity measure and offers, as a solution, the sparse binary vector employing a two-dimensional topographic semantic space as a distributional reference frame. The theory builds on the computational theory of the human cortex known as hierarchical temporal memory (HTM), and positions itself as a complementary theory for the representation of language semantics. A particular strength claimed by this approach is that the resulting binary representation enables complex semantic operations to be performed simply and efficiently at the most basic computational level. == Two-dimensional semantic space == Analogous to the structure of the neocortex, Semantic Folding theory posits the implementation of a semantic space as a two-dimensional grid. This grid is populated by context-vectors in such a way as to place similar context-vectors closer to each other, for instance, by using competitive learning principles. This vector space model is presented in the theory as an equivalence to the well known word space model described in the information retrieval literature. Given a semantic space (implemented as described above) a word-vector can be obtained for any given word Y by employing the following algorithm: For each position X in the semantic map (where X represents cartesian coordinates) if the word Y is contained in the context-vector at position X then add 1 to the corresponding position in the word-vector for Y else add 0 to the corresponding position in the word-vector for Y The result of this process will be a word-vector containing all the contexts in which the word Y appears and will therefore be representative of the semantics of that word in the semantic space. It can be seen that the resulting word-vector is also in a sparse distributed representation (SDR) format [Schütze, 1993] & [Sahlgreen, 2006]. Some properties of word-SDRs that are of particular interest with respect to computational semantics are: high noise resistance: As a result of similar contexts being placed closer together in the underlying map, word-SDRs are highly tolerant of false or shifted "bits". boolean logic: It is possible to manipulate word-SDRs in a meaningful way using boolean (OR, AND, exclusive-OR) and/or arithmetical (SUBtract) functions . sub-sampling: Word-SDRs can be sub-sampled to a high degree without any appreciable loss of semantic information. topological two-dimensional representation: The SDR representation maintains the topological distribution of the underlying map therefore words with similar meanings will have similar word-vectors. This suggests that a variety of measures can be applied to the calculation of semantic similarity, from a simple overlap of vector elements, to a range of distance measures such as: Euclidean distance, Hamming distance, Jaccard distance, cosine similarity, Levenshtein distance, Sørensen-Dice index, etc. == Semantic spaces == Semantic spaces in the natural language domain aim to create representations of natural language that are capable of capturing meaning. The original motivation for semantic spaces stems from two core challenges of natural language: Vocabulary mismatch (the fact that the same meaning can be expressed in many ways) and ambiguity of natural language (the fact that the same term can have several meanings). The application of semantic spaces in natural language processing (NLP) aims at overcoming limitations of rule-based or model-based approaches operating on the keyword level. The main drawback with these approaches is their brittleness, and the large manual effort required to create either rule-based NLP systems or training corpora for model learning. Rule-based and machine learning-based models are fixed on the keyword level and break down if the vocabulary differs from that defined in the rules or from the training material used for the statistical models. Research in semantic spaces dates back more than 20 years. In 1996, two papers were published that raised a lot of attention around the general idea of creating semantic spaces: latent semantic analysis from Microsoft and Hyperspace Analogue to Language from the University of California. However, their adoption was limited by the large computational effort required to construct and use those semantic spaces. A breakthrough with regard to the accuracy of modelling associative relations between words (e.g. "spider-web", "lighter-cigarette", as opposed to synonymous relations such as "whale-dolphin", "astronaut-driver") was achieved by explicit semantic analysis (ESA) in 2007. ESA was a novel (non-machine learning) based approach that represented words in the form of vectors with 100,000 dimensions (where each dimension represents an Article in Wikipedia). However practical applications of the approach are limited due to the large number of required dimensions in the vectors. More recently, advances in neural networking techniques in combination with other new approaches (tensors) led to a host of new recent developments: Word2vec from Google and GloVe from Stanford University. Semantic folding represents a novel, biologically inspired approach to semantic spaces where each word is represented as a sparse binary vector with 16,000 dimensions (a semantic fingerprint) in a 2D semantic map (the semantic universe). Sparse binary representation are advantageous in terms of computational efficiency, and allow for the storage of very large numbers of possible patterns. == Visualization == The topological distribution over a two-dimensional grid (outlined above) lends itself to a bitmap type visualization of the semantics of any word or text, where each active semantic feature can be displayed as e.g. a pixel. As can be seen in the images shown here, this representation allows for a direct visual comparison of the semantics of two (or more) linguistic items. Image 1 clearly demonstrates that the two disparate terms "dog" and "car" have, as expected, very obviously different semantics. Image 2 shows that only one of the meaning contexts of "jaguar", that of "Jaguar" the car, overlaps with the meaning of Porsche (indicating partial similarity). Other meaning contexts of "jaguar" e.g. "jaguar" the animal clearly have different non-overlapping contexts. The visualization of semantic similarity using Semantic Folding bears a strong resemblance to the fMRI images produced in a research study conducted by A.G. Huth et al., where it is claimed that words are grouped in the brain by meaning. voxels, little volume segments of the brain, were found to follow a pattern were semantic information is represented along the boundary of the visual cortex with visual and linguistic categories represented on posterior and anterior side respectively.

    Read more →
  • Sharpness aware minimization

    Sharpness aware minimization

    Sharpness Aware Minimization (SAM) is an optimization algorithm used in machine learning that aims to improve model generalization. The method seeks to find model parameters that are located in regions of the loss landscape with uniformly low loss values, rather than parameters that only achieve a minimal loss value at a single point. This approach is described as finding "flat" minima instead of "sharp" ones. The rationale is that models trained this way are less sensitive to variations between training and test data, which can lead to better performance on unseen data. The algorithm was introduced in a 2020 paper by a team of researchers including Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. == Underlying Principle == SAM modifies the standard training objective by minimizing a "sharpness-aware" loss. This is formulated as a minimax problem where the inner objective seeks to find the highest loss value in the immediate neighborhood of the current model weights, and the outer objective minimizes this value: min w max ‖ ϵ ‖ p ≤ ρ L train ( w + ϵ ) + λ ‖ w ‖ 2 2 {\displaystyle \min _{w}\max _{\|\epsilon \|_{p}\leq \rho }L_{\text{train}}(w+\epsilon )+\lambda \|w\|_{2}^{2}} In this formulation: w {\displaystyle w} represents the model's parameters (weights). L train {\displaystyle L_{\text{train}}} is the loss calculated on the training data. ϵ {\displaystyle \epsilon } is a perturbation applied to the weights. ρ {\displaystyle \rho } is a hyperparameter that defines the radius of the neighborhood (an L p {\displaystyle L_{p}} ball) to search for the highest loss. An optional L2 regularization term, scaled by λ {\displaystyle \lambda } , can be included. A direct solution to the inner maximization problem is computationally expensive. SAM approximates it by taking a single gradient ascent step to find the perturbation ϵ {\displaystyle \epsilon } . This is calculated as: ϵ ( w ) = ρ ∇ L train ( w ) ‖ ∇ L train ( w ) ‖ 2 {\displaystyle \epsilon (w)=\rho {\frac {\nabla L_{\text{train}}(w)}{\|\nabla L_{\text{train}}(w)\|_{2}}}} The optimization process for each training step involves two stages. First, an "ascent step" computes a perturbed set of weights, w adv = w + ϵ ( w ) {\displaystyle w_{\text{adv}}=w+\epsilon (w)} , by moving towards the direction of the highest local loss. Second, a "descent step" updates the original weights w {\displaystyle w} using the gradient calculated at these perturbed weights, ∇ L train ( w adv ) {\displaystyle \nabla L_{\text{train}}(w_{\text{adv}})} . This update is typically performed using a standard optimizer like SGD or Adam. == Application and Performance == SAM has been applied in various machine learning contexts, primarily in computer vision. Research has shown it can improve generalization performance in models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) on image datasets including ImageNet, CIFAR-10, and CIFAR-100. The algorithm has also been found to be effective in training models with noisy labels, where it performs comparably to methods designed specifically for this problem. Some studies indicate that SAM and its variants can improve out-of-distribution (OOD) generalization, which is a model's ability to perform well on data from distributions not seen during training. Other areas where it has been applied include gradual domain adaptation and mitigating overfitting in scenarios with repeated exposure to training examples. == Limitations == A primary limitation of SAM is its computational cost. By requiring two gradient computations (one for the ascent and one for the descent) per optimization step, it approximately doubles the training time compared to standard optimizers. The theoretical convergence properties of SAM are still under investigation. Some research suggests that with a constant step size, SAM may not converge to a stationary point. The accuracy of the single gradient step approximation for finding the worst-case perturbation may also decrease during the training process. The effectiveness of SAM can also be domain-dependent. While it has shown benefits for computer vision tasks, its impact on other areas, such as GPT-style language models where each training example is seen only once, has been reported as limited in some studies. Furthermore, while SAM seeks flat minima, some research suggests that not all flat minima necessarily lead to good generalization. The algorithm also introduces the neighborhood size ρ {\displaystyle \rho } as a new hyperparameter, which requires tuning. == Research, Variants, and Enhancements == Active research on SAM focuses on reducing its computational overhead and improving its performance. Several variants have been proposed to make the algorithm more efficient. These include methods that attempt to parallelize the two gradient computations, apply the perturbation to only a subset of parameters, or reduce the number of computation steps required. Other approaches use historical gradient information or apply SAM steps intermittently to lower the computational burden. To improve performance and robustness, variants have been developed that adapt the neighborhood size based on model parameter scales (Adaptive SAM or ASAM) or incorporate information about the curvature of the loss landscape (Curvature Regularized SAM or CR-SAM). Other research explores refining the perturbation step by focusing on specific components of the gradient or combining SAM with techniques like random smoothing. Theoretical work continues to analyze the algorithm's behavior, including its implicit bias towards flatter minima and the development of broader frameworks for sharpness-aware optimization that use different measures of sharpness.

    Read more →
  • Physical neural network

    Physical neural network

    A physical neural network is a type of artificial neural network in which an electrically adjustable material is used to emulate the function of a neural synapse or a higher-order (dendritic) neuron model. "Physical" neural network is used to emphasize the reliance on physical hardware used to emulate neurons as opposed to software-based approaches. More generally the term is applicable to other artificial neural networks in which a memristor or other electrically adjustable resistance material is used to emulate a neural synapse. == Types of physical neural networks == === ADALINE === In the 1960s Bernard Widrow and Ted Hoff developed ADALINE (Adaptive Linear Neuron) which used electrochemical cells called memistors (memory resistors) to emulate synapses of an artificial neuron. The memistors were implemented as 3-terminal devices operating based on the reversible electroplating of copper such that the resistance between two of the terminals is controlled by the integral of the current applied via the third terminal. The ADALINE circuitry was briefly commercialized by the Memistor Corporation in the 1960s enabling some applications in pattern recognition. However, since the memistors were not fabricated using integrated circuit fabrication techniques the technology was not scalable and was eventually abandoned as solid-state electronics became mature. === Analog VLSI === In 1989 Carver Mead published his book Analog VLSI and Neural Systems, which spun off perhaps the most common variant of analog neural networks. The physical realization is implemented in analog VLSI. This is often implemented as field effect transistors in low inversion. Such devices can be modelled as translinear circuits. This is a technique described by Barrie Gilbert in several papers around mid 1970th, and in particular his Translinear Circuits from 1981. With this method circuits can be analyzed as a set of well-defined functions in steady-state, and such circuits assembled into complex networks. === Physical Neural Network === Alex Nugent describes a physical neural network as one or more nonlinear neuron-like nodes used to sum signals and nanoconnections formed from nanoparticles, nanowires, or nanotubes which determine the signal strength input to the nodes. Alignment or self-assembly of the nanoconnections is determined by the history of the applied electric field performing a function analogous to neural synapses. Numerous applications for such physical neural networks are possible. For example, a temporal summation device can be composed of one or more nanoconnections having an input and an output thereof, wherein an input signal provided to the input causes one or more of the nanoconnection to experience an increase in connection strength thereof over time. Another example of a physical neural network is taught by U.S. Patent No. 7,039,619 entitled "Utilized nanotechnology apparatus using a neural network, a solution and a connection gap," which issued to Alex Nugent by the U.S. Patent & Trademark Office on May 2, 2006. A further application of physical neural network is shown in U.S. Patent No. 7,412,428 entitled "Application of hebbian and anti-hebbian learning to nanotechnology-based physical neural networks," which issued on August 12, 2008. Nugent and Molter have shown that universal computing and general-purpose machine learning are possible from operations available through simple memristive circuits operating the AHaH plasticity rule. More recently, it has been argued that also complex networks of purely memristive circuits can serve as neural networks. === Phase change neural network === In 2002, Stanford Ovshinsky described an analog neural computing medium in which phase-change material has the ability to cumulatively respond to multiple input signals. An electrical alteration of the resistance of the phase change material is used to control the weighting of the input signals. === Memristive neural network === Greg Snider of HP Labs describes a system of cortical computing with memristive nanodevices. The memristors (memory resistors) are implemented by thin film materials in which the resistance is electrically tuned via the transport of ions or oxygen vacancies within the film. DARPA's SyNAPSE project has funded IBM Research and HP Labs, in collaboration with the Boston University Department of Cognitive and Neural Systems (CNS), to develop neuromorphic architectures which may be based on memristive systems. === Protonic artificial synapses === In 2022, researchers reported the development of nanoscale brain-inspired artificial synapses, using the ion proton (H+), for 'analog deep learning'.

    Read more →
  • Genetic representation

    Genetic representation

    In computer programming, genetic representation is a way of presenting solutions/individuals in evolutionary computation methods. The term encompasses both the concrete data structures and data types used to realize the genetic material of the candidate solutions in the form of a genome, and the relationships between search space and problem space. In the simplest case, the search space corresponds to the problem space (direct representation). The choice of problem representation is tied to the choice of genetic operators, both of which have a decisive effect on the efficiency of the optimization. Genetic representation can encode appearance, behavior, physical qualities of individuals. Difference in genetic representations is one of the major criteria drawing a line between known classes of evolutionary computation. Terminology is often analogous with natural genetics. The block of computer memory that represents one candidate solution is called an individual. The data in that block is called a chromosome. Each chromosome consists of genes. The possible values of a particular gene are called alleles. A programmer may represent all the individuals of a population using binary encoding, permutational encoding, encoding by tree, or any one of several other representations. == Representations in some popular evolutionary algorithms == Genetic algorithms (GAs) are typically linear representations; these are often, but not always, binary. Holland's original description of GA used arrays of bits. Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size. This facilitates simple crossover operation. Depending on the application, variable-length representations have also been successfully used and tested in evolutionary algorithms (EA) in general and genetic algorithms in particular, although the implementation of crossover is more complex in this case. Evolution strategy uses linear real-valued representations, e.g., an array of real values. It uses mostly gaussian mutation and blending/averaging crossover. Genetic programming (GP) pioneered tree-like representations and developed genetic operators suitable for such representations. Tree-like representations are used in GP to represent and evolve functional programs with desired properties. Human-based genetic algorithm (HBGA) offers a way to avoid solving hard representation problems by outsourcing all genetic operators to outside agents, in this case, humans. The algorithm has no need for knowledge of a particular fixed genetic representation as long as there are enough external agents capable of handling those representations, allowing for free-form and evolving genetic representations. === Common genetic representations === binary array integer or real-valued array binary tree natural language parse tree directed graph == Distinction between search space and problem space == Analogous to biology, EAs distinguish between problem space (corresponds to phenotype) and search space (corresponds to genotype). The problem space contains concrete solutions to the problem being addressed, while the search space contains the encoded solutions. The mapping from search space to problem space is called genotype-phenotype mapping. The genetic operators are applied to elements of the search space, and for evaluation, elements of the search space are mapped to elements of the problem space via genotype-phenotype mapping. == Relationships between search space and problem space == The importance of an appropriate choice of search space for the success of an EA application was recognized early on. The following requirements can be placed on a suitable search space and thus on a suitable genotype-phenotype mapping: === Completeness === All possible admissible solutions must be contained in the search space. === Redundancy === When more possible genotypes exist than phenotypes, the genetic representation of the EA is called redundant. In nature, this is termed a degenerate genetic code. In the case of a redundant representation, neutral mutations are possible. These are mutations that change the genotype but do not affect the phenotype. Thus, depending on the use of the genetic operators, there may be phenotypically unchanged offspring, which can lead to unnecessary fitness determinations, among other things. Since the evaluation in real-world applications usually accounts for the lion's share of the computation time, it can slow down the optimization process. In addition, this can cause the population to have higher genotypic diversity than phenotypic diversity, which can also hinder evolutionary progress. In biology, the Neutral Theory of Molecular Evolution states that this effect plays a dominant role in natural evolution. This has motivated researchers in the EA community to examine whether neutral mutations can improve EA functioning by giving populations that have converged to a local optimum a way to escape that local optimum through genetic drift. This is discussed controversially and there are no conclusive results on neutrality in EAs. On the other hand, there are other proven measures to handle premature convergence. === Locality === The locality of a genetic representation corresponds to the degree to which distances in the search space are preserved in the problem space after genotype-phenotype mapping. That is, a representation has a high locality exactly when neighbors in the search space are also neighbors in the problem space. In order for successful schemata not to be destroyed by genotype-phenotype mapping after a minor mutation, the locality of a representation must be high. === Scaling === In genotype-phenotype mapping, the elements of the genotype can be scaled (weighted) differently. The simplest case is uniform scaling: all elements of the genotype are equally weighted in the phenotype. A common scaling is exponential. If integers are binary coded, the individual digits of the resulting binary number have exponentially different weights in representing the phenotype. Example: The number 90 is written in binary (i.e., in base two) as 1011010. If now one of the front digits is changed in the binary notation, this has a significantly greater effect on the coded number than any changes at the rear digits (the selection pressure has an exponentially greater effect on the front digits). For this reason, exponential scaling has the effect of randomly fixing the "posterior" locations in the genotype before the population gets close enough to the optimum to adjust for these subtleties. == Hybridization and repair in genotype-phenotype mapping == When mapping the genotype to the phenotype being evaluated, domain-specific knowledge can be used to improve the phenotype and/or ensure that constraints are met. This is a commonly used method to improve EA performance in terms of runtime and solution quality. It is illustrated below by two of the three examples. == Examples == === Example of a direct representation === An obvious and commonly used encoding for the traveling salesman problem and related tasks is to number the cities to be visited consecutively and store them as integers in the chromosome. The genetic operators must be suitably adapted so that they only change the order of the cities (genes) and do not cause deletions or duplications. Thus, the gene order corresponds to the city order and there is a simple one-to-one mapping. === Example of a complex genotype-phenotype mapping === In a scheduling task with heterogeneous and partially alternative resources to be assigned to a set of subtasks, the genome must contain all necessary information for the individual scheduling operations or it must be possible to derive them from it. In addition to the order of the subtasks to be executed, this includes information about the resource selection. A phenotype then consists of a list of subtasks with their start times and assigned resources. In order to be able to create this, as many allocation matrices must be created as resources can be allocated to one subtask at most. In the simplest case this is one resource, e.g., one machine, which can perform the subtask. An allocation matrix is a two-dimensional matrix, with one dimension being the available time units and the other being the resources to be allocated. Empty matrix cells indicate availability, while an entry indicates the number of the assigned subtask. The creation of allocation matrices ensures firstly that there are no inadmissible multiple allocations. Secondly, the start times of the subtasks can be read from it as well as the assigned resources. A common constraint when scheduling resources to subtasks is that a resource can only be allocated once per time unit and that the reservation must be for a contiguous period of time. To achieve this in a timely manner, which is a c

    Read more →
  • Materialized view

    Materialized view

    In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely, or may be a subset of the rows and/or columns of a table or join result, or may be a summary using an aggregate function. The process of setting up a materialized view is sometimes called materialization. This is a form of caching the results of a query, similar to memoization of the value of a function in functional languages, and it is sometimes described as a form of precomputation. As with other forms of precomputation, database users typically use materialized views for performance reasons, i.e. as a form of optimization. Materialized views that store data based on remote tables were also known as snapshots (deprecated Oracle terminology). In any database management system following the relational model, a view is a virtual table representing the result of a database query. Whenever a query or an update addresses an ordinary view's virtual table, the DBMS converts these into queries or updates against the underlying base tables. A materialized view takes a different approach: the query result is cached as a concrete ("materialized") table (rather than a view as such) that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of extra storage and of some data being potentially out-of-date. Materialized views find use especially in data warehousing scenarios, where frequent queries of the actual base tables can be expensive. In a materialized view, indexes can be built on any column. In contrast, in a normal view, it's typically only possible to exploit indexes on columns that come directly from (or have a mapping to) indexed columns in the base tables; often this functionality is not offered at all. == Implementations == === Oracle === Materialized views were implemented first by the Oracle Database: the Query rewrite feature was added from version 8i. Example syntax to create a materialized view in Oracle: === PostgreSQL === In PostgreSQL, version 9.3 and newer natively support materialized views. In version 9.3, a materialized view is not auto-refreshed, and is populated only at time of creation (unless WITH NO DATA is used). It may be refreshed later manually using REFRESH MATERIALIZED VIEW. In version 9.4, the refresh may be concurrent with selects on the materialized view if CONCURRENTLY is used. Example syntax to create a materialized view in PostgreSQL: === SQL Server === Microsoft SQL Server differs from other RDBMS by the way of implementing materialized view via a concept known as "Indexed Views". The main difference is that such views do not require a refresh because they are in fact always synchronized to the original data of the tables that compound the view. To achieve this, it is necessary that the lines of origin and destination are "deterministic" in their mapping, which limits the types of possible queries to do this. This mechanism has been realised since the 2000 version of SQL Server. Example syntax to create a materialized view in SQL Server: === Stream processing frameworks === Apache Kafka (since v0.10.2), Apache Spark (since v2.0), Apache Flink, Kinetica DB, Materialize, RisingWave, and Epsio all support materialized views on streams of data. === Others === Materialized views are also supported in Sybase SQL Anywhere. In IBM Db2, they are called "materialized query tables". ClickHouse supports materialized views that automatically refresh on merges. MySQL doesn't support materialized views natively, but workarounds can be implemented by using triggers or stored procedures or by using the open-source application Flexviews. Materialized views can be implemented in Amazon DynamoDB using data modification events captured by DynamoDB Streams. Google announced in 8 April 2020 the availability of materialized views for BigQuery as a beta release.

    Read more →
  • Multilayer perceptron

    Multilayer perceptron

    In deep learning, a multilayer perceptron (MLP) is a kind of modern feedforward neural network consisting of fully connected neurons with nonlinear activation functions, organized in layers, notable for being able to distinguish data that is not linearly separable. Modern neural networks are trained using backpropagation and are colloquially referred to as "vanilla" networks. MLPs grew out of an effort to improve on single-layer perceptrons, which could only be applied to linearly separable data. A perceptron traditionally used a Heaviside step function as its nonlinear activation function. However, the backpropagation algorithm requires that modern MLPs use continuous activation functions such as sigmoid or ReLU. Multilayer perceptrons form the basis of deep learning, and are applicable across a vast set of diverse domains. == Timeline == In 1943, Warren McCulloch and Walter Pitts proposed the binary artificial neuron as a logical model of biological neural networks. In 1958, Frank Rosenblatt proposed the multilayered perceptron model, consisting of an input layer, a hidden layer with randomized weights that did not learn, and an output layer with learnable connections. In 1962, Rosenblatt published many variants and experiments on perceptrons in his book Principles of Neurodynamics, including up to 2 trainable layers by "back-propagating errors". However, it was not the backpropagation algorithm, and he did not have a general method for training multiple layers. In 1965, Alexey Grigorevich Ivakhnenko and Valentin Lapa published Group Method of Data Handling. It was one of the first deep learning methods, used to train an eight-layer neural net in 1971. In 1967, Shun'ichi Amari reported the first multilayered neural network trained by stochastic gradient descent, was able to classify non-linearily separable pattern classes. Amari's student Saito conducted the computer experiments, using a five-layered feedforward network with two learning layers. Backpropagation was independently developed multiple times in early 1970s. The earliest published instance was Seppo Linnainmaa's master thesis (1970). Paul Werbos developed it independently in 1971, but had difficulty publishing it until 1982. In 1986, David E. Rumelhart et al. popularized backpropagation. In 2003, interest in backpropagation networks returned due to the successes of deep learning being applied to language modelling by Yoshua Bengio with co-authors. In 2021, a very simple NN architecture combining two deep MLPs with skip connections and layer normalizations was designed and called MLP-Mixer; its realizations featuring 19 to 431 millions of parameters were shown to be comparable to vision transformers of similar size on ImageNet and similar image classification tasks. == Mathematical foundations == === Activation function === If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to the output of each neuron, then linear algebra shows that any number of layers can be reduced to a two-layer input-output model. In MLPs some neurons use a nonlinear activation function that was developed to model the frequency of action potentials, or firing, of biological neurons. The two historically common activation functions are both sigmoids, and are described by y ( v i ) = tanh ⁡ ( v i ) and y ( v i ) = ( 1 + e − v i ) − 1 {\displaystyle y(v_{i})=\tanh(v_{i})~~{\textrm {and}}~~y(v_{i})=(1+e^{-v_{i}})^{-1}} . The first is a hyperbolic tangent that ranges from −1 to 1, while the other is the logistic function, which is similar in shape but ranges from 0 to 1. Here y i {\displaystyle y_{i}} is the output of the i {\displaystyle i} th node (neuron) and v i {\displaystyle v_{i}} is the weighted sum of the input connections. Alternative activation functions have been proposed, including the rectifier and softplus functions. More specialized activation functions include radial basis functions (used in radial basis networks, another class of supervised neural network models). In recent developments of deep learning the rectified linear unit (ReLU) is more frequently used as one of the possible ways to overcome the numerical problems related to the sigmoids. === Layers === The MLP consists of three or more layers (an input and an output layer with one or more hidden layers) of nonlinearly-activating nodes. Since MLPs are fully connected, each node in one layer connects with a certain weight w i j {\displaystyle w_{ij}} to every node in the following layer. === Learning === Learning occurs in the perceptron by changing connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result. This is an example of supervised learning, and is carried out through backpropagation, a generalization of the least mean squares algorithm in the linear perceptron. We can represent the degree of error in an output node j {\displaystyle j} in the n {\displaystyle n} th data point (training example) by e j ( n ) = d j ( n ) − y j ( n ) {\displaystyle e_{j}(n)=d_{j}(n)-y_{j}(n)} , where d j ( n ) {\displaystyle d_{j}(n)} is the desired target value for n {\displaystyle n} th data point at node j {\displaystyle j} , and y j ( n ) {\displaystyle y_{j}(n)} is the value produced by the perceptron at node j {\displaystyle j} when the n {\displaystyle n} th data point is given as an input. The node weights can then be adjusted based on corrections that minimize the error in the entire output for the n {\displaystyle n} th data point, given by E ( n ) = 1 2 ∑ output node j e j 2 ( n ) {\displaystyle {\mathcal {E}}(n)={\frac {1}{2}}\sum _{{\text{output node }}j}e_{j}^{2}(n)} . Using gradient descent, the change in each weight w i j {\displaystyle w_{ij}} is Δ w j i ( n ) = − η ∂ E ( n ) ∂ v j ( n ) y i ( n ) {\displaystyle \Delta w_{ji}(n)=-\eta {\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}y_{i}(n)} where y i ( n ) {\displaystyle y_{i}(n)} is the output of the previous neuron i {\displaystyle i} , and η {\displaystyle \eta } is the learning rate, which is selected to ensure that the weights quickly converge to a response, without oscillations. In the previous expression, ∂ E ( n ) ∂ v j ( n ) {\displaystyle {\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}} denotes the partial derivate of the error E ( n ) {\displaystyle {\mathcal {E}}(n)} according to the weighted sum v j ( n ) {\displaystyle v_{j}(n)} of the input connections of neuron i {\displaystyle i} . The derivative to be calculated depends on the induced local field v j {\displaystyle v_{j}} , which itself varies. It is easy to prove that for an output node this derivative can be simplified to − ∂ E ( n ) ∂ v j ( n ) = e j ( n ) ϕ ′ ( v j ( n ) ) {\displaystyle -{\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}=e_{j}(n)\phi ^{\prime }(v_{j}(n))} where ϕ ′ {\displaystyle \phi ^{\prime }} is the derivative of the activation function described above, which itself does not vary. The analysis is more difficult for the change in weights to a hidden node, but it can be shown that the relevant derivative is − ∂ E ( n ) ∂ v j ( n ) = ϕ ′ ( v j ( n ) ) ∑ k − ∂ E ( n ) ∂ v k ( n ) w k j ( n ) {\displaystyle -{\frac {\partial {\mathcal {E}}(n)}{\partial v_{j}(n)}}=\phi ^{\prime }(v_{j}(n))\sum _{k}-{\frac {\partial {\mathcal {E}}(n)}{\partial v_{k}(n)}}w_{kj}(n)} . This depends on the change in weights of the k {\displaystyle k} th nodes, which represent the output layer. So to change the hidden layer weights, the output layer weights change according to the derivative of the activation function, and so this algorithm represents a backpropagation of the activation function.

    Read more →
  • Self-organizing map

    Self-organizing map

    A self-organizing map (SOM) or self-organizing feature map (SOFM) is an unsupervised machine learning technique used to produce a low-dimensional (typically two-dimensional) representation of a higher-dimensional data set while preserving the topological structure of the data. For example, a data set with p {\displaystyle p} variables measured in n {\displaystyle n} observations could be represented as clusters of observations with similar values for the variables. These clusters then could be visualized as a two-dimensional "map" such that observations in proximal clusters have more similar values than observations in distal clusters. This can make high-dimensional data easier to visualize and analyze. A SOM is a type of artificial neural network but is trained using competitive learning rather than the error-correction learning (e.g., backpropagation with gradient descent) used by other artificial neural networks. The SOM was introduced by the Finnish professor Teuvo Kohonen in the 1980s and therefore is sometimes called a Kohonen map or Kohonen network. The Kohonen map or network is a computationally convenient abstraction building on biological models of neural systems from the 1970s and morphogenesis models dating back to Alan Turing in the 1950s. SOMs create internal representations reminiscent of the cortical homunculus, a distorted representation of the human body, based on a neurological "map" of the areas and proportions of the human brain dedicated to processing sensory functions, for different parts of the body. == Overview == Self-organizing maps, like most artificial neural networks, operate in two modes: training and mapping. First, training uses an input data set (the "input space") to generate a lower-dimensional representation of the input data (the "map space"). Second, mapping classifies additional input data using the generated map. The goal of training is to represent an input space with p dimensions as a map space with n dimensions, where p > n. Specifically, an input space with p variables is said to have p dimensions. A map space consists of components called "nodes" or "neurons", which are arranged as a hexagonal or rectangular grid with two dimensions. The number of nodes and their arrangement are specified beforehand based on the larger goals of the analysis and exploration of the data. Each node in the map space is associated with a "weight" vector, which is the position of the node in the input space. While nodes in the map space stay fixed, training consists in moving weight vectors toward the input data (reducing a distance metric such as Euclidean distance) without spoiling the topology induced from the map space. After training, the map can be used to classify additional observations for the input space by finding the node with the closest weight vector (smallest distance metric) to the input space vector. == Learning algorithm == The goal of learning in the self-organizing map is to cause different parts of the network to respond similarly to certain input patterns. This is partly motivated by how visual, auditory or other sensory information is handled in separate parts of the cerebral cortex in the human brain. The weights of the neurons are initialized either to small random values or sampled evenly from the subspace spanned by the two largest principal component eigenvectors. With the latter alternative, learning is much faster because the initial weights already give a good approximation of SOM weights. The network must be fed a large number of example vectors that represent, as close as possible, the kinds of vectors expected during mapping. The examples are usually administered several times as iterations. The training utilizes competitive learning. When a training example is fed to the network, its Euclidean distance to all weight vectors is computed. The neuron whose weight vector is most similar to the input is called the best matching unit (BMU). The weights of the BMU and neurons close to it in the SOM grid are adjusted towards the input vector. The magnitude of the change decreases with time and with the grid-distance from the BMU. The update formula for a neuron v with weight vector Wv(s) is W v ( s + 1 ) = W v ( s ) + θ ( u , v , s ) ⋅ α ( s ) ⋅ ( D ( t ) − W v ( s ) ) {\displaystyle W_{v}(s+1)=W_{v}(s)+\theta (u,v,s)\cdot \alpha (s)\cdot (D(t)-W_{v}(s))} , where s is the step index, t is an index into the training sample, u is the index of the BMU for the input vector D(t), α(s) is a monotonically decreasing learning coefficient; θ(u, v, s) is the neighborhood function which gives the distance between the neuron u and the neuron v in step s. Depending on the implementations, t can scan the training data set systematically (t is 0, 1, 2...T-1, then repeat, T being the training sample's size), be randomly drawn from the data set (bootstrap sampling), or implement some other sampling method (such as jackknifing). The neighborhood function θ(u, v, s) (also called function of lateral interaction) depends on the grid-distance between the BMU (neuron u) and neuron v. In the simplest form, it is 1 for all neurons close enough to BMU and 0 for others, but the Gaussian and Mexican-hat functions are common choices, too. Regardless of the functional form, the neighborhood function shrinks with time. At the beginning when the neighborhood is broad, the self-organizing takes place on the global scale. When the neighborhood has shrunk to just a couple of neurons, the weights are converging to local estimates. In some implementations, the learning coefficient α and the neighborhood function θ decrease steadily with increasing s, in others (in particular those where t scans the training data set) they decrease in step-wise fashion, once every T steps. This process is repeated for each input vector for a (usually large) number of cycles λ. The network winds up associating output nodes with groups or patterns in the input data set. If these patterns can be named, the names can be attached to the associated nodes in the trained net. During mapping, there will be one single winning neuron: the neuron whose weight vector lies closest to the input vector. This can be simply determined by calculating the Euclidean distance between input vector and weight vector. While representing input data as vectors has been emphasized in this article, any kind of object which can be represented digitally, which has an appropriate distance measure associated with it, and in which the necessary operations for training are possible can be used to construct a self-organizing map. This includes matrices, continuous functions or even other self-organizing maps. === Algorithm === Randomize the node weight vectors in a map For s = 0 , 1 , 2 , . . . , λ {\displaystyle s=0,1,2,...,\lambda } Randomly pick an input vector D ( t ) {\displaystyle {D}(t)} Find the node in the map closest to the input vector. This node is the best matching unit (BMU). Denote it by u {\displaystyle u} For each node v {\displaystyle v} , update its vector by pulling it closer to the input vector: W v ( s + 1 ) = W v ( s ) + θ ( u , v , s ) ⋅ α ( s ) ⋅ ( D ( t ) − W v ( s ) ) {\displaystyle W_{v}(s+1)=W_{v}(s)+\theta (u,v,s)\cdot \alpha (s)\cdot (D(t)-W_{v}(s))} The variable names mean the following, with vectors in bold, s {\displaystyle s} is the current iteration λ {\displaystyle \lambda } is the iteration limit t {\displaystyle t} is the index of the target input data vector in the input data set D {\displaystyle \mathbf {D} } D ( t ) {\displaystyle {D}(t)} is a target input data vector v {\displaystyle v} is the index of the node in the map W v {\displaystyle \mathbf {W} _{v}} is the current weight vector of node v {\displaystyle v} u {\displaystyle u} is the index of the best matching unit (BMU) in the map θ ( u , v , s ) {\displaystyle \theta (u,v,s)} is the neighbourhood function, α ( s ) {\displaystyle \alpha (s)} is the learning rate schedule. The key design choices are the shape of the SOM, the neighbourhood function, and the learning rate schedule. The idea of the neighborhood function is to make it such that the BMU is updated the most, its immediate neighbors are updated a little less, and so on. The idea of the learning rate schedule is to make it so that the map updates are large at the start, and gradually stop updating. For example, if we want to learn a SOM using a square grid, we can index it using ( i , j ) {\displaystyle (i,j)} where both i , j ∈ 1 : N {\displaystyle i,j\in 1:N} . The neighborhood function can make it so that the BMU updates in full, the nearest neighbors update in half, and their neighbors update in half again, etc. θ ( ( i , j ) , ( i ′ , j ′ ) , s ) = 1 2 | i − i ′ | + | j − j ′ | = { 1 if i = i ′ , j = j ′ 1 / 2 if | i − i ′ | + | j − j ′ | = 1 1 / 4 if | i − i ′ | + | j − j ′ | = 2 ⋯ ⋯ {\displaystyle \theta ((i,j),(i',j'),s)={\frac {1}{2^{|i-i'|+|j-j'|}}}={\begin{cases}1&{\text{if }}i=i',j=j'\\1/2&{\text{if

    Read more →