Astrostatistics

Astrostatistics

Astrostatistics is a discipline which spans astrophysics, statistical analysis and data mining. It is used to process the vast amount of data produced by automated scanning of the cosmos, to characterize complex datasets, and to link astronomical data to astrophysical theory. Many branches of statistics are involved in astronomical analysis including nonparametrics, multivariate regression and multivariate classification, time series analysis, and especially Bayesian inference. The field is closely related to astroinformatics.

Easyrec

easyrec is an open-source program that provides personalized recommendations using RESTful Web services to be integrated into Web enabled applications. It is distributed under the GNU General Public License by the Studio Smart Agent Technologies and hosted at SourceForge. It is written in Java, uses a MySQL database and comes with an administration tool. == History == The development of easyrec, an implementation of the Adaptive Personalization approach, started in the course of several research and development projects conducted by the Studio Smart Agent Technologies in close cooperation with international companies. During the year of 2008 the core functionality of easyrec was developed forming the basis of research prototypes focusing on the music domain (e.g. MusicExplorer). In June 2009 a beta version of easyrec, containing basic administration features, was integrated into a movie streaming portal for evaluation purposes. Furthermore, in September 2009 easyrec was awarded a special recognition in the category “Award for Innovations – IT Innovations for an economic upswing” by the jury of the Austrian state prize for multimedia and e-business. After a comprehensive refactoring phase and the integration of the evaluation results easyrec was published on SourceForge on 18 February 2010. In course of the CeBIT tradeshow 2011 in Hanover easyrec has been awarded the German “INNOVATIONSPREIS-IT 2011”. == Principles == The following five primary goals guided the development of easyrec. It should be a ready-to-use application, not another algorithmic framework It should be easy to use, concerning installation, integration and administration It should be robust and scalable for serving real world applications It should be free of charge, so that anyone can profit from personalization features It should rely on a community-driven development == Uses == Although easyrec is a domain-agnostic, general purpose personalization system, the current Web service API is customized for providing online shops with item recommendations. Especially for small and medium enterprises, easyrec provides a low barrier entrance to personalization. == Features == A major feature of easyrec is a set of usage statistics and other business relevant information presented via an administration and management interface. Furthermore, the easyrec administrator is supported by a variety of administration and configuration functions including the manual import or adaptation of business rules. Integrators or developers benefit from the lightweight Web service APIs (REST and SOAP) as well as from the guided installation wizard. Concerning personalization functionality easyrec is providing the following services unpersonalized recommendations of the form "other users also bought/viewed/...", etc. personalized recommendation depending on individual preferences rankings such as "most bought items", "most viewed...", etc. Additionally, as an integration showcase, a MediaWiki extension was developed and is bundled with the application. Currently additional features like further recommender algorithms and a plugin-system are evaluated and prepared for integration into the easyrec system. == Architecture == The underlying architecture of easyrec is designed to be robust and scalable—separating time-consuming computations from the task of online assembling of recommendations. easyrec is designed as a multi-layer system consisting of a database layer as storage of user actions and pre-calculated business rules an application layer for hosting online and offline recommendation services and an API layer for various Web service interfaces. Moreover, the generator server contains different item association generators which create business rules that define a relation between two items.

Fitness function

A fitness function is a particular type of objective or cost function that is used to summarize, as a single figure of merit, how close a given candidate solution is to achieving the set aims. It is an important component of evolutionary algorithms (EA), such as genetic programming, evolution strategies or genetic algorithms. An EA is a metaheuristic that reproduces the basic principles of biological evolution as a computer algorithm in order to solve challenging optimization or planning tasks, at least approximately. For this purpose, many candidate solutions are generated, which are evaluated using a fitness function in order to guide the evolutionary development towards the desired goal. Similar quality functions are also used in other metaheuristics, such as ant colony optimization or particle swarm optimization. In the field of EAs, each candidate solution, also called an individual, is commonly represented as a string of numbers (referred to as a chromosome). After each round of testing or simulation the idea is to delete the n worst individuals, and to breed n new ones from the best solutions. Each individual must therefore to be assigned a quality number indicating how close it has come to the overall specification, and this is generated by applying the fitness function to the test or simulation results obtained from that candidate solution. Two main classes of fitness functions exist: one where the fitness function does not change, as in optimizing a fixed function or testing with a fixed set of test cases; and one where the fitness function is mutable, as in niche differentiation or co-evolving the set of test cases. Another way of looking at fitness functions is in terms of a fitness landscape, which shows the fitness for each possible chromosome. In the following, it is assumed that the fitness is determined based on an evaluation that remains unchanged during an optimization run. A fitness function does not necessarily have to be able to calculate an absolute value, as it is sometimes sufficient to compare candidates in order to select the better one. A relative indication of fitness (candidate a is better than b) is sufficient in some cases, such as tournament selection or Pareto optimization. == Requirements of evaluation and fitness function == The quality of the evaluation and calculation of a fitness function is fundamental to the success of an EA optimisation. It implements Darwin's principle of "survival of the fittest". Without fitness-based selection mechanisms for mate selection and offspring acceptance, EA search would be blind and hardly distinguishable from the Monte Carlo method. When setting up a fitness function, one must always be aware that it is about more than just describing the desired target state. Rather, the evolutionary search on the way to the optimum should also be supported as much as possible (see also section on auxiliary objectives), if and insofar as this is not already done by the fitness function alone. If the fitness function is designed badly, the algorithm will either converge on an inappropriate solution, or will have difficulty converging at all. Definition of the fitness function is not straightforward in many cases and often is performed iteratively if the fittest solutions produced by an EA is not what is desired. Interactive genetic algorithms address this difficulty by outsourcing evaluation to external agents which are normally humans. == Computational efficiency == The fitness function should not only closely align with the designer's goal, but also be computationally efficient. Execution speed is crucial, as a typical evolutionary algorithm must be iterated many times in order to produce a usable result for a non-trivial problem. Fitness approximation may be appropriate, especially in the following cases: Fitness computation time of a single solution is extremely high Precise model for fitness computation is missing The fitness function is uncertain or noisy. Alternatively or also in addition to the fitness approximation, the fitness calculations can also be distributed to a parallel computer in order to reduce the execution times. Depending on the population model of the EA used, both the EA itself and the fitness calculations of all offspring of one generation can be executed in parallel. == Multi-objective optimization == Practical applications usually aim at optimizing multiple and at least partially conflicting objectives. Two fundamentally different approaches are often used for this purpose, Pareto optimization and optimization based on fitness calculated using the weighted sum. === Weighted sum and penalty functions === When optimizing with the weighted sum, the single values of the O {\displaystyle O} objectives are first normalized so that they can be compared. This can be done with the help of costs or by specifying target values and determining the current value as the degree of fulfillment. Costs or degrees of fulfillment can then be compared with each other and, if required, can also be mapped to a uniform fitness scale. Without loss of generality, fitness is assumed to represent a value to be maximized. Each objective o i {\displaystyle o_{i}} is assigned a weight w i {\displaystyle w_{i}} in the form of a percentage value so that the overall raw fitness f r a w {\displaystyle f_{raw}} can be calculated as a weighted sum: f r a w = ∑ i = 1 O o i ⋅ w i w i t h ∑ i = 1 O w i = 1 {\displaystyle f_{raw}=\sum _{i=1}^{O}{o_{i}\cdot w_{i}}\quad {\mathsf {with}}\quad \sum _{i=1}^{O}{w_{i}}=1} A violation of R {\displaystyle R} restrictions r j {\displaystyle r_{j}} can be included in the fitness determined in this way in the form of penalty functions. For this purpose, a function p f j ( r j ) {\displaystyle pf_{j}(r_{j})} can be defined for each restriction which returns a value between 0 {\displaystyle 0} and 1 {\displaystyle 1} depending on the degree of violation, with the result being 1 {\displaystyle 1} if there is no violation. The previously determined raw fitness is multiplied by the penalty function(s) and the result is then the final fitness f f i n a l {\displaystyle f_{final}} : f f i n a l = f r a w ⋅ ∏ j = 1 R p f j ( r j ) = ∑ i = 1 O ( o i ⋅ w i ) ⋅ ∏ j = 1 R p f j ( r j ) {\displaystyle f_{final}=f_{raw}\cdot \prod _{j=1}^{R}{pf_{j}(r_{j})}=\sum _{i=1}^{O}{(o_{i}\cdot w_{i})}\cdot \prod _{j=1}^{R}{pf_{j}(r_{j})}} This approach is simple and has the advantage of being able to combine any number of objectives and restrictions. The disadvantage is that different objectives can compensate each other and that the weights have to be defined before the optimization. This means that the compromise lines must be defined before optimization, which is why optimization with the weighted sum is also referred to as the a priori method. In addition, certain solutions may not be obtained, see the section on the comparison of both types of optimization. === Pareto optimization === A solution is called Pareto-optimal if the improvement of one objective is only possible with a deterioration of at least one other objective. The set of all Pareto-optimal solutions, also called Pareto set, represents the set of all optimal compromises between the objectives. The figure below on the right shows an example of the Pareto set of two objectives f 1 {\displaystyle f_{1}} and f 2 {\displaystyle f_{2}} to be maximized. The elements of the set form the Pareto front (green line). From this set, a human decision maker must subsequently select the desired compromise solution. Constraints are included in Pareto optimization in that solutions without constraint violations are per se better than those with violations. If two solutions to be compared each have constraint violations, the respective extent of the violations decides. It was recognized early on that EAs with their simultaneously considered solution set are well suited to finding solutions in one run that cover the Pareto front sufficiently well. They are therefore well suited as a-posteriori methods for multi-objective optimization, in which the final decision is made by a human decision maker after optimization and determination of the Pareto front. Besides the SPEA2, the NSGA-II and NSGA-III have established themselves as standard methods. The advantage of Pareto optimization is that, in contrast to the weighted sum, it provides all alternatives that are equivalent in terms of the objectives as an overall solution. The disadvantage is that a visualization of the alternatives becomes problematic or even impossible from four objectives on. Furthermore, the effort increases exponentially with the number of objectives. If there are more than three or four objectives, some have to be combined using the weighted sum or other aggregation methods. === Comparison of both types of assessment === With the help of the weighted sum, the total Pareto front can be obtained by a suitable choice of weights, provided that it is convex

Discrete diffusion model

In machine learning, discrete diffusion models are a class of diffusion models, which themselves are a class of latent variable generative models. Each discrete diffusion model consists of two major components: the forward jump diffusion process, and the reverse jump diffusion process. The goal of diffusion modeling is, given a given dataset and a forward process, to learn a model for the reverse process, such that the reverse process can generate new elements that are distributed similarly as the original dataset. A trained discrete diffusion model can be sampled in many ways, which trades off computational efficiency and sample quality. In general, higher quality data can be obtained, but at the price of higher computational cost. In standard diffusion modeling, the diffusion process takes place over a state space that is continuous space of R n {\displaystyle \mathbb {R} ^{n}} , but over a discrete set S {\displaystyle S} . A discrete set is simply a set where one cannot speak of "infinitesimally close" points. Points can be more or less separated from each other, but the separation is always a finite number. This in particular means the standard framework of continuous diffusion does not apply, since it uses gaussian noise, which is continuous. Nevertheless, an analogous theory can be produced. Discrete diffusion is usually used for language modeling. In practice, the state space S {\displaystyle S} is not only discrete, but finite, so this is what we will assume from now on. == Continuous time Markov process == In the case of continuous state space, during the forward discrete diffusion process, at each step t → t + d t {\displaystyle t\to t+dt} , we mix in an infinitesimal amount of gaussian noise d x t = − 1 2 β ( t ) x t d t + β ( t ) d W t {\displaystyle dx_{t}=-{\frac {1}{2}}\beta (t)x_{t}dt+{\sqrt {\beta (t)}}dW_{t}} . This changes the probability density function, by first a convolution with the density of a gaussian, followed by a scaling. In the case of discrete state space, the gaussian noise must be replaced by a noise that takes values over a finite set. For example, if the noise is the uniform distribution over S {\displaystyle S} , then the probability distribution at time t + d t {\displaystyle t+dt} satisfies q t + d t ( x ) = ( 1 − d t ) q t ( x ) + d t ( 1 | S | ∑ y ∈ S q t ( y ) ) {\displaystyle q_{t+dt}(x)=(1-dt)q_{t}(x)+dt\left({\frac {1}{|S|}}\sum _{y\in S}q_{t}(y)\right)} More succinctly, ∂ t q t ( x ) = − ( 1 − 1 | S | ) q t ( x ) + ∑ y ∈ S , y ≠ x 1 | S | q t ( y ) {\displaystyle \partial _{t}q_{t}(x)=-\left(1-{\frac {1}{|S|}}\right)q_{t}(x)+\sum _{y\in S,y\neq x}{\frac {1}{|S|}}q_{t}(y)} In general, we do not need to convolve with a uniformly distributed noise, but with an arbitrary noise process. That is, we use an arbitrary matrix Q t {\displaystyle Q_{t}} such that ∂ t q t ( y ) = ∑ x ∈ S Q t ( y , x ) q t ( x ) {\displaystyle \partial _{t}q_{t}(y)=\sum _{x\in S}Q_{t}(y,x)q_{t}(x)} where Q t {\displaystyle Q_{t}} is called the rate matrix. Any matrix may be used as a rate matrix if it has non-negative off-diagonals, and each column sums to 0: Q t ( y , x ) ≥ 0 ∀ y ≠ x , ∑ y ∈ S Q t ( y , x ) = 0 ∀ x {\displaystyle Q_{t}(y,x)\geq 0\quad \forall y\neq x,\quad \sum _{y\in S}Q_{t}(y,x)=0\quad \forall x} A continuous time Markov chain (CTMC) is defined by a continuous function Q {\displaystyle Q} that maps any time t ∈ [ 0 , T ) {\displaystyle t\in [0,T)} to a rate matrix Q t {\displaystyle Q_{t}} . Given the function Q {\displaystyle Q} , time-evolution under the CTMC is done as follows: Given state x t {\displaystyle x_{t}} at time t {\displaystyle t} , and given an infinitesimal d t {\displaystyle dt} , the state at t + d t {\displaystyle t+dt} is x t + d t {\displaystyle x_{t+dt}} , such that Pr ( x t + d t | x t ) = { 1 + Q t ( x t + d t , x t ) d t if x t + d t = x t Q t ( x t + d t , x t ) d t else {\displaystyle \Pr(x_{t+dt}|x_{t})={\begin{cases}1+Q_{t}(x_{t+dt},x_{t})dt&{\text{if }}x_{t+dt}=x_{t}\\Q_{t}(x_{t+dt},x_{t})dt&{\text{else}}\end{cases}}} This implies that the probability distribution function evolves according to ∂ t q t ( y ) = ∑ x ∈ S Q t ( y , x ) q t ( x ) {\displaystyle \partial _{t}q_{t}(y)=\sum _{x\in S}Q_{t}(y,x)q_{t}(x)} which is what we previously specified. === Backward process === Similarly to the case of continuous diffusion, in discrete diffusion, there exists a backward diffusion process Q ¯ t {\displaystyle {\bar {Q}}_{t}} : s ( x , t ) y := q t ( y ) q t ( x ) , Q ¯ t ( y , x ) := { s ( x , t ) y Q t ( x , y ) if y ≠ x − ∑ y : y ≠ x Q ¯ t ( y , x ) if y = x {\displaystyle s(x,t)_{y}:={\frac {q_{t}(y)}{q_{t}(x)}},\quad {\bar {Q}}_{t}(y,x):={\begin{cases}s(x,t)_{y}Q_{t}(x,y)&{\text{if }}y\neq x\\-\sum _{y:y\neq x}{\bar {Q}}_{t}(y,x)&{\text{if }}y=x\end{cases}}} where s ( x , t ) y {\displaystyle s(x,t)_{y}} should be interpreted as the discrete score or concrete score, since, abusing notation a bit, the score function is ∇ ln ⁡ ρ t ( x ) = 1 d x ( ρ t ( x + d x ) ρ t ( x ) − 1 ) {\displaystyle \nabla \ln \rho _{t}(x)={\frac {1}{dx}}\left({\frac {\rho _{t}(x+dx)}{\rho _{t}(x)}}-1\right)} . If we picture the distribution q t {\displaystyle q_{t}} as a bunch of point-masses, one per state x ∈ S {\displaystyle x\in S} , then the forward diffusion from time t {\displaystyle t} to t + d t {\displaystyle t+dt} is performed by removing Q t ( x , y ) q t ( y ) d t {\displaystyle Q_{t}(x,y)q_{t}(y)dt} from the mass at y {\displaystyle y} and moving it to the mass at x {\displaystyle x} , for each pair x ≠ y {\displaystyle x\neq y} . Thus, the process is reversed in detail by the CTMC defined by Q ¯ {\displaystyle {\bar {Q}}} , since Q ¯ t ( y , x ) q t ( x ) = Q t ( x , y ) q t ( y ) {\displaystyle {\bar {Q}}_{t}(y,x)q_{t}(x)=Q_{t}(x,y)q_{t}(y)} . Given Q ¯ t {\displaystyle {\bar {Q}}_{t}} , if we have a way to sample from q t {\displaystyle q_{t}} , then we can sample from q t − d t {\displaystyle q_{t-dt}} by first sampling x t ∼ q t {\displaystyle x_{t}\sim q_{t}} , then sampling x t − d t {\displaystyle x_{t-dt}} according to Pr ( x t − d t | x t ) = { 1 + Q ¯ t ( x t − d t , x t ) d t if x t − d t = x t Q ¯ t ( x t − d t , x t ) d t else {\displaystyle \Pr(x_{t-dt}|x_{t})={\begin{cases}1+{\bar {Q}}_{t}(x_{t-dt},x_{t})dt&{\text{if }}x_{t-dt}=x_{t}\\{\bar {Q}}_{t}(x_{t-dt},x_{t})dt&{\text{else}}\end{cases}}} === Overall plan of score-matching discrete diffusion modeling === Similar to score-matching continuous diffusion, score-matching discrete diffusion is a method to sample an initial distribution. If we have a certain function s θ {\displaystyle s_{\theta }} that approximates the true score function s θ ( x , t ) y ≈ s ( x , t ) y {\displaystyle s_{\theta }(x,t)_{y}\approx s(x,t)_{y}} , then it allows a corresponding Q ¯ θ {\displaystyle {\bar {Q}}^{\theta }} to be defined in the same way. If we also have a base distribution q base {\displaystyle q_{\text{base}}} such that it is easy to sample from, and approximately equal to the true terminal distribution q base ≈ q T {\displaystyle q_{\text{base}}\approx q_{T}} , then we can perform the backward CTMC with Q ¯ θ {\displaystyle {\bar {Q}}^{\theta }} and q T θ := q terminal {\displaystyle q_{T}^{\theta }:=q_{\text{terminal}}} . When both approximations are good, the backward CTMC would give q 0 θ ≈ q 0 {\displaystyle q_{0}^{\theta }\approx q_{0}} . This is the idea of score-matching discrete diffusion modeling. If q data {\displaystyle q_{\text{data}}} is sharp, in the sense that for some x , x ′ {\displaystyle x,x'} , we have q data ( x ) ≫ q data ( x ′ ) {\displaystyle q_{\text{data}}(x)\gg q_{\text{data}}(x')} , then the score function would diverge as 1 / t {\displaystyle 1/t} at the t → 0 {\displaystyle t\to 0} limit. To avoid this in practice, it is common to use early stopping, which is to stop the backward process at some time δ > 0 {\displaystyle \delta >0} , and sample from q δ θ {\displaystyle q_{\delta }^{\theta }} instead of q 0 θ {\displaystyle q_{0}^{\theta }} . === Tractable forward processes === The theory of CTMC works for any continuous choice of rate matrices Q {\displaystyle Q} . However, most choices are computationally expensive and cannot be used in practice. In the case of continuous diffusion, the gaussian noise is used for the simple reason that the sum of any number of gaussians is still a gaussian. This allows one to sample any x t ∼ ρ t {\displaystyle x_{t}\sim \rho _{t}} by sampling a single x 0 ∼ ρ 0 {\displaystyle x_{0}\sim \rho _{0}} , followed by a single gaussian noise z ∼ N ( 0 , I ) {\displaystyle z\sim {\mathcal {N}}(0,I)} , and let x t = α ¯ t x 0 + σ t z {\displaystyle x_{t}={\sqrt {{\bar {\alpha }}_{t}}}x_{0}+\sigma _{t}z} , without needing any x s {\displaystyle x_{s}} for any 0 < s < t {\displaystyle 0

Sharpness aware minimization

Sharpness Aware Minimization (SAM) is an optimization algorithm used in machine learning that aims to improve model generalization. The method seeks to find model parameters that are located in regions of the loss landscape with uniformly low loss values, rather than parameters that only achieve a minimal loss value at a single point. This approach is described as finding "flat" minima instead of "sharp" ones. The rationale is that models trained this way are less sensitive to variations between training and test data, which can lead to better performance on unseen data. The algorithm was introduced in a 2020 paper by a team of researchers including Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. == Underlying Principle == SAM modifies the standard training objective by minimizing a "sharpness-aware" loss. This is formulated as a minimax problem where the inner objective seeks to find the highest loss value in the immediate neighborhood of the current model weights, and the outer objective minimizes this value: min w max ‖ ϵ ‖ p ≤ ρ L train ( w + ϵ ) + λ ‖ w ‖ 2 2 {\displaystyle \min _{w}\max _{\|\epsilon \|_{p}\leq \rho }L_{\text{train}}(w+\epsilon )+\lambda \|w\|_{2}^{2}} In this formulation: w {\displaystyle w} represents the model's parameters (weights). L train {\displaystyle L_{\text{train}}} is the loss calculated on the training data. ϵ {\displaystyle \epsilon } is a perturbation applied to the weights. ρ {\displaystyle \rho } is a hyperparameter that defines the radius of the neighborhood (an L p {\displaystyle L_{p}} ball) to search for the highest loss. An optional L2 regularization term, scaled by λ {\displaystyle \lambda } , can be included. A direct solution to the inner maximization problem is computationally expensive. SAM approximates it by taking a single gradient ascent step to find the perturbation ϵ {\displaystyle \epsilon } . This is calculated as: ϵ ( w ) = ρ ∇ L train ( w ) ‖ ∇ L train ( w ) ‖ 2 {\displaystyle \epsilon (w)=\rho {\frac {\nabla L_{\text{train}}(w)}{\|\nabla L_{\text{train}}(w)\|_{2}}}} The optimization process for each training step involves two stages. First, an "ascent step" computes a perturbed set of weights, w adv = w + ϵ ( w ) {\displaystyle w_{\text{adv}}=w+\epsilon (w)} , by moving towards the direction of the highest local loss. Second, a "descent step" updates the original weights w {\displaystyle w} using the gradient calculated at these perturbed weights, ∇ L train ( w adv ) {\displaystyle \nabla L_{\text{train}}(w_{\text{adv}})} . This update is typically performed using a standard optimizer like SGD or Adam. == Application and Performance == SAM has been applied in various machine learning contexts, primarily in computer vision. Research has shown it can improve generalization performance in models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) on image datasets including ImageNet, CIFAR-10, and CIFAR-100. The algorithm has also been found to be effective in training models with noisy labels, where it performs comparably to methods designed specifically for this problem. Some studies indicate that SAM and its variants can improve out-of-distribution (OOD) generalization, which is a model's ability to perform well on data from distributions not seen during training. Other areas where it has been applied include gradual domain adaptation and mitigating overfitting in scenarios with repeated exposure to training examples. == Limitations == A primary limitation of SAM is its computational cost. By requiring two gradient computations (one for the ascent and one for the descent) per optimization step, it approximately doubles the training time compared to standard optimizers. The theoretical convergence properties of SAM are still under investigation. Some research suggests that with a constant step size, SAM may not converge to a stationary point. The accuracy of the single gradient step approximation for finding the worst-case perturbation may also decrease during the training process. The effectiveness of SAM can also be domain-dependent. While it has shown benefits for computer vision tasks, its impact on other areas, such as GPT-style language models where each training example is seen only once, has been reported as limited in some studies. Furthermore, while SAM seeks flat minima, some research suggests that not all flat minima necessarily lead to good generalization. The algorithm also introduces the neighborhood size ρ {\displaystyle \rho } as a new hyperparameter, which requires tuning. == Research, Variants, and Enhancements == Active research on SAM focuses on reducing its computational overhead and improving its performance. Several variants have been proposed to make the algorithm more efficient. These include methods that attempt to parallelize the two gradient computations, apply the perturbation to only a subset of parameters, or reduce the number of computation steps required. Other approaches use historical gradient information or apply SAM steps intermittently to lower the computational burden. To improve performance and robustness, variants have been developed that adapt the neighborhood size based on model parameter scales (Adaptive SAM or ASAM) or incorporate information about the curvature of the loss landscape (Curvature Regularized SAM or CR-SAM). Other research explores refining the perturbation step by focusing on specific components of the gradient or combining SAM with techniques like random smoothing. Theoretical work continues to analyze the algorithm's behavior, including its implicit bias towards flatter minima and the development of broader frameworks for sharpness-aware optimization that use different measures of sharpness.

Color vision

Color vision (CV), a feature of visual perception, is an ability to perceive differences between light composed of different frequencies independently of light intensity. Color perception is a part of the larger visual system and is mediated by a complex process between neurons that begins with differential stimulation of different types of photoreceptors by light entering the eye. Those photoreceptors then emit outputs that are propagated through many layers of neurons ultimately leading to higher cognitive functions in the brain. Color vision is found in many animals and is mediated by similar underlying mechanisms with common types of biological molecules and a complex history of the evolution of color vision within different animal taxa. In primates, color vision may have evolved under selective pressure for a variety of visual tasks including the foraging for nutritious young leaves, ripe fruit, and flowers, as well as detecting predator camouflage and emotional states in other primates. == Wavelength == Isaac Newton discovered that white light after being split into its component colors when passed through a dispersive prism could be recombined to make white light by passing them through a different prism. The visible light spectrum ranges from about 380 to 740 nanometers. Spectral colors (colors that are produced by a narrow band of wavelengths) such as red, orange, yellow, green, cyan, blue, and violet can be found in this range. These spectral colors do not refer to a single wavelength, but rather to a set of wavelengths: red, 625–740 nm; orange, 590–625 nm; yellow, 565–590 nm; green, 500–565 nm; cyan, 485–500 nm; blue, 450–485 nm; violet, 380–450 nm. Wavelengths longer or shorter than this range are called infrared or ultraviolet, respectively. Humans cannot generally see these wavelengths, but other animals may. === Hue detection === Sufficient differences in wavelength cause a difference in the perceived hue; the just-noticeable difference in wavelength varies from about 1 nm in the blue-green and yellow wavelengths to 10 nm and more in the longer red and shorter blue wavelengths. Although the human eye can distinguish up to a few hundred hues, when those pure spectral colors are mixed together or diluted with white light, the number of distinguishable chromaticities can be much higher. In very low light levels, vision is scotopic: light is detected by rod cells of the retina. Rods are maximally sensitive to wavelengths near 500 nm and play little, if any, role in color vision. In brighter light, such as daylight, vision is photopic: light is detected by cone cells which are responsible for color vision. Cones are sensitive to a range of wavelengths, but are most sensitive to wavelengths near 555 nm. Between these regions, mesopic vision comes into play and both rods and cones provide signals to the retinal ganglion cells. The shift in color perception from dim light to daylight gives rise to differences known as the Purkinje effect. The perception of "white" is formed by the entire spectrum of visible light, or by mixing colors of just a few wavelengths in animals with few types of color receptors. In humans, white light can be perceived by combining wavelengths such as red, green, and blue, or just a pair of complementary colors such as blue and yellow. === Non-spectral colors === There are a variety of colors in addition to spectral colors and their hues. These include grayscale colors, shades of colors obtained by mixing grayscale colors with spectral colors, violet-red colors, impossible colors, and metallic colors. Grayscale colors include white, gray, and black. Rods contain rhodopsin, which reacts to light intensity, providing grayscale coloring. Shades include colors such as pink or brown. Pink is obtained from mixing red and white. Brown may be obtained from mixing orange with gray or black. Navy is obtained from mixing blue and black. Violet-red colors include hues and shades of magenta. The light spectrum is a line on which violet is one end and the other is red, and yet we see hues of purple that connect those two colors. Impossible colors are a combination of cone responses that cannot be naturally produced. For example, medium cones cannot be activated completely on their own; if they were, we would see a 'hyper-green' color. == Dimensionality == Color vision is categorized foremost according to the dimensionality of the color gamut, which is defined by the number of primaries required to represent the color vision. This is generally equal to the number of photopsins expressed: a correlation that holds for vertebrates but not invertebrates. The common vertebrate ancestor possessed four photopsins (expressed in cones) plus rhodopsin (expressed in rods), so was tetrachromatic. However, many vertebrate lineages have lost one or many photopsin genes, leading to lower-dimension color vision. The dimensions of color vision range from 1-dimensional and up: == Physiology of color perception == Perception of color begins with specialized retinal cells known as cone cells. Cone cells contain different forms of opsin – a pigment protein – that have different spectral sensitivities. Humans contain three types, resulting in trichromatic color vision. Each individual cone contains pigments composed of opsin apoprotein covalently linked to a light-absorbing prosthetic group: either 11-cis-hydroretinal or, more rarely, 11-cis-dehydroretinal. The cones are conventionally labeled according to the ordering of the wavelengths of the peaks of their spectral sensitivities: short (S), medium (M), and long (L) cone types. These three types do not correspond well to particular colors as we know them. Rather, the perception of color is achieved by a complex process that starts with the differential output of these cells in the retina and which is finalized in the visual cortex and associative areas of the brain. For example, while the L cones have been referred to simply as red receptors, microspectrophotometry has shown that their peak sensitivity is in the greenish-yellow region of the spectrum. Similarly, the S cones and M cones do not directly correspond to blue and green, although they are often described as such. The RGB color model, therefore, is a convenient means for representing color but is not directly based on the types of cones in the human eye. The peak response of human cone cells varies, even among individuals with typical color vision; in some non-human species this polymorphic variation is even greater, and it may well be adaptive. === Theories === Two complementary theories of color vision are the trichromatic theory and the opponent process theory. The trichromatic theory, or Young–Helmholtz theory, proposed in the 19th century by Thomas Young and Hermann von Helmholtz, posits three types of cones preferentially sensitive to blue, green, and red, respectively. Others have suggested that the trichromatic theory is not specifically a theory of color vision but a theory of receptors for all vision, including color but not specific or limited to it. Equally, it has been suggested that the relationship between the phenomenal opponency described by Ewald Hering and the physiological opponent processes are not straightforward (see below), making of physiological opponency a mechanism that is relevant to the whole of vision, and not just to color vision alone. Hering proposed the opponent process theory in 1872. It states that the visual system interprets color in an antagonistic way: red vs. green, blue vs. yellow, black vs. white. Both theories are generally accepted as valid, describing different stages in visual physiology, visualized in the adjacent diagram. Green–magenta and blue–yellow are scales with mutually exclusive boundaries. In the same way that there cannot exist a "slightly negative" positive number, a single eye cannot perceive a bluish-yellow or a reddish-green. Although these two theories are both currently widely accepted theories, past and more recent work has led to criticism of the opponent process theory, stemming from a number of what are presented as discrepancies in the standard opponent process theory. For example, the phenomenon of an after-image of complementary color can be induced by fatiguing the cells responsible for color perception, by staring at a vibrant color for a length of time, and then looking at a white surface. This phenomenon of complementary colors shows that cyan, rather than green, is the complement of red, and that magenta, rather than red, is the complement of green. It therefore also shows that the reddish-green color supposed to be impossible by opponent process theory is actually the color yellow. Although this phenomenon is more readily explained by the trichromatic theory, explanations for the discrepancy may include alterations to the opponent process theory, such as redefining the opponent colors as red vs. cyan, to reflect this effect. Despite such criticis

T-distributed stochastic neighbor embedding

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by Geoffrey Hinton and Sam Roweis, where Laurens van der Maaten and Hinton proposed the t-distributed variant. It is a nonlinear dimensionality reduction technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate. A Riemannian variant is UMAP. t-SNE has been used for visualization in a wide range of applications, including genomics, computer security research, natural language processing, music analysis, cancer research, bioinformatics, geological domain interpretation, and biomedical signal processing. For a data set with n {\displaystyle n} elements, t-SNE runs in O ( n 2 ) {\displaystyle O(n^{2})} time and requires O ( n 2 ) {\displaystyle O(n^{2})} space. == Details == Given a set of N {\displaystyle N} high-dimensional objects x 1 , … , x N {\displaystyle \mathbf {x} _{1},\dots ,\mathbf {x} _{N}} , t-SNE first computes probabilities p i j {\displaystyle p_{ij}} that are proportional to the similarity of objects x i {\displaystyle \mathbf {x} _{i}} and x j {\displaystyle \mathbf {x} _{j}} , as follows. For i ≠ j {\displaystyle i\neq j} , define p j ∣ i = exp ⁡ ( − ‖ x i − x j ‖ 2 / 2 σ i 2 ) ∑ k ≠ i exp ⁡ ( − ‖ x i − x k ‖ 2 / 2 σ i 2 ) {\displaystyle p_{j\mid i}={\frac {\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{j}\rVert ^{2}/2\sigma _{i}^{2})}{\sum _{k\neq i}\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{k}\rVert ^{2}/2\sigma _{i}^{2})}}} and set p i ∣ i = 0 {\displaystyle p_{i\mid i}=0} . Note the above denominator ensures ∑ j p j ∣ i = 1 {\displaystyle \sum _{j}p_{j\mid i}=1} for all i {\displaystyle i} . As van der Maaten and Hinton explained: "The similarity of datapoint x j {\displaystyle x_{j}} to datapoint x i {\displaystyle x_{i}} is the conditional probability, p j | i {\displaystyle p_{j|i}} , that x i {\displaystyle x_{i}} would pick x j {\displaystyle x_{j}} as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x i {\displaystyle x_{i}} ." Now define p i j = p j ∣ i + p i ∣ j 2 N {\displaystyle p_{ij}={\frac {p_{j\mid i}+p_{i\mid j}}{2N}}} This is motivated because p i {\displaystyle p_{i}} and p j {\displaystyle p_{j}} from the N samples are estimated as 1/N, so the conditional probability can be written as p i ∣ j = N p i j {\displaystyle p_{i\mid j}=Np_{ij}} and p j ∣ i = N p j i {\displaystyle p_{j\mid i}=Np_{ji}} . Since p i j = p j i {\displaystyle p_{ij}=p_{ji}} , you can obtain previous formula. Also note that p i i = 0 {\displaystyle p_{ii}=0} and ∑ i , j p i j = 1 {\displaystyle \sum _{i,j}p_{ij}=1} . The bandwidth of the Gaussian kernels σ i {\displaystyle \sigma _{i}} is set in such a way that the entropy of the conditional distribution equals a predefined entropy using the bisection method. As a result, the bandwidth is adapted to the density of the data: smaller values of σ i {\displaystyle \sigma _{i}} are used in denser parts of the data space. The entropy increases with the perplexity of this distribution P i {\displaystyle P_{i}} ; this relation is seen as P e r p ( P i ) = 2 H ( P i ) {\displaystyle Perp(P_{i})=2^{H(P_{i})}} where H ( P i ) {\displaystyle H(P_{i})} is the Shannon entropy H ( P i ) = − ∑ j p j | i log 2 ⁡ p j | i . {\displaystyle H(P_{i})=-\sum _{j}p_{j|i}\log _{2}p_{j|i}.} The perplexity is a hand-chosen parameter of t-SNE, and as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.". Since the Gaussian kernel uses the Euclidean distance ‖ x i − x j ‖ {\displaystyle \lVert x_{i}-x_{j}\rVert } , it is affected by the curse of dimensionality, and in high dimensional data when distances lose the ability to discriminate, the p i j {\displaystyle p_{ij}} become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this. t-SNE aims to learn a d {\displaystyle d} -dimensional map y 1 , … , y N {\displaystyle \mathbf {y} _{1},\dots ,\mathbf {y} _{N}} (with y i ∈ R d {\displaystyle \mathbf {y} _{i}\in \mathbb {R} ^{d}} and d {\displaystyle d} typically chosen as 2 or 3) that reflects the similarities p i j {\displaystyle p_{ij}} as well as possible. To this end, it measures similarities q i j {\displaystyle q_{ij}} between two points in the map y i {\displaystyle \mathbf {y} _{i}} and y j {\displaystyle \mathbf {y} _{j}} , using a very similar approach. Specifically, for i ≠ j {\displaystyle i\neq j} , define q i j {\displaystyle q_{ij}} as q i j = ( 1 + ‖ y i − y j ‖ 2 ) − 1 ∑ k ∑ l ≠ k ( 1 + ‖ y k − y l ‖ 2 ) − 1 {\displaystyle q_{ij}={\frac {(1+\lVert \mathbf {y} _{i}-\mathbf {y} _{j}\rVert ^{2})^{-1}}{\sum _{k}\sum _{l\neq k}(1+\lVert \mathbf {y} _{k}-\mathbf {y} _{l}\rVert ^{2})^{-1}}}} and set q i i = 0 {\displaystyle q_{ii}=0} . Herein a heavy-tailed Student t-distribution (with one-degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map. The locations of the points y i {\displaystyle \mathbf {y} _{i}} in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution P {\displaystyle P} from the distribution Q {\displaystyle Q} , that is: K L ( P ∥ Q ) = ∑ i ≠ j p i j log ⁡ p i j q i j {\displaystyle \mathrm {KL} \left(P\parallel Q\right)=\sum _{i\neq j}p_{ij}\log {\frac {p_{ij}}{q_{ij}}}} The minimization of the Kullback–Leibler divergence with respect to the points y i {\displaystyle \mathbf {y} _{i}} is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs. == Output == While t-SNE plots often seem to display clusters, the visual clusters can be strongly influenced by the chosen parameterization (especially the perplexity) and so a good understanding of the parameters for t-SNE is needed. Such "clusters" can be shown to even appear in structured data with no clear clustering, and so may be false findings. Similarly, the size of clusters produced by t-SNE is not informative, and neither is the distance between clusters. Thus, interactive exploration may be needed to choose parameters and validate results. It has been shown that t-SNE can often recover well-separated clusters, and with special parameter choices, approximates a simple form of spectral clustering. == Software == A C++ implementation of Barnes-Hut is available on the github account of one of the original authors. The R package Rtsne implements t-SNE in R. ELKI contains tSNE, also with Barnes-Hut approximation scikit-learn, a popular machine learning library in Python implements t-SNE with both exact solutions and the Barnes-Hut approximation. Tensorboard, the visualization kit associated with TensorFlow, also implements t-SNE (online version) The Julia package TSne implements t-SNE