Web development is the process of designing, developing and maintaining websites and web apps. The term encompasses several different fields, but most commonly refers to the programming of websites. Front-end development is the act of developing the user interface and client-side code, while back-end development focuses on the infrastructure behind a website, mainly server-side code. Since the World Wide Web was released publicly in 1993, web development has evolved greatly, with websites changing from collections of static HTML pages to complex projects using frameworks, servers, and databases. == Overview == Web development includes many individual tasks, including web design, web content development, networking, and coding. Among web professionals, "web development" usually refers to the main non-design aspects of building websites: writing markup and coding. Web development is generally split into two fields: front-end development and back-end development. Front-end developers create the user interface of websites, turning web designs into HTML, CSS, and JavaScript code. Front-end developers must also make sure that websites work consistently across different browsers and devices. Back-end development, also known as server-side development, focuses on the infrastructure behind a website, including APIs, database management, and security. Some developers choose to be full-stack developers, meaning they work on both the front end and the back end. == History == The World Wide Web is often categorised into three generations: Web 1.0, Web 2.0, and Web 3.0 (or Web3). The Web was invented in 1989 and released to the public in 1993. In the early years of the web, retrospectively referred to as Web 1.0, websites were simply collections of static HTML files and had limited interactivity. After the introduction of JavaScript in 1995, websites could contain logic, allowing for interactivity. 
The following year CSS was released, allowing greater control over the styling of web pages. In 1999, the term Web 2.0 was coined by Darcy DiNucci. The term resurfaced in the early 2000s, as websites started to increase in complexity, requiring server-side services in addition to JavaScript. This led to the emergence of various new programming languages and frameworks designed for back-end services, such as PHP, Active Server Pages, and Jakarta Server Pages. These enabled websites to do additional server-side processing, such as accessing databases. Another shift in web development was the release of the iPhone in 2007. This created a new medium for accessing the web, requiring a new approach to web development and resulting in responsive web design, which allows a single website to appear different depending on the device running it. Later, progressive web apps were introduced, allowing websites to be installed on a device as an independent application. In the 2010s, JavaScript frameworks began to emerge, creating new ways to manipulate web pages and increasing compatibility between web browsers. jQuery was popular in the early 2010s, but was later surpassed by other frameworks such as React and Vue.js. In the mid-2020s, use of AI became prevalent among web developers, with the 2025 Stack Overflow survey showing over 80% of developers saying they use AI at least monthly in their development process.
Conservative morphological anti-aliasing
Conservative morphological anti-aliasing (CMAA) is an anti-aliasing technique originally developed by Filip Strugar at Intel. CMAA is an image-based, post-processing technique similar to morphological anti-aliasing. CMAA uses four main steps: image analysis for color discontinuities, locally dominant edge detection, simple shape handling, and lastly symmetrical long edge shape handling. A couple of years after CMAA was introduced, Intel unveiled an updated version, which it named CMAA2.
Differential evolution
Differential evolution (DE) is an evolutionary algorithm to optimize a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. Such methods are commonly known as metaheuristics as they make few or no assumptions about the optimized problem and can search very large spaces of candidate solutions. However, metaheuristics such as DE do not guarantee an optimal solution is ever found. DE is used for multidimensional real-valued functions but does not use the gradient of the problem being optimized, which means DE does not require the optimization problem to be differentiable, as is required by classic optimization methods such as gradient descent and quasi-Newton methods. DE can therefore also be used on optimization problems that are not even continuous, are noisy, change over time, etc. DE optimizes a problem by maintaining a population of candidate solutions and creating new candidate solutions by combining existing ones according to its simple formulae, and then keeping whichever candidate solution has the best score or fitness on the optimization problem at hand. In this way, the optimization problem is treated as a black box that merely provides a measure of quality given a candidate solution, and the gradient is therefore not needed. == History == Storn and Price introduced differential evolution in 1995. Books have been published on theoretical and practical aspects of using DE in parallel computing, multiobjective optimization, and constrained optimization; these books also contain surveys of application areas. Surveys on the multi-faceted research aspects of DE can be found in journal articles. == Algorithm == A basic variant of the DE algorithm works by having a population of candidate solutions (called agents). These agents are moved around in the search-space by using simple mathematical formulae to combine the positions of existing agents from the population. 
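The population-based loop outlined above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: it assumes the common "rand/1/bin" update, in which a trial component is `a[i] + F * (b[i] - c[i])`, and the names `differential_evolution`, `np_size`, `cr`, `f_weight` and the test function `sphere` are hypothetical choices made for this example.

```python
import random


def differential_evolution(f, bounds, np_size=20, cr=0.9, f_weight=0.8,
                           generations=300, seed=0):
    """Minimize f over a box using a basic DE loop (assumed rand/1/bin variant)."""
    if np_size < 4:
        raise ValueError("np_size must be at least 4")
    rng = random.Random(seed)
    n = len(bounds)
    # Initialize every agent at a random position inside the bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(np_size)]
    scores = [f(x) for x in pop]
    for _ in range(generations):
        for j in range(np_size):
            # Pick three agents distinct from each other and from agent j.
            others = [k for k in range(np_size) if k != j]
            a, b, c = (pop[k] for k in rng.sample(others, 3))
            # One randomly chosen component is always taken from the mutant.
            r_index = rng.randrange(n)
            y = [a[i] + f_weight * (b[i] - c[i])
                 if (rng.random() < cr or i == r_index) else pop[j][i]
                 for i in range(n)]
            fy = f(y)
            # Greedy selection: keep the trial position only if it is no worse.
            if fy <= scores[j]:
                pop[j], scores[j] = y, fy
    best = min(range(np_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Simple test function with its global minimum 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_f = differential_evolution(sphere, [(-5.0, 5.0)] * 2)
print(best_x, best_f)
```

Only function values are used, so `f` is treated as a black box. The sketch does not clip trial positions back into the bounds; whether to do so is a design choice that depends on the problem.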
If the new position of an agent is an improvement then it is accepted and forms part of the population; otherwise the new position is simply discarded. The process is repeated and by doing so it is hoped, but not guaranteed, that a satisfactory solution will eventually be discovered. Formally, let $f:\mathbb{R}^{n}\to\mathbb{R}$ be the fitness function which must be minimized (note that maximization can be performed by considering the function $h:=-f$ instead). The function takes a candidate solution as argument in the form of a vector of real numbers. It produces a real number as output which indicates the fitness of the given candidate solution. The gradient of $f$ is not known. The goal is to find a solution $\mathbf{m}$ for which $f(\mathbf{m})\leq f(\mathbf{p})$ for all $\mathbf{p}$ in the search-space, which means that $\mathbf{m}$ is the global minimum. Let $\mathbf{x}\in\mathbb{R}^{n}$ designate a candidate solution (agent) in the population. The basic DE algorithm can then be described as follows: Choose the parameters $\text{NP}\geq 4$, $\text{CR}\in[0,1]$, and $F\in[0,2]$. NP: $\text{NP}$ is the population size, i.e. the number of candidate agents or "parents". CR: The parameter $\text{CR}\in[0,1]$ is called the crossover probability. F: The parameter $F\in[0,2]$ is called the differential weight. Typical settings are $\text{NP}=10n$, $\text{CR}=0.9$ and $F=0.8$. Optimization performance may be greatly impacted by these choices; see below. Initialize all agents $\mathbf{x}$ with random positions in the search-space. 
Until a termination criterion is met (e.g. number of iterations performed, or adequate fitness reached), repeat the following: For each agent $\mathbf{x}$ in the population do: Pick three agents $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ from the population at random; they must be distinct from each other as well as from agent $\mathbf{x}$. ($\mathbf{a}$ is called the "base" vector.) Pick a random index $R\in\{1,\ldots,n\}$ where $n$ is the dimensionality of the problem being optimized. Compute the agent's potentially new position $\mathbf{y}=[y_{1},\ldots,y_{n}]$ as follows: For each $i\in\{1,\ldots,n\}$, pick a uniformly distributed random number $r_{i}\sim U(0,1)$. If $r_{i}<\text{CR}$

In computational learning theory, sample exclusion dimensions arise in the study of exact concept learning with queries. In algorithmic learning theory, a concept over a domain X is a Boolean function over X. Here we only consider finite domains. A partial approximation S of a concept c is a Boolean function over $Y\subseteq X$ such that c is an extension of S. Let C be a class of concepts and c be a concept (not necessarily in C). Then a specifying set for c with respect to C is a partial approximation S of c such that C contains at most one extension of S. If we have observed a specifying set for some concept with respect to C, then we have enough information to verify a concept in C with at most one more mind change. The exclusion dimension of a concept class C, denoted by XD(C), is the maximum, over concepts c' not in C, of the size of the minimum specifying set of c' with respect to C. 
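For very small finite domains, the definition of the exclusion dimension can be checked by brute force. The following is a rough sketch under stated assumptions: concepts are encoded as 0/1 tuples indexed by the domain elements, and the function names `min_specifying_set_size` and `exclusion_dimension`, as well as the example class of singleton concepts, are hypothetical and chosen only for illustration. The search is exponential in the domain size.

```python
from itertools import combinations, product


def min_specifying_set_size(c, concept_class, n):
    """Smallest |Y| such that at most one concept in the class agrees with c on Y."""
    for size in range(n + 1):
        for ys in combinations(range(n), size):
            # Count the concepts in the class that extend c restricted to Y.
            extensions = sum(
                all(h[i] == c[i] for i in ys) for h in concept_class
            )
            if extensions <= 1:
                return size
    return n  # Not reached when the class holds distinct concepts.


def exclusion_dimension(concept_class, n):
    """Maximum, over concepts outside the class, of the minimum specifying set size."""
    concept_class = set(concept_class)
    outside = [c for c in product((0, 1), repeat=n) if c not in concept_class]
    return max(
        (min_specifying_set_size(c, concept_class, n) for c in outside),
        default=0,
    )


def singletons(n):
    """Hypothetical example class: concepts that are true on exactly one point."""
    return {tuple(1 if i == j else 0 for i in range(n)) for j in range(n)}


print(exclusion_dimension(singletons(3), 3))
```

In this example the all-zero concept is the hardest to separate from the class: it has to be observed on all but one point before at most one singleton remains consistent with it.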
VITAL (Validating Investment Tool for Advancing Life Sciences) was a proprietary machine-learning board management software program developed by Aging Analytics, a company registered in Bristol (England) and dissolved in 2017. Andrew Garazha (the firm's Senior Analyst) declared that the project aimed "through iterative releases and updates to create a piece of software capable of making autonomous investment decisions." According to Nick Dyer-Witheford, VITAL 1.0 was a "basic algorithm". On 13 May 2014, Deep Knowledge Ventures, a Hong Kong venture capital firm, claimed to have appointed VITAL to its board of directors in order to prove that artificial intelligence could be an instrument for investment decision-making. The announcement received extensive press coverage, although commentators considered it a publicity stunt. Fortune reported in 2019 that VITAL was no longer in use. == Criticism == Academics and journalists viewed VITAL's board appointment with skepticism. University of Sheffield computer science professor Noel Sharkey called it "a publicity hype". Michael Osborne, a University of Oxford associate professor in machine learning, said it was "a gimmick to call that an actual board member". Simon Sharwood of The Register wrote that there was "a strong whiff of stunt and/or promotion about this". In a 2019 speech, the Chief Scientist of Australia, Alan Finkel, commented, "At the time, most of us probably dismissed Vital as a PR exercise. I admit, I used her story three years ago to get a laugh in one of my speeches." Florian Möslein, a law professor at the University of Marburg, wrote in 2018 that "Vital has widely been acknowledged as the 'world's first artificial intelligence company director'". Vice journalist Jason Koebler suggested that the software did not have any artificial intelligence capabilities and concluded "VITAL can’t talk, and it can’t hear, and it can’t be a real, functional executive of a company." 
Sharwood of The Register noted that because VITAL was not a natural person, it could not be a board member under Hong Kong's corporate governance laws. However, in a 2017 interview with The Nikkei, Dmitry Kaminskiy, managing partner of Deep Knowledge Ventures, stated that VITAL had observer status on the board and no voting rights. University of Sheffield computer science professor Noel Sharkey said of VITAL, "On first sight, it looks like a futuristic idea but on reflection it is really a little bit of publicity hype." Vice journalist Jason Koebler said "this is a gimmick" and said "There is literally nothing to suggest that VITAL has any sort of capabilities beyond any other proprietary analysis software". Michael Osborne, a University of Oxford associate professor in machine learning, found VITAL's appointment not to be credible, saying it is "a bit of a gimmick to call that an actual board member". Osborne said that a core duty of board members is to converse with each other, which the algorithm is incapable of doing, so its more likely function is to serve as a springboard for conversation among other board members. In a 2019 speech, the Chief Scientist of Australia, Alan Finkel, commented, "At the time, most of us probably dismissed Vital as a PR exercise. I admit, I used her story three years ago to get a laugh in one of my speeches." == Machine intelligence as board member == VITAL was created by a group of programmers employed by Aging Analytics. According to Andrew Garazha, Aging Analytics Senior Analyst, VITAL was not a machine learning algorithm, as the necessary datasets on investment rounds, intellectual property and clinical trial outcomes are generally not disclosed. Rather, VITAL used fuzzy logic based on 50 parameters to assess risk factors. Aging Analytics licensed the software to Deep Knowledge Ventures. It was used to help the human board members of Deep Knowledge Ventures make investment decisions in biotechnology companies. 
For instance, it supported investments in Insilico Medicine, which creates ways for computers to help find drugs in research into aging. VITAL also supported investing in Pathway Pharmaceuticals, which uses the OncoFinder algorithm to choose and appraise cancer treatments. According to Dmitry Kaminskiy, managing partner of Deep Knowledge Ventures, the motivation for using VITAL was the large number of failed investments in the biotechnology sector and the desire to avoid investing in companies likely to fail. == Ethical and legal implications == Scholars addressed questions around the safety, privacy, accountability, transparency and bias of algorithms. Writing in the philosophical journal Multitudes, the academic Ariel Kyrou raised questions about the consequences of a mistake made by an algorithm recommending a dangerous investment. He raised the hypothetical in which VITAL was able to persuade the board to invest in a startup that had the facade of doing research into treatment for age-associated ills, but in actuality was run by terrorists who were raising funds. Kyrou raised a series of questions about whom society would fault for VITAL's mistake. As the owner of VITAL, should Deep Knowledge Ventures be held accountable, or should the companies that supplied data to VITAL or the people who created VITAL be held liable? Simon Sharwood of The Register wrote that because the appointment of a software program to the board of directors is not legally feasible in Hong Kong, there is "a strong whiff of stunt and/or promotion about this". Quoting a Thomson Reuters website describing Hong Kong legislation related to corporate governance, Sharwood pointed out that in Hong Kong "the board comprises all of the directors of the company" and "a director must normally be a natural person, except that a private company may have a body corporate as its director if the company is not a member of a listed group." 
He concluded that since VITAL cannot be considered a "natural person", it is merely a "cosmetic" appointment to the board and that "this software is no more a Board member than Caligula's horse was a senator". Sharwood further argued that corporations frequently purchase directors and officers liability insurance but that it would be practically impossible to get such insurance for VITAL. Sharwood also wrote that were VITAL to be hacked, any misinformation it outputs could be considered "false and misleading communications". In the book Research Handbook on the Law of Artificial Intelligence, Florian Möslein wrote that VITAL could not become a director as defined in Hong Kong's corporate laws, so the other directors were simply treating it as "a member of [the] board with observer status". Lin Shaowei raised concerns in a Journal of East China University of Political Science and Law article about how the software's appearance raised a complex question about the relationship between corporate law and artificial intelligence. VITAL could be considered either a board director who has voting rights or an observer who does not. Lin said either choice raised questions about whether VITAL is subject to corporate law and who would be held accountable if VITAL recommends a choice that turns out to be damaging to the company. David Theo Goldberg, writing in Critical Times, a peer-reviewed journal in Critical Global Theory, argues that VITAL processed a dataset to predict the most remunerative investment opportunities. Basing his analysis on an article from Business Insider, Goldberg describes VITAL's predictive decision-making as based "on surface pattern recognition and the identification of regularities and/or irregularities". In other words, Goldberg asserts that "the normativity of the surface" explains the algorithmic knowledge of a "product" like VITAL. In Homo Deus, Yuval Noah Harari mentions VITAL as an example of the future risks that humankind faces. 
Harari argues that the human mind is being replaced by a world in which algorithms and data make the decisions. Specifically, he argues that "as algorithms push humans out of the job market," executive boards driven by artificial intelligence are more likely to give priority to algorithms over humans.

In mathematics, the intrinsic dimension of a subset can be thought of as the minimal number of variables needed to represent the subset. The concept has widespread applications in geometry, dynamical systems, signal processing, statistics, and other fields. Due to its widespread applications and vague conceptualization, there are many different ways to define it rigorously. Consequently, the same set might have different intrinsic dimensions according to different definitions. The intrinsic dimension can be used as a lower bound on the dimension into which a data set can be compressed through dimension reduction, but it can also be used as a measure of the complexity of the data set or signal. For a data set or signal of N variables, its intrinsic dimension M satisfies 0 ≤ M ≤ N, although estimators may yield higher values. == Exact dimension == === Differential === In differential geometry, given a differentiable manifold N and a submanifold M, the intrinsic dimension of M is its dimension. If N has n dimensions and M has m dimensions, then around any point in M there exists a local coordinate system $(x_{1},\dots,x_{m},x_{m+1},\dots,x_{n})$ of N such that the manifold M is simply the subset of N defined by $x_{m+1}=0,\dots,x_{n}=0$. === Metric === Given a mere metric space, we can still define its intrinsic dimension. The most general case is the Hausdorff dimension, though for metric spaces occurring in practice, the box-counting dimension and the packing dimension often are identical to the Hausdorff dimension. 
Let $(X,d)$ be a metric space and $A\subset X$ be totally bounded. Define the covering number $$N(A,\varepsilon)=\min\left\{k:A\subset\bigcup_{i=1}^{k}B\left(x_{i},\varepsilon\right)\right\}.$$ The metric entropy is $H(A,\varepsilon)=\log N(A,\varepsilon)$ (any log base). The upper and lower metric entropy dimensions are $$\overline{\dim}_{E}A=\limsup_{\varepsilon\downarrow 0}\frac{H(A,\varepsilon)}{\log(1/\varepsilon)},\quad\underline{\dim}_{E}A=\liminf_{\varepsilon\downarrow 0}\frac{H(A,\varepsilon)}{\log(1/\varepsilon)}.$$ If they are equal, then $\dim_{E}A$ is that common value, called the metric entropy dimension. The entropy dimensions are usually used in information theory, and especially coding theory, since entropy is involved in their definition. === Topological === If $X$ is merely a topological space, then we can still define its intrinsic dimension, using the topological dimension or Lebesgue covering dimension. An open cover of a topological space $X$ is a family of open sets $U_{\alpha}$ such that their union is the whole space, $\bigcup_{\alpha}U_{\alpha}=X$. The order or ply of an open cover $\mathfrak{A}=\{U_{\alpha}\}$ is the smallest number $m$ (if it exists) for which each point of the space belongs to at most $m$ open sets in the cover: in other words, $U_{\alpha_{1}}\cap\cdots\cap U_{\alpha_{m+1}}=\emptyset$ for $\alpha_{1},\ldots,\alpha_{m+1}$ distinct. A refinement of an open cover $\mathfrak{A}=\{U_{\alpha}\}$ is another open cover $\mathfrak{B}=\{V_{\beta}\}$ such that each $V_{\beta}$ is contained in some $U_{\alpha}$. 
The covering dimension of a topological space $X$ is defined to be the minimum value of $n$ such that every finite open cover $\mathfrak{A}$ of $X$ has an open refinement $\mathfrak{B}$ with order $n+1$. The refinement $\mathfrak{B}$ can always be chosen to be finite. Thus, if $n$ is finite, $V_{\beta_{1}}\cap\cdots\cap V_{\beta_{n+2}}=\emptyset$ for $\beta_{1},\ldots,\beta_{n+2}$ distinct. If no such minimal $n$ exists, the space is said to have infinite covering dimension. == Introductory example == Let $f(x_{1},x_{2})$ be a two-variable function (or signal) which is of the form $f(x_{1},x_{2})=g(x_{1})$ for some one-variable function $g$ which is not constant. This means that $f$ varies, in accordance with $g$, with the first variable or along the first coordinate. On the other hand, $f$ is constant with respect to the second variable or along the second coordinate. It is only necessary to know the value of one variable, namely the first, in order to determine the value of $f$. Hence, it is a two-variable function but its intrinsic dimension is one. A slightly more complicated example is $f(x_{1},x_{2})=g(x_{1}+x_{2})$. Here $f$ is still intrinsically one-dimensional, which can be seen by making the variable transformation $y_{1}=x_{1}+x_{2}$ and $y_{2}=x_{1}-x_{2}$, which gives $f\left(\frac{y_{1}+y_{2}}{2},\frac{y_{1}-y_{2}}{2}\right)=g\left(y_{1}\right)$. Since the variation in $f$ can be described by the single variable $y_{1}$, its intrinsic dimension is one. For the case that $f$ is constant, its intrinsic dimension is zero since no variable is needed to describe variation. For the general case, when the intrinsic dimension of the two-variable function $f$ is neither zero nor one, it is two. 
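The two-variable cases above can be explored numerically. The sketch below is one possible approach and rests on an assumption that the text does not state: the function is smooth, so its intrinsic dimension in this linear sense equals the number of independent directions spanned by its gradients at the sampled points. The names `gradient` and `intrinsic_dimension_2d`, the sample grid, and the tolerance are hypothetical choices for illustration.

```python
import math


def gradient(f, x1, x2, h=1e-5):
    """Central-difference estimate of the gradient of f at (x1, x2)."""
    d1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    d2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return d1, d2


def intrinsic_dimension_2d(f, points, tol=1e-6):
    """Return 0, 1 or 2 depending on how many directions the sampled gradients span."""
    grads = [gradient(f, x1, x2) for x1, x2 in points]
    nonzero = [g for g in grads if math.hypot(*g) > tol]
    if not nonzero:
        return 0  # No variation found at the sample points.
    for a1, a2 in nonzero:
        for b1, b2 in nonzero:
            # A non-zero 2x2 determinant means two gradients are not parallel.
            if abs(a1 * b2 - a2 * b1) > tol:
                return 2
    return 1  # All gradients are parallel: variation along a single direction.


grid = [(0.5 * i, 0.5 * j) for i in range(-2, 3) for j in range(-2, 3)]

print(intrinsic_dimension_2d(lambda x1, x2: 3.0, grid))
print(intrinsic_dimension_2d(lambda x1, x2: math.sin(x1), grid))
print(intrinsic_dimension_2d(lambda x1, x2: math.sin(x1 + x2), grid))
print(intrinsic_dimension_2d(lambda x1, x2: x1 ** 2 + x2 ** 2, grid))
```

The result depends on the sample points and the tolerance, so this is an estimate from finitely many samples rather than a proof of the intrinsic dimension.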
In the literature, functions which are of intrinsic dimension zero, one, or two are sometimes referred to as i0D, i1D or i2D, respectively. == Signal processing == In signal processing of multidimensional signals, the intrinsic dimension of the signal describes how many variables are needed to generate a good approximation of the signal. For an N-variable function $f$, the set of variables can be represented as an N-dimensional vector $\mathbf{x}$: $f=f(\mathbf{x})$ where $\mathbf{x}=(x_{1},\dots,x_{N})$. If, for some M-variable function $g$ and M × N matrix $\mathbf{A}$, it is the case that $f(\mathbf{x})=g(\mathbf{Ax})$ for all $\mathbf{x}$, and M is the smallest number for which such a relation between $f$ and $g$ can be found, then the intrinsic dimension of $f$ is M. The intrinsic dimension is a characterization of $f$; it is not an unambiguous characterization of $g$ or of $\mathbf{A}$. That is, if the above relation is satisfied for some $f$, $g$, and $\mathbf{A}$, it must also be satisfied for the same $f$ and for $g'$ and $\mathbf{A'}$ given by $g'(\mathbf{y})=g(\mathbf{By})$ and $\mathbf{A'}=\mathbf{B}^{-1}\mathbf{A}$, where $\mathbf{B}$ is a non-singular M × M matrix, since $f(\mathbf{x})=g'(\mathbf{A'x})=g(\mathbf{BA'x})=g(\mathbf{Ax})$. == The Fourier transform of signals of low intrinsic dimension == An N-variable function which has intrinsic dimension M < N has a characteristic Fourier transform. Intuitively, since this type of function is constant along one or several dimensions, its Fourier transform must appear like an impulse (the Fourier transform of a constant) along the same dimensions in the frequency domain. === A simple example === Let $f$ be a two-variable function which is i1D. 
This means that there exists a normalized vector $\mathbf{n}\in\mathbb{R}^{2}$ and a one-variable function $g$ such that $f(\mathbf{x})=g(\mathbf{n}^{\mathrm{T}}\mathbf{x})$ for all $\mathbf{x}\in\mathbb{R}^{2}$. If $F$ is the Fourier transform of $f$ (both are two-variable functions), it must be the case that $F(\mathbf{u})=G(\mathbf{n}^{\mathrm{T}}\mathbf{u})\cdot\delta(\mathbf{m}^{\mathrm{T}}\mathbf{u})$. Here $G$ is the Fourier transform of $g$ (both are one-variable functions), $\delta$ is the Dirac impulse function and $\mathbf{m}$ is a normalized vector in $\mathbb{R}^{2}$ perpendicular to $\mathbf{n}$. This means that $F$ vanishes everywhere except on the line $\mathbf{m}^{\mathrm{T}}\mathbf{u}=0$, which passes through the origin of the frequency domain and is parallel to $\mathbf{n}$ (that is, perpendicular to $\mathbf{m}$). Along this line $F$ varies according to $G$. === The general case === Let $f$ be an N-variable function which has intrinsic dimension M, that is, there exists an M-variable function $g$ and M × N matrix $\mathbf{A}$ such that $f(\mathbf{x})=g(\mathbf{Ax})$ for all $\mathbf{x}$. 
Its Fourier transform $F$ can then be described as follows: $F$ vanishes everywhere except on a subspace of dimension M; this subspace is spanned by the rows of the matrix $\mathbf{A}$; and within the subspace, $F$ varies according to $G$, the Fourier transform of $g$. == Generalizations == The type of intrinsic dimension described above assume

In the context of artificial neural networks, the rectifier or ReLU (rectified linear unit) activation function is an activation function defined as the non-negative part of its argument, i.e., the ramp function: $$\operatorname{ReLU}(x)=x^{+}=\max(0,x)=\frac{x+|x|}{2}=\begin{cases}x&\text{if }x>0,\\0&\text{if }x\leq 0\end{cases}$$ where $x$ is the input to a neuron. This is analogous to half-wave rectification in electrical engineering. ReLU is one of the most popular activation functions for artificial neural networks, and finds application in computer vision and speech recognition using deep neural nets and computational neuroscience. == History == The ReLU was first used by Alston Householder in 1941 as a mathematical abstraction of biological neural networks. Kunihiko Fukushima in 1969 used ReLU in the context of visual feature extraction in hierarchical neural networks. In 1998, Gregory Woodbury demonstrated that the rectified linear function could account for a broad range of emergent properties in the visual cortex. His work showed that a single unified model could drive the joint development of refined retinotopic maps, ocular dominance columns, and orientation selectivity. By utilizing the rectifier's "cutoff" property, Woodbury achieved a close quantitative fit to biological data, matching the spatial periodicities and topographic refinement patterns observed in macaque and cat cortical maps. 
Furthermore, he extended this framework to adult plasticity, accurately replicating the spatial and temporal dynamics of lesion-induced cortical reorganization. This research established that the rectified linear response was a necessary mechanism for the stable self-organisation and maintenance of complex, multi-feature neural maps. In 2000, Hahnloser et al. argued that ReLU approximates the biological relationship between neural firing rates and input current, in addition to enabling recurrent neural network dynamics to stabilise under weaker criteria. Prior to 2010, most activation functions used were the logistic sigmoid (which is inspired by probability theory; see logistic regression) and its more numerically efficient counterpart, the hyperbolic tangent. Around 2010, the use of ReLU became common again. Jarrett et al. (2009) noted that rectification by either absolute value or ReLU (which they called "positive part") was critical for object recognition in convolutional neural networks (CNNs), specifically because it allows average pooling without neighboring filter outputs cancelling each other out. They hypothesized that the use of sigmoid or tanh was responsible for poor performance in previous CNNs. Nair and Hinton (2010) made a theoretical argument that the softplus activation function should be used, in that the softplus function numerically approximates the sum of an exponential number of linear models that share parameters. They then proposed ReLU as a good approximation to it. Specifically, they began by considering a single binary neuron in a Boltzmann machine that takes $x$ as input, and produces 1 as output with probability $\sigma(x)=\frac{1}{1+e^{-x}}$. 
They then considered extending its range of output by making infinitely many copies of it, $X_{1},X_{2},X_{3},\dots$, that all take the same input, offset by the amounts $0.5,1.5,2.5,\dots$, with their outputs added together as $\sum_{i=1}^{\infty}X_{i}$. They then demonstrated that $\sum_{i=1}^{\infty}X_{i}$ is approximately equal to $\mathcal{N}(\log(1+e^{x}),\sigma(x))$, which is also approximately equal to $\operatorname{ReLU}(\mathcal{N}(x,\sigma(x)))$, where $\mathcal{N}$ stands for the Gaussian distribution. They also argued for another reason for using ReLU: that it allows "intensity equivariance" in image recognition. That is, multiplying the input image by a constant $k$ also multiplies the output. In contrast, this is false for other activation functions like sigmoid or tanh. They found that ReLU activation allowed good empirical performance in restricted Boltzmann machines. Glorot et al. (2011) argued that ReLU has the following advantages over sigmoid or tanh: ReLU is more similar to biological neurons' responses in their main operating regime. ReLU avoids vanishing gradients. ReLU is cheaper to compute. ReLU creates sparse representations naturally, because many hidden units output exactly zero for a given input. They also found empirically that deep networks trained with ReLU can achieve strong performance without unsupervised pre-training, especially on large, purely supervised tasks. In 2017, the rectified linear function became a central component of the transformer architecture introduced in the Vaswani et al. paper "Attention Is All You Need". 
Within every transformer layer, ReLU is utilized in the position-wise feed-forward networks (FFN), defined by Equation 2 of their paper: $$\operatorname{FFN}(x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2}$$ This equation is foundational to the model's capacity; while the attention mechanism determines the relationships between tokens, the ReLU-based FFN performs the majority of the numerical computation and houses the bulk of the model's parameters. The efficiency and scalability of this rectified framework triggered a global technological revolution, enabling the development of large language models that have had a profound economic impact. The industrial response to this architecture, including the massive expansion of AI-specific hardware and the birth of the generative AI sector, has positioned the transformer as a cornerstone of 21st-century infrastructure. During the post-2017 period of rapid AI advancement, the rectified linear unit function has been key to achieving increased model performance and scaling, because it zeros out responses that are immaterial for a given stimulus, preventing them from accumulating in massive-scale models. It is the complete silencing of the parts of the model found to be stimulus-irrelevant during learning that allows for scaling. As the stimulus-irrelevant proportion of the model grows, the responses of these highly numerous connections would inevitably accumulate during scaling, no matter how small each individual response is. Therefore, the rectified linear unit function, with its absolute zeroing property, enabled the scaling to hundred-billion-parameter models and beyond. Early transformer scaling giants like GPT-3 (2020) and Falcon-180B (2023) relied on the rectified linear unit function explicitly, while successors such as GPT-4 (2023) and Llama 3 (2024) utilized smoother variants like GELU or SwiGLU. 
These variants were used to improve training stability while fundamentally preserving the rectified principle of zeroing low responses. At the centre of modern artificial intelligence, ReLU and its variants maintain absolute zero response across the bulk of the model at any one time, while maintaining approximately linear responses for stimulus-relevant connections, enabling high performance on each specific cognitive task. This feature of activation sparsity has been critical for massive scaling and performance gains of AI models right up to the present day. == Advantages == Advantages of ReLU include: Sparse activation: for example, in a randomly initialized network, only about 50% of hidden units are activated (i.e. have a non-zero output). Better gradient propagation: fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions. Efficiency: only requires comparison and addition. Scale-invariant (homogeneous, or "intensity equivariance"): $\max(0,ax)=a\max(0,x)$ for $a\geq 0$. == Potential problems == Possible downsides can include: Non-differentiability at zero (however, it is differentiable anywhere else, and the value of the derivative at zero can be chosen to be 0 or 1 arbitrarily). Not zero-centered: ReLU outputs are always non-negative. This can make it harder for the network to learn during backpropagation, because gradient updates tend to push weights in one direction (positive or negative). Batch normalization can help address this. ReLU is unbounded. Redundancy of the parametrization: Because ReLU is scale-invariant, the network computes the exact same function by scaling the weights and biases in front of a ReLU activation by $k$, and the weights after by $1/k$. Dying ReLU: ReLU neurons can sometimes be pushed into states
Sample exclusion dimension
VITAL (machine learning software)
Intrinsic dimension
Rectified linear unit
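As a small illustration of the rectified linear unit text above, the following plain-Python sketch evaluates the two equivalent forms of the rectifier, checks the scale-invariance property for a non-negative factor, and applies the position-wise feed-forward formula $\operatorname{FFN}(x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2}$ to a single row vector. The function names `relu`, `relu_abs_form` and `ffn`, and the tiny weight matrices, are hypothetical and chosen only for this example; practical implementations would use an array library.

```python
def relu(x):
    """Rectifier written as max(0, x)."""
    return max(0.0, x)


def relu_abs_form(x):
    """Equivalent form (x + |x|) / 2."""
    return (x + abs(x)) / 2


def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward layer for one row vector x.

    w1 has shape len(x) x len(b1); w2 has shape len(b1) x len(b2).
    """
    # Hidden layer: ReLU applied elementwise to x W1 + b1.
    hidden = [
        relu(sum(x[i] * w1[i][j] for i in range(len(x))) + b1[j])
        for j in range(len(b1))
    ]
    # Output layer: hidden W2 + b2, with no activation.
    return [
        sum(hidden[j] * w2[j][k] for j in range(len(hidden))) + b2[k]
        for k in range(len(b2))
    ]


# Both forms agree on negative and positive inputs.
print(relu(-1.5), relu_abs_form(-1.5), relu(2.0), relu_abs_form(2.0))

# Scale invariance for a non-negative factor a.
a = 3.0
print(relu(a * 2.0) == a * relu(2.0), relu(a * -1.5) == a * relu(-1.5))

# Toy FFN: two inputs, three hidden units, one output.
w1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[2.0], [3.0], [4.0]]
b2 = [0.5]
print(ffn([1.0, -2.0], w1, b1, w2, b2))
```

In the toy example two of the three hidden pre-activations are negative and are zeroed by the rectifier, so only the first hidden unit contributes to the output.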