Nicholas M. W. Frosst is a Canadian computer scientist and musician. He co-founded Cohere, a Toronto-based artificial intelligence company. He is also the lead singer in the indie rock band Good Kid. == Early life and education == Frosst was born on January 5, 1993. Frosst earned a Bachelor of Science degree in computer science and cognitive science from the University of Toronto in 2015. He was a student of Geoffrey Hinton, who also hired Frosst at Google Brain. == Career == Frosst was among Geoffrey Hinton's earliest hires at Google Brain in Toronto, working as a machine learning researcher on deep learning and neural network architectures. He worked there from 2016 to 2020. Frosst co-founded Cohere with Aidan Gomez and Ivan Zhang in 2019. The company builds large language models and enterprise AI tools. Frosst has publicly explained Cohere's focus on industries like finance and health, where there are privacy and other regulatory considerations. Frosst has also spoken openly about his belief that artificial intelligence will not replace humans, but rather streamline and automate mundane tasks, and his belief that AGI is less "imminent" than many in the field claim. Frosst and the other Cohere co-founders were listed first on Maclean's AI Trailblazers Power List and The Logic's Innovation Leaders. == Music == After spending time in a prior band which played "weird" music featuring a glockenspiel, Frosst and fellow computer science students at the University of Toronto formed the indie rock band Good Kid in 2015. Frosst is the lead vocalist for the band. While on tour with the band, Frosst continues his work in the tech industry remotely. Frosst has described the band as way for him to relax and not constantly think about tech. His vocals have been compared to that of Kele Okereke. As of 2026, the band, which has performed at Lollapalooza, has 3.1 million monthly Spotify listeners. In 2024, the band was nominated for the Juno Awards Breakthrough Group of the Year. == Discography == === Good Kid === Can We Hang Out Sometime? (2026)
Eyes of Things
Eyes of Things (EoT) is the name of a project funded by the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement number 643924. The purpose of the project, which is funded under the Smart Cyber-physical systems topic, is to develop a generic hardware-software platform for embedded, efficient (i.e. battery-operated, wearable, mobile), computer vision, including deep learning inference. On November 29, 2018, the European Space Agency announced that it was testing the suitability of the device for space applications in advance of a flight in a Cubesat. == Motivation == EoT is based on the following tenets: Future embedded systems will have more intelligence and cognitive functionality. Vision is paramount to such intelligent capacity Unlike other sensors, vision requires intensive processing. Power consumption must be optimized if vision is to be used in mobile and wearable applications Cloud processing of edge-captured images is not sustainable. The sheer amount of visual data generated cannot be transferred to the cloud. Bandwidth is not sufficient and cloud servers cannot cope with it. == Partners == VISILAB group at University of Castilla–La Mancha (Coordinator) Movidius Awaiba Thales Security Solutions & Systems DFKI Fluxguide Evercam nVISO == Awards == 2019 Electronic Component and Systems Innovation Award by the European Commission 2018 HiPEAC Tech Transfer Award 2018 EC Innovation Radar - highlighting excellent innovations Award 2018 Internet of Things (IoT) Technology Research Award Pilot by Google 2016 Semifinalist "THE VISION SHOW STARTUP COMPETITION", Global Association for Vision Information, Boston US
Representation collapse
Representation collapse is a phenomenon in machine learning and representation learning where a model maps different inputs to the same or very similar embeddings, which means it loses important information about how the data is spread out. It is frequently encountered in self-supervised learning, especially within contrastive and non-contrastive frameworks, when training objectives or model architectures do not maintain variance across representations. Collapse results in degenerate solutions characterized by uninformative learned features, significantly impairing downstream task performance. Various techniques have been proposed to mitigate representation collapse, including the use of negative samples, architectural asymmetry, stop-gradient operations, variance regularization, and redundancy reduction objectives, as seen in methods such as SimCLR, BYOL, and VICReg. Comprehending and averting representation collapse is regarded as a fundamental challenge in the advancement of stable and efficient self-supervised learning systems.
Active learning (machine learning)
Active learning is a special case of machine learning in which a learning algorithm can interactively query a human user (or some other information source) to label new data points with the desired outputs. The human user must possess expertise in the problem domain, including the ability to consult authoritative sources when necessary. In statistics literature, it is sometimes also called optimal experimental design. The information source is also called teacher or oracle. There are situations in which unlabeled data is abundant but manual labeling is expensive. In such a scenario, learning algorithms can actively query the teacher for labels. Since the learner chooses the examples, the number of examples to learn a concept can often be much lower than the number required in normal supervised learning. However, there is a risk that the algorithm is overwhelmed by uninformative examples. Recent developments are dedicated to multi-label active learning, hybrid active learning and active learning in a single-pass (on-line) context, combining concepts from the field of machine learning (e.g. conflict and ignorance) with adaptive, incremental learning policies in the field of online machine learning. Using active learning allows for faster development of a machine learning algorithm, when comparative updates would require a quantum or super computer. Large-scale active learning projects may benefit from crowdsourcing frameworks such as Amazon Mechanical Turk that include many humans in the active learning loop. == Definitions == Let T be the total set of all data under consideration. For example, in a protein engineering problem, T would include all proteins that are known to have a certain interesting activity and all additional proteins that one might want to test for that activity. During each iteration, i, T is broken up into three subsets T K , i {\displaystyle \mathbf {T} _{K,i}} : Data points where the label is known. T U , i {\displaystyle \mathbf {T} _{U,i}} : Data points where the label is unknown. T C , i {\displaystyle \mathbf {T} _{C,i}} : A subset of TU,i that is chosen to be labeled. Most of the current research in active learning involves the best method to choose the data points for TC,i. == Scenarios == Pool-based sampling: In this approach, which is the most well known scenario, the learning algorithm attempts to evaluate the entire dataset before selecting data points (instances) for labeling. It is often initially trained on a fully labeled subset of the data using a machine-learning method such as logistic regression or SVM that yields class-membership probabilities for individual data instances. The candidate instances are those for which the prediction is most ambiguous. Instances are drawn from the entire data pool and assigned a confidence score, a measurement of how well the learner "understands" the data. The system then selects the instances for which it is the least confident and queries the teacher for the labels. The theoretical drawback of pool-based sampling is that it is memory-intensive and is therefore limited in its capacity to handle enormous datasets, but in practice, the rate-limiting factor is that the teacher is typically a (fatiguable) human expert who must be paid for their effort, rather than computer memory. Stream-based selective sampling: Here, each consecutive unlabeled instance is examined one at a time with the machine evaluating the informativeness of each item against its query parameters. The learner decides for itself whether to assign a label or query the teacher for each datapoint. As contrasted with Pool-based sampling, the obvious drawback of stream-based methods is that the learning algorithm does not have sufficient information, early in the process, to make a sound assign-label-vs ask-teacher decision, and it does not capitalize as efficiently on the presence of already labeled data. Therefore, the teacher is likely to spend more effort in supplying labels than with the pool-based approach. Membership query synthesis: This is where the learner generates synthetic data from an underlying natural distribution. For example, if the dataset are pictures of humans and animals, the learner could send a clipped image of a leg to the teacher and query if this appendage belongs to an animal or human. This is particularly useful if the dataset is small. The challenge here, as with all synthetic-data-generation efforts, is in ensuring that the synthetic data is consistent in terms of meeting the constraints on real data. As the number of variables/features in the input data increase, and strong dependencies between variables exist, it becomes increasingly difficult to generate synthetic data with sufficient fidelity. For example, to create a synthetic data set for human laboratory-test values, the sum of the various white blood cell (WBC) components in a white blood cell differential must equal 100, since the component numbers are really percentages. Similarly, the enzymes alanine transaminase (ALT) and aspartate transaminase (AST) measure liver function (though AST is also produced by other tissues, e.g., lung, pancreas) A synthetic data point with AST at the lower limit of normal range (8–33 units/L) with an ALT several times above normal range (4–35 units/L) in a simulated chronically ill patient would be physiologically impossible. == Query strategies == Algorithms for determining which data points should be labeled can be organized into a number of different categories, based upon their purpose: Balance exploration and exploitation: the choice of examples to label is seen as a dilemma between the exploration and the exploitation over the data space representation. This strategy manages this compromise by modelling the active learning problem as a contextual bandit problem. For example, Bouneffouf et al. propose a sequential algorithm named Active Thompson Sampling (ATS), which, in each round, assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for this sample point label. Expected model change: label those points that would most change the current model. Expected error reduction: label those points that would most reduce the model's generalization error. Exponentiated Gradient Exploration for Active Learning: In this paper, the author proposes a sequential algorithm named exponentiated gradient (EG)-active that can improve any active learning algorithm by an optimal random exploration. Uncertainty sampling: label those points for which the current model is least certain as to what the correct output should be. Query by committee: a variety of models are trained on the current labeled data, and vote on the output for unlabeled data; label those points for which the "committee" disagrees the most Querying from diverse subspaces or partitions: When the underlying model is a forest of trees, the leaf nodes might represent (overlapping) partitions of the original feature space. This offers the possibility of selecting instances from non-overlapping or minimally overlapping partitions for labeling. Variance reduction: label those points that would minimize output variance, which is one of the components of error. Conformal prediction: predicts that a new data point will have a label similar to old data points in some specified way and degree of the similarity within the old examples is used to estimate the confidence in the prediction. Mismatch-first farthest-traversal: The primary selection criterion is the prediction mismatch between the current model and nearest-neighbour prediction. It targets on wrongly predicted data points. The second selection criterion is the distance to previously selected data, the farthest first. It aims at optimizing the diversity of selected data. User-centered labeling strategies: Learning is accomplished by applying dimensionality reduction to graphs and figures like scatter plots. Then the user is asked to label the compiled data (categorical, numerical, relevance scores, relation between two instances). A wide variety of algorithms have been studied that fall into these categories. While the traditional AL strategies can achieve remarkable performance, it is often challenging to predict in advance which strategy is the most suitable in a particular situation. In recent years, meta-learning algorithms have been gaining in popularity. Some of them have been proposed to tackle the problem of learning AL strategies instead of relying on manually designed strategies. A benchmark which compares 'meta-learning approaches to active learning' to 'traditional heuristic-based Active Learning' may give intuitions if 'Learning active learning' is at the crossroads == Minimum marginal hyperplane == Some active learning algorithms are built upon support-vector machines (SVMs) and exploit the structure of the SVM to determine which data points to label. Such methods usually calculate the margin, W, of each u
A Logical Calculus of the Ideas Immanent in Nervous Activity
"A Logical Calculus of the Ideas Immanent in Nervous Activity" is a 1943 paper written by Warren Sturgis McCulloch and Walter Pitts, published in the journal The Bulletin of Mathematical Biophysics. The paper proposed a mathematical model of the nervous system as a network of simple logical elements, later known as artificial neurons, or McCulloch–Pitts neurons. These neurons receive inputs, perform a weighted sum, and fire an output signal based on a threshold function. By connecting these units in various configurations, McCulloch and Pitts demonstrated that their model could perform all logical functions. It is a seminal work in cognitive science, computational neuroscience, computer science, and artificial intelligence. It was a foundational result in automata theory. John von Neumann cited it as a significant result. == Mathematics == The artificial neuron used in the original paper is slightly different from the modern version. They considered neural networks that operate in discrete steps of time t = 0 , 1 , … {\displaystyle t=0,1,\dots } . The neural network contains a number of neurons. Let the state of a neuron i {\displaystyle i} at time t {\displaystyle t} be N i ( t ) {\displaystyle N_{i}(t)} . The state of a neuron can either be 0 or 1, standing for "not firing" and "firing". Each neuron also has a firing threshold θ {\displaystyle \theta } , such that it fires if the total input exceeds the threshold. Each neuron can connect to any other neuron (including itself) with positive synapses (excitatory) or negative synapses (inhibitory). That is, each neuron can connect to another neuron with a weight w {\displaystyle w} taking an integer value. A peripheral afferent is a neuron with no incoming synapses. We can regard each neural network as a directed graph, with the nodes being the neurons, and the directed edges being the synapses. A neural network has a circle or a circuit if there exists a directed circle in the graph. Let w i j ( t ) {\displaystyle w_{ij}(t)} be the connection weight from neuron j {\displaystyle j} to neuron i {\displaystyle i} at time t {\displaystyle t} , then its next state is N i ( t + 1 ) = H ( ∑ j = 1 n w i j ( t ) N j ( t ) − θ i ( t ) ) , {\displaystyle N_{i}(t+1)=H\left(\sum _{j=1}^{n}w_{ij}(t)N_{j}(t)-\theta _{i}(t)\right),} where H {\displaystyle H} is the Heaviside step function (outputting 1 if the input is greater than or equal to 0, and 0 otherwise). === Symbolic logic === The paper used, as a logical language for describing neural networks, "Language II" from The Logical Syntax of Language by Rudolf Carnap with some notations taken from Principia Mathematica by Alfred North Whitehead and Bertrand Russell. Language II covers substantial parts of classical mathematics, including real analysis and portions of set theory. To describe a neural network with peripheral afferents N 1 , N 2 , … , N p {\displaystyle N_{1},N_{2},\dots ,N_{p}} and non-peripheral afferents N p + 1 , N p + 2 , … , N n {\displaystyle N_{p+1},N_{p+2},\dots ,N_{n}} they considered logical predicate of form P r ( N 1 , N 2 , … , N p , t ) {\displaystyle Pr(N_{1},N_{2},\dots ,N_{p},t)} where P r {\displaystyle Pr} is a first-order logic predicate function (a function that outputs a boolean), N 1 , … , N p {\displaystyle N_{1},\dots ,N_{p}} are predicates that take t {\displaystyle t} as an argument, and t {\displaystyle t} is the only free variable in the predicate. Intuitively speaking, N 1 , … , N p {\displaystyle N_{1},\dots ,N_{p}} specifies the binary input patterns going into the neural network over all time, and P r ( N 1 , N 2 , … , N n , t ) {\displaystyle Pr(N_{1},N_{2},\dots ,N_{n},t)} is a function that takes some binary input patterns, and constructs an output binary pattern P r ( N 1 , N 2 , … , N n , 0 ) , P r ( N 1 , N 2 , … , N n , 1 ) , … {\displaystyle Pr(N_{1},N_{2},\dots ,N_{n},0),Pr(N_{1},N_{2},\dots ,N_{n},1),\dots } . A logical sentence P r ( N 1 , N 2 , … , N n , t ) {\displaystyle Pr(N_{1},N_{2},\dots ,N_{n},t)} is realized by a neural network iff there exists a time-delay T ≥ 0 {\displaystyle T\geq 0} , a neuron i {\displaystyle i} in the network, and an initial state for the non-peripheral neurons N p + 1 ( 0 ) , … , N n ( 0 ) {\displaystyle N_{p+1}(0),\dots ,N_{n}(0)} , such that for any time t {\displaystyle t} , the truth-value of the logical sentence is equal to the state of the neuron i {\displaystyle i} at time t + T {\displaystyle t+T} . That is, ∀ t = 0 , 1 , 2 , … , P r ( N 1 , N 2 , … , N p , t ) = N i ( t + T ) {\displaystyle \forall t=0,1,2,\dots ,\quad Pr(N_{1},N_{2},\dots ,N_{p},t)=N_{i}(t+T)} === Equivalence === In the paper, they considered some alternative definitions of artificial neural networks, and have shown them to be equivalent, that is, neural networks under one definition realizes precisely the same logical sentences as neural networks under another definition. They considered three forms of inhibition: relative inhibition, absolute inhibition, and extinction. The definition above is relative inhibition. By "absolute inhibition" they meant that if any negative synapse fires, then the neuron will not fire. By "extinction" they meant that if at time t {\displaystyle t} , any inhibitory synapse fires on a neuron i {\displaystyle i} , then θ i ( t + j ) = θ i ( 0 ) + b j {\displaystyle \theta _{i}(t+j)=\theta _{i}(0)+b_{j}} for j = 1 , 2 , 3 , … {\displaystyle j=1,2,3,\dots } , until the next time an inhibitory synapse fires on i {\displaystyle i} . It is required that b j = 0 {\displaystyle b_{j}=0} for all large j {\displaystyle j} . Theorem 4 and 5 state that these are equivalent. They considered three forms of excitation: spatial summation, temporal summation, and facilitation. The definition above is spatial summation (which they pictured as having multiple synapses placed close together, so that the effect of their firing sums up). By "temporal summation" they meant that the total incoming signal is ∑ τ = 0 T ∑ j = 1 n w i j ( t ) N j ( t − τ ) {\displaystyle \sum _{\tau =0}^{T}\sum _{j=1}^{n}w_{ij}(t)N_{j}(t-\tau )} for some T ≥ 1 {\displaystyle T\geq 1} . By "facilitation" they meant the same as extinction, except that b j ≤ 0 {\displaystyle b_{j}\leq 0} . Theorem 6 states that these are equivalent. They considered neural networks that do not change, and those that change by Hebbian learning. That is, they assume that at t = 0 {\displaystyle t=0} , some excitatory synaptic connections are not active. If at any t {\displaystyle t} , both N i ( t ) = 1 , N j ( t ) = 1 {\displaystyle N_{i}(t)=1,N_{j}(t)=1} , then any latent excitatory synapse between i , j {\displaystyle i,j} becomes active. Theorem 7 states that these are equivalent. === Logical expressivity === They considered "temporal propositional expressions" (TPE), which are propositional formulas with one free variable t {\displaystyle t} . For example, N 1 ( t ) ∨ N 2 ( t ) ∧ ¬ N 3 ( t ) {\displaystyle N_{1}(t)\vee N_{2}(t)\wedge \neg N_{3}(t)} is such an expression. Theorem 1 and 2 together showed that neural nets without circles are equivalent to TPE. For neural nets with loops, they noted that "realizable P r {\displaystyle Pr} may involve reference to past events of an indefinite degree of remoteness". These then encodes for sentences like "There was some x such that x was a ψ" or ( ∃ x ) ( ψ x ) {\displaystyle (\exists x)(\psi x)} . Theorems 8 to 10 showed that neural nets with loops can encode all first-order logic with equality and conversely, any looped neural networks is equivalent to a sentence in first-order logic with equality, thus showing that they are equivalent in logical expressiveness. As a remark, they noted that a neural network, if furnished with a tape, scanners, and write-heads, is equivalent to a Turing machine, and conversely, every Turing machine is equivalent to some such neural network. Thus, these neural networks are equivalent to Turing computability and Church's lambda-definability. == Context == === Previous work === The paper built upon several previous strands of work. In the symbolic logic side, it built on the previous work by Carnap, Whitehead, and Russell. This was contributed by Walter Pitts, who had a strong proficiency with symbolic logic. Pitts provided mathematical and logical rigor to McCulloch’s vague ideas on psychons (atoms of psychological events) and circular causality. In the neuroscience side, it built on previous work by the mathematical biology research group centered around Nicolas Rashevsky, of which McCulloch was a member. The paper was published in the Bulletin of Mathematical Biophysics, which was founded by Rashevsky in 1939. During the late 1930s, Rashevsky's research group was producing papers that had difficulty publishing in other journals at the time, so Rashevsky decided to found a new journal exclusively devoted to mathematical biophysics. Also in the Rashevsky's group was Alston Scott Householder, who in 1941 published an abstract model
Comparison of raster graphics editors
Raster graphics editors can be compared by many variables, including availability. == List == == General information == Basic general information about the editor: creator, company, license, etc. == Operating system support == The operating systems on which the editors can run natively, that is, without emulation, virtual machines or compatibility layers. In other words, the software must be specifically coded for the operation system; for example, Adobe Photoshop for Windows running on Linux with Wine does not fit. == Features == == Color spaces == == File support ==
Sample complexity
The sample complexity of a machine learning algorithm represents the number of training-samples that it needs in order to successfully learn a target function. More precisely, the sample complexity is the number of training-samples that we need to supply to the algorithm, so that the function returned by the algorithm is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1. There are two variants of sample complexity: The weak variant fixes a particular input-output distribution; The strong variant takes the worst-case sample complexity over all input-output distributions. The No free lunch theorem, discussed below, proves that, in general, the strong sample complexity is infinite, i.e. that there is no algorithm that can learn the globally-optimal target function using a finite number of training samples. However, if we are only interested in a particular class of target functions (e.g., only linear functions) then the sample complexity is finite, and it depends linearly on the VC dimension on the class of target functions. == Definition == Let X {\displaystyle X} be a space which we call the input space, and Y {\displaystyle Y} be a space which we call the output space, and let Z {\displaystyle Z} denote the product X × Y {\displaystyle X\times Y} . For example, in the setting of binary classification, X {\displaystyle X} is typically a finite-dimensional vector space and Y {\displaystyle Y} is the set { − 1 , 1 } {\displaystyle \{-1,1\}} . Fix a hypothesis space H {\displaystyle {\mathcal {H}}} of functions h : X → Y {\displaystyle h\colon X\to Y} . A learning algorithm over H {\displaystyle {\mathcal {H}}} is a computable map from Z {\displaystyle Z} to H {\displaystyle {\mathcal {H}}} . In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from X {\displaystyle X} to Y {\displaystyle Y} . Typical learning algorithms include empirical risk minimization, without or with Tikhonov regularization. Fix a loss function L : Y × Y → R ≥ 0 {\displaystyle {\mathcal {L}}\colon Y\times Y\to \mathbb {R} _{\geq 0}} , for example, the square loss L ( y , y ′ ) = ( y − y ′ ) 2 {\displaystyle {\mathcal {L}}(y,y')=(y-y')^{2}} , where h ( x ) = y ′ {\displaystyle h(x)=y'} . For a given distribution ρ {\displaystyle \rho } on X × Y {\displaystyle X\times Y} , the expected risk of a hypothesis (a function) h ∈ H {\displaystyle h\in {\mathcal {H}}} is E ( h ) := E ρ [ L ( h ( x ) , y ) ] = ∫ X × Y L ( h ( x ) , y ) d ρ ( x , y ) {\displaystyle {\mathcal {E}}(h):=\mathbb {E} _{\rho }[{\mathcal {L}}(h(x),y)]=\int _{X\times Y}{\mathcal {L}}(h(x),y)\,d\rho (x,y)} In our setting, we have h = A ( S n ) {\displaystyle h={\mathcal {A}}(S_{n})} , where A {\displaystyle {\mathcal {A}}} is a learning algorithm and S n = ( ( x 1 , y 1 ) , … , ( x n , y n ) ) ∼ ρ n {\displaystyle S_{n}=((x_{1},y_{1}),\ldots ,(x_{n},y_{n}))\sim \rho ^{n}} is a sequence of vectors which are all drawn independently from ρ {\displaystyle \rho } . Define the optimal risk E H ∗ = inf h ∈ H E ( h ) . {\displaystyle {\mathcal {E}}_{\mathcal {H}}^{}={\underset {h\in {\mathcal {H}}}{\inf }}{\mathcal {E}}(h).} Set h n = A ( S n ) {\displaystyle h_{n}={\mathcal {A}}(S_{n})} , for each sample size n {\displaystyle n} . h n {\displaystyle h_{n}} is a random variable and depends on the random variable S n {\displaystyle S_{n}} , which is drawn from the distribution ρ n {\displaystyle \rho ^{n}} . The algorithm A {\displaystyle {\mathcal {A}}} is called consistent if E ( h n ) {\displaystyle {\mathcal {E}}(h_{n})} probabilistically converges to E H ∗ {\displaystyle {\mathcal {E}}_{\mathcal {H}}^{}} . In other words, for all ϵ , δ > 0 {\displaystyle \epsilon ,\delta >0} , there exists a positive integer N {\displaystyle N} , such that, for all sample sizes n ≥ N {\displaystyle n\geq N} , we have Pr ρ n [ E ( h n ) − E H ∗ ≥ ε ] < δ . {\displaystyle \Pr _{\rho ^{n}}[{\mathcal {E}}(h_{n})-{\mathcal {E}}_{\mathcal {H}}^{}\geq \varepsilon ]<\delta .} The sample complexity of A {\displaystyle {\mathcal {A}}} is then the minimum N {\displaystyle N} for which this holds, as a function of ρ , ϵ {\displaystyle \rho ,\epsilon } , and δ {\displaystyle \delta } . We write the sample complexity as N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} to emphasize that this value of N {\displaystyle N} depends on ρ , ϵ {\displaystyle \rho ,\epsilon } , and δ {\displaystyle \delta } . If A {\displaystyle {\mathcal {A}}} is not consistent, then we set N ( ρ , ϵ , δ ) = ∞ {\displaystyle N(\rho ,\epsilon ,\delta )=\infty } . If there exists an algorithm for which N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} is finite, then we say that the hypothesis space H {\displaystyle {\mathcal {H}}} is learnable. In others words, the sample complexity N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} defines the rate of consistency of the algorithm: given a desired accuracy ϵ {\displaystyle \epsilon } and confidence δ {\displaystyle \delta } , one needs to sample N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} data points to guarantee that the risk of the output function is within ϵ {\displaystyle \epsilon } of the best possible, with probability at least 1 − δ {\displaystyle 1-\delta } . In probably approximately correct (PAC) learning, one is concerned with whether the sample complexity is polynomial, that is, whether N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} is bounded by a polynomial in 1 / ϵ {\displaystyle 1/\epsilon } and 1 / δ {\displaystyle 1/\delta } . If N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} is polynomial for some learning algorithm, then one says that the hypothesis space H {\displaystyle {\mathcal {H}}} is PAC-learnable. This is a stronger notion than being learnable. == Unrestricted hypothesis space: infinite sample complexity == One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error. More formally, one asks whether there exists a learning algorithm A {\displaystyle {\mathcal {A}}} , such that, for all ϵ , δ > 0 {\displaystyle \epsilon ,\delta >0} , there exists a positive integer N {\displaystyle N} such that for all n ≥ N {\displaystyle n\geq N} , we have sup ρ ( Pr ρ n [ E ( h n ) − E H ∗ ≥ ε ] ) < δ , {\displaystyle \sup _{\rho }\left(\Pr _{\rho ^{n}}[{\mathcal {E}}(h_{n})-{\mathcal {E}}_{\mathcal {H}}^{}\geq \varepsilon ]\right)<\delta ,} where h n = A ( S n ) {\displaystyle h_{n}={\mathcal {A}}(S_{n})} , with S n = ( ( x 1 , y 1 ) , … , ( x n , y n ) ) ∼ ρ n {\displaystyle S_{n}=((x_{1},y_{1}),\ldots ,(x_{n},y_{n}))\sim \rho ^{n}} as above. The No Free Lunch Theorem says that without restrictions on the hypothesis space H {\displaystyle {\mathcal {H}}} , this is not the case, i.e., there always exist "bad" distributions for which the sample complexity is arbitrarily large. Thus, in order to make statements about the rate of convergence of the quantity sup ρ ( Pr ρ n [ E ( h n ) − E H ∗ ≥ ε ] ) , {\displaystyle \sup _{\rho }\left(\Pr _{\rho ^{n}}[{\mathcal {E}}(h_{n})-{\mathcal {E}}_{\mathcal {H}}^{}\geq \varepsilon ]\right),} one must either constrain the space of probability distributions ρ {\displaystyle \rho } , e.g. via a parametric approach, or constrain the space of hypotheses H {\displaystyle {\mathcal {H}}} , as in distribution-free approaches. == Restricted hypothesis space: finite sample-complexity == The latter approach leads to concepts such as VC dimension and Rademacher complexity which control the complexity of the space H {\displaystyle {\mathcal {H}}} . A smaller hypothesis space introduces more bias into the inference process, meaning that E H ∗ {\displaystyle {\mathcal {E}}_{\mathcal {H}}^{}} may be greater than the best possible risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce more uniformly consistent functions. This trade-off leads to the concept of regularization. It is a theorem from VC theory that the following three statements are equivalent for a hypothesis space H {\displaystyle {\mathcal {H}}} : H {\displaystyle {\mathcal {H}}} is PAC-learnable. The VC dimension of H {\displaystyle {\mathcal {H}}} is finite. H {\displaystyle {\mathcal {H}}} is a uniform Glivenko-Cantelli class. This gives a way to prove that certain hypothesis spaces are PAC learnable, and by extension, learnable. === An example of a PAC-learnable hypothesis space === X = R d , Y = { − 1 , 1 } {\displaystyle X=\mathbb {R} ^{d},Y=\{-1,1\}} , and let H {\displaystyle {\mathcal {H}}} be the space of affine functions on X {\displaystyle X} , that is, functions of the form x ↦ ⟨ w , x ⟩ + b {\displaystyle x\mapsto \langl