Information scientist

Information scientist

The term information scientist developed in the latter part of the twentieth century by Wm. Hovey Smith to describe an individual, usually with a relevant subject degree (such as one in Information and Computer Science - CIS) or high level of subject knowledge, providing focused information to scientific and technical research staff in industry. It is a role quite distinct from and complementary to that of a librarian. Developments in end-user searching, together with some convergence between the roles of librarian and information scientist, have led to a diminution in its use in this context, and the term information officer or information professional (information specialist) are also now used. The term was, and is, also used for an individual carrying out research in information science. Brian C. Vickery mentions that the Institute of Information Scientists (IIS) was established in London during 1958 and lists the criteria put forward by this institute "Criteria for Information Science" (appendix 1) as well as his own "Areas of study in information science" (appendix 2). The IIS merged with the Library Association in 2002 to form the Chartered Institute of Library and Information Professionals (CILIP). == Notable Information Scientists == See also Award of Merit - Association for Information Science and Technology Marcia Bates David Blair (information technologist) Samuel C. Bradford Michael Buckland John M. Carroll Blaise Cronin Emilia Currás Brenda Dervin Eugene Garfield Paul B. Kantor Frederick Wilfrid Lancaster Calvin Mooers Tefko Saracevic Linda C. Smith Robert Saxton Taylor Brian Campbell Vickery Thomas D. Wilson == Additional reading == Ellis, David and Merete Haugan. (1997) "Modelling the information seeking patterns of engineers and research scientists in an industrial environment" (Journal of Documentation, Volume 53(4): pp. 384–403) Poole, Alex H. (2024). "'There's a big difference between going through life with the wind at your back, and going through life leaning into the wind': Feminism in Post-World War II Information Science". Proceedings of the Association for Information Science and Technology. 61: 300–313. doi:10.1002/pra2.1029. Vickery, Brian Campbell (1988) "Essays presented to B. C. Vickery" (Journal of Documentation, Volume 44, pp. 199–283). Vickery, B. & Vickery, A. (1987) Information Science in theory and practice (London: Bowker-Saur, pp. 361–369)

Kernel embedding of distributions

In machine learning, the kernel embedding of distributions (also called the kernel mean or mean map) comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space Ω {\displaystyle \Omega } on which a sensible kernel function (measuring similarity between elements of Ω {\displaystyle \Omega } ) may be defined. For example, various kernels have been proposed for learning from data which are: vectors in R d {\displaystyle \mathbb {R} ^{d}} , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song, Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in. The analysis of distributions is fundamental in machine learning and statistics, and many algorithms in these fields rely on information theoretic approaches such as entropy, mutual information, or Kullback–Leibler divergence. However, to estimate these quantities, one must first either perform density estimation, or employ sophisticated space-partitioning/bias-correction strategies which are typically infeasible for high-dimensional data. Commonly, methods for modeling complex distributions rely on parametric assumptions that may be unfounded or computationally challenging (e.g. Gaussian mixture models), while nonparametric methods like kernel density estimation (Note: the smoothing kernels in this context have a different interpretation than the kernels discussed here) or characteristic function representation (via the Fourier transform of the distribution) break down in high-dimensional settings. Methods based on the kernel embedding of distributions sidestep these problems and also possess the following advantages: Data may be modeled without restrictive assumptions about the form of the distributions and relationships between variables Intermediate density estimation is not needed Practitioners may specify the properties of a distribution most relevant for their problem (incorporating prior knowledge via choice of the kernel) If a characteristic kernel is used, then the embedding can uniquely preserve all information about a distribution, while thanks to the kernel trick, computations on the potentially infinite-dimensional RKHS can be implemented in practice as simple Gram matrix operations Dimensionality-independent rates of convergence for the empirical kernel mean (estimated using samples from the distribution) to the kernel embedding of the true underlying distribution can be proven. Learning algorithms based on this framework exhibit good generalization ability and finite sample convergence, while often being simpler and more effective than information theoretic methods Thus, learning via the kernel embedding of distributions offers a principled drop-in replacement for information theoretic approaches and is a framework which not only subsumes many popular methods in machine learning and statistics as special cases, but also can lead to entirely new learning algorithms. == Definitions == Let X {\displaystyle X} denote a random variable with domain Ω {\displaystyle \Omega } and distribution P {\displaystyle P} . Given a symmetric, positive-definite kernel k : Ω × Ω → R {\displaystyle k:\Omega \times \Omega \rightarrow \mathbb {R} } the Moore–Aronszajn theorem asserts the existence of a unique RKHS H {\displaystyle {\mathcal {H}}} on Ω {\displaystyle \Omega } (a Hilbert space of functions f : Ω → R {\displaystyle f:\Omega \to \mathbb {R} } equipped with an inner product ⟨ ⋅ , ⋅ ⟩ H {\displaystyle \langle \cdot ,\cdot \rangle _{\mathcal {H}}} and a norm ‖ ⋅ ‖ H {\displaystyle \|\cdot \|_{\mathcal {H}}} ) for which k {\displaystyle k} is a reproducing kernel, i.e., in which the element k ( x , ⋅ ) {\displaystyle k(x,\cdot )} satisfies the reproducing property ⟨ f , k ( x , ⋅ ) ⟩ H = f ( x ) ∀ f ∈ H , ∀ x ∈ Ω . {\displaystyle \langle f,k(x,\cdot )\rangle _{\mathcal {H}}=f(x)\qquad \forall f\in {\mathcal {H}},\quad \forall x\in \Omega .} One may alternatively consider x ↦ k ( x , ⋅ ) {\displaystyle x\mapsto k(x,\cdot )} as an implicit feature mapping φ : Ω → H {\displaystyle \varphi :\Omega \rightarrow {\mathcal {H}}} (which is therefore also called the feature space), so that k ( x , x ′ ) = ⟨ φ ( x ) , φ ( x ′ ) ⟩ H {\displaystyle k(x,x')=\langle \varphi (x),\varphi (x')\rangle _{\mathcal {H}}} can be viewed as a measure of similarity between points x , x ′ ∈ Ω . {\displaystyle x,x'\in \Omega .} While the similarity measure is linear in the feature space, it may be highly nonlinear in the original space depending on the choice of kernel. === Kernel embedding === The kernel embedding of the distribution P {\displaystyle P} in H {\displaystyle {\mathcal {H}}} (also called the kernel mean or mean map) is given by: μ X := E [ k ( X , ⋅ ) ] = E [ φ ( X ) ] = ∫ Ω φ ( x ) d P ( x ) {\displaystyle \mu _{X}:=\mathbb {E} [k(X,\cdot )]=\mathbb {E} [\varphi (X)]=\int _{\Omega }\varphi (x)\ \mathrm {d} P(x)} If P {\displaystyle P} allows a square integrable density p {\displaystyle p} , then μ X = E k p {\displaystyle \mu _{X}={\mathcal {E}}_{k}p} , where E k {\displaystyle {\mathcal {E}}_{k}} is the Hilbert–Schmidt integral operator. A kernel is characteristic if the mean embedding μ : { family of distributions over Ω } → H {\displaystyle \mu :\{{\text{family of distributions over }}\Omega \}\to {\mathcal {H}}} is injective. Each distribution can thus be uniquely represented in the RKHS and all statistical features of distributions are preserved by the kernel embedding if a characteristic kernel is used. === Empirical kernel embedding === Given n {\displaystyle n} training examples { x 1 , … , x n } {\displaystyle \{x_{1},\ldots ,x_{n}\}} drawn independently and identically distributed (i.i.d.) from P , {\displaystyle P,} the kernel embedding of P {\displaystyle P} can be empirically estimated as μ ^ X = 1 n ∑ i = 1 n φ ( x i ) {\displaystyle {\widehat {\mu }}_{X}={\frac {1}{n}}\sum _{i=1}^{n}\varphi (x_{i})} === Joint distribution embedding === If Y {\displaystyle Y} denotes another random variable (for simplicity, assume the co-domain of Y {\displaystyle Y} is also Ω {\displaystyle \Omega } with the same kernel k {\displaystyle k} which satisfies ⟨ φ ( x ) ⊗ φ ( y ) , φ ( x ′ ) ⊗ φ ( y ′ ) ⟩ = k ( x , x ′ ) k ( y , y ′ ) {\displaystyle \langle \varphi (x)\otimes \varphi (y),\varphi (x')\otimes \varphi (y')\rangle =k(x,x')k(y,y')} ), then the joint distribution P ( x , y ) ) {\displaystyle P(x,y))} can be mapped into a tensor product feature space H ⊗ H {\displaystyle {\mathcal {H}}\otimes {\mathcal {H}}} via C X Y = E [ φ ( X ) ⊗ φ ( Y ) ] = ∫ Ω × Ω φ ( x ) ⊗ φ ( y ) d P ( x , y ) {\displaystyle {\mathcal {C}}_{XY}=\mathbb {E} [\varphi (X)\otimes \varphi (Y)]=\int _{\Omega \times \Omega }\varphi (x)\otimes \varphi (y)\ \mathrm {d} P(x,y)} By the equivalence between a tensor and a linear map, this joint embedding may be interpreted as an uncentered cross-covariance operator C X Y : H → H {\displaystyle {\mathcal {C}}_{XY}:{\mathcal {H}}\to {\mathcal {H}}} from which the cross-covariance of functions f , g ∈ H {\displaystyle f,g\in {\mathcal {H}}} can be computed as Cov ⁡ ( f ( X ) , g ( Y ) ) := E [ f ( X ) g ( Y ) ] − E [ f ( X ) ] E [ g ( Y ) ] = ⟨ f , C X Y g ⟩ H = ⟨ f ⊗ g , C X Y ⟩ H ⊗ H {\displaystyle \operatorname {Cov} (f(X),g(Y)):=\mathbb {E} [f(X)g(Y)]-\mathbb {E} [f(X)]\mathbb {E} [g(Y)]=\langle f,{\mathcal {C}}_{XY}g\rangle _{\mathcal {H}}=\langle f\otimes g,{\mathcal {C}}_{XY}\rangle _{{\mathcal {H}}\otimes {\mathcal {H}}}} Given n {\displaystyle n} pairs of training examples { ( x 1 , y 1 ) , … , ( x n , y n ) } {\displaystyle \{(x_{1},y_{1}),\dots ,(x_{n},y_{n})\}} drawn i.i.d. from P {\displaystyle P} , we can also empirically estimate the joint distribution kernel embedding via C ^ X Y = 1 n ∑ i = 1 n φ ( x i ) ⊗ φ ( y i ) {\displaystyle {\widehat {\mathcal {C}}}_{XY}={\frac {1}{n}}\sum _{i=1}^{n}\varphi (x_{i})\otimes \varphi (y_{i})} === Conditional distribution embedding === Given a conditional distribution P ( y ∣ x ) , {\displaystyle P(y\mid x),} one can define the corresponding RKHS embedding as μ Y ∣ x = E [ φ ( Y ) ∣ X ] = ∫ Ω φ ( y ) d P ( y ∣ x ) {\displaystyle \mu _{Y\mid x}=\mathbb {E} [\varphi (Y)\mid X]=\int _{\Omega

Solomonoff's theory of inductive inference

Solomonoff's theory of inductive inference proves that, under its common sense assumptions (axioms), the best possible scientific model is the shortest algorithm that generates the empirical data under consideration. In addition to the choice of data, other assumptions are that, to avoid the post-hoc fallacy, the programming language must be chosen prior to the data and that the environment being observed is generated by an unknown algorithm. This is also called a theory of induction. Due to its basis in the dynamical (state-space model) character of Algorithmic Information Theory, it encompasses statistical as well as dynamical information criteria for model selection. It was introduced by Ray Solomonoff, based on probability theory and theoretical computer science. In essence, Solomonoff's induction derives the posterior probability of any computable theory, given a sequence of observed data. This posterior probability is derived from Bayes' rule and some universal prior, that is, a prior that assigns a positive probability to any computable theory. Solomonoff proved that this induction is incomputable (or more precisely, lower semi-computable), but noted that "this incomputability is of a very benign kind", and that it "in no way inhibits its use for practical prediction" (as it can be approximated from below more accurately with more computational resources). It is only "incomputable" in the benign sense that no scientific consensus is able to prove that the best current scientific theory is the best of all possible theories. However, Solomonoff's theory does provide an objective criterion for deciding among the current scientific theories explaining a given set of observations. Solomonoff's induction naturally formalizes Occam's razor by assigning larger prior credences to theories that require a shorter algorithmic description. == Origin == === Philosophical === The theory is based in philosophical foundations, and was founded by Ray Solomonoff around 1960. It is a mathematically formalized combination of Occam's razor and the Principle of Multiple Explanations. All computable theories which perfectly describe previous observations are used to calculate the probability of the next observation, with more weight put on the shorter computable theories. Marcus Hutter's universal artificial intelligence builds upon this to calculate the expected value of an action. === Principle === Solomonoff's induction has been argued to be the computational formalization of pure Bayesianism. To understand, recall that Bayesianism derives the posterior probability P [ T | D ] {\displaystyle \mathbb {P} [T|D]} of a theory T {\displaystyle T} given data D {\displaystyle D} by applying Bayes rule, which yields P [ T | D ] = P [ D | T ] P [ T ] P [ D | T ] P [ T ] + ∑ A ≠ T P [ D | A ] P [ A ] {\displaystyle \mathbb {P} [T|D]={\frac {\mathbb {P} [D|T]\mathbb {P} [T]}{\mathbb {P} [D|T]\mathbb {P} [T]+\sum _{A\neq T}\mathbb {P} [D|A]\mathbb {P} [A]}}} where theories A {\displaystyle A} are alternatives to theory T {\displaystyle T} . For this equation to make sense, the quantities P [ D | T ] {\displaystyle \mathbb {P} [D|T]} and P [ D | A ] {\displaystyle \mathbb {P} [D|A]} must be well-defined for all theories T {\displaystyle T} and A {\displaystyle A} . In other words, any theory must define a probability distribution over observable data D {\displaystyle D} . Solomonoff's induction essentially boils down to demanding that all such probability distributions be computable. Interestingly, the set of computable probability distributions is a subset of the set of all programs, which is countable. Similarly, the sets of observable data considered by Solomonoff were finite. Without loss of generality, we can thus consider that any observable data is a finite bit string. As a result, Solomonoff's induction can be defined by only invoking discrete probability distributions. Solomonoff's induction then allows to make probabilistic predictions of future data F {\displaystyle F} , by simply obeying the laws of probability. Namely, we have P [ F | D ] = E T [ P [ F | T , D ] ] = ∑ T P [ F | T , D ] P [ T | D ] {\displaystyle \mathbb {P} [F|D]=\mathbb {E} _{T}[\mathbb {P} [F|T,D]]=\sum _{T}\mathbb {P} [F|T,D]\mathbb {P} [T|D]} . This quantity can be interpreted as the average predictions P [ F | T , D ] {\displaystyle \mathbb {P} [F|T,D]} of all theories T {\displaystyle T} given past data D {\displaystyle D} , weighted by their posterior credences P [ T | D ] {\displaystyle \mathbb {P} [T|D]} . === Mathematical === The proof of the "razor" is based on the known mathematical properties of a probability distribution over a countable set. These properties are relevant because the infinite set of all programs is a denumerable set. The sum S of the probabilities of all programs must be exactly equal to one (as per the definition of probability) thus the probabilities must roughly decrease as we enumerate the infinite set of all programs, otherwise S will be strictly greater than one. To be more precise, for every ϵ {\displaystyle \epsilon } > 0, there is some length l such that the probability of all programs longer than l is at most ϵ {\displaystyle \epsilon } . This does not, however, preclude very long programs from having very high probability. Fundamental ingredients of the theory are the concepts of algorithmic probability and Kolmogorov complexity. The universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion. == Mathematical guarantees == === Solomonoff's completeness === The remarkable property of Solomonoff's induction is its completeness. In essence, the completeness theorem guarantees that the expected cumulative errors made by the predictions based on Solomonoff's induction are upper-bounded by the Kolmogorov complexity of the (stochastic) data generating process. The errors can be measured using the Kullback–Leibler divergence or the square of the difference between the induction's prediction and the probability assigned by the (stochastic) data generating process. === Solomonoff's uncomputability === Unfortunately, Solomonoff also proved that Solomonoff's induction is uncomputable. In fact, he showed that computability and completeness are mutually exclusive: any complete theory must be uncomputable. The proof of this is derived from a game between the induction and the environment. Essentially, any computable induction can be tricked by a computable environment, by choosing the computable environment that negates the computable induction's prediction. This fact can be regarded as an instance of the no free lunch theorem. == Modern applications == === Artificial intelligence === Though Solomonoff's inductive inference is not computable, several AIXI-derived algorithms approximate it in order to make it run on a modern computer. The more computing power they are given, the closer their predictions are to the predictions of inductive inference (their mathematical limit is Solomonoff's inductive inference). Another direction of inductive inference is based on E. Mark Gold's model of learning in the limit from 1967 and has developed since then more and more models of learning. The general scenario is the following: Given a class S of computable functions, is there a learner (that is, recursive functional) which for any input of the form (f(0),f(1),...,f(n)) outputs a hypothesis (an index e with respect to a previously agreed on acceptable numbering of all computable functions; the indexed function may be required consistent with the given values of f). A learner M learns a function f if almost all its hypotheses are the same index e, which generates the function f; M learns S if M learns every f in S. Basic results are that all recursively enumerable classes of functions are learnable while the class REC of all computable functions is not learnable. Many related models have been considered and also the learning of classes of recursively enumerable sets from positive data is a topic studied from Gold's pioneering paper in 1967 onwards. A far reaching extension of the Gold’s approach is developed by Schmidhuber's theory of generalized Kolmogorov complexities, which are kinds of super-recursive algorithms.

Coupled pattern learner

Coupled Pattern Learner (CPL) is a machine learning algorithm which couples the semi-supervised learning of categories and relations to forestall the problem of semantic drift associated with boot-strap learning methods. == Coupled Pattern Learner == Semi-supervised learning approaches using a small number of labeled examples with many unlabeled examples are usually unreliable as they produce an internally consistent, but incorrect set of extractions. CPL solves this problem by simultaneously learning classifiers for many different categories and relations in the presence of an ontology defining constraints that couple the training of these classifiers. It was introduced by Andrew Carlson, Justin Betteridge, Estevam R. Hruschka Jr. and Tom M. Mitchell in 2009. == CPL overview == CPL is an approach to semi-supervised learning that yields more accurate results by coupling the training of many information extractors. Basic idea behind CPL is that semi-supervised training of a single type of extractor such as ‘coach’ is much more difficult than simultaneously training many extractors that cover a variety of inter-related entity and relation types. Using prior knowledge about the relationships between these different entities and relations CPL makes unlabeled data as a useful constraint during training. For e.g., ‘coach(x)’ implies ‘person(x)’ and ‘not sport(x)’. == CPL description == === Coupling of predicates === CPL primarily relies on the notion of coupling the learning of multiple functions so as to constrain the semi-supervised learning problem. CPL constrains the learned function in two ways. Sharing among same-arity predicates according to logical relations Relation argument type-checking === Sharing among same-arity predicates === Each predicate P in the ontology has a list of other same-arity predicates with which P is mutually exclusive. If A is mutually exclusive with predicate B, A’s positive instances and patterns become negative instances and negative patterns for B. For example, if ‘city’, having an instance ‘Boston’ and a pattern ‘mayor of arg1’, is mutually exclusive with ‘scientist’, then ‘Boston’ and ‘mayor of arg1’ will become a negative instance and a negative pattern respectively for ‘scientist.’ Further, Some categories are declared to be a subset of another category. For e.g., ‘athlete’ is a subset of ‘person’. === Relation argument type-checking === This is a type checking information used to couple the learning of relations and categories. For example, the arguments of the ‘ceoOf’ relation are declared to be of the categories ‘person’ and ‘company’. CPL does not promote a pair of noun phrases as an instance of a relation unless the two noun phrases are classified as belonging to the correct argument types. === Algorithm description === Following is a quick summary of the CPL algorithm. Input: An ontology O, and a text corpus C Output: Trusted instances/patterns for each predicate for i=1,2,...,∞ do foreach predicate p in O do EXTRACT candidate instances/contextual patterns using recently promoted patterns/instances; FILTER candidates that violate coupling; RANK candidate instances/patterns; PROMOTE top candidates; end end ==== Inputs ==== A large corpus of Part-Of-Speech tagged sentences and an initial ontology with predefined categories, relations, mutually exclusive relationships between same-arity predicates, subset relationships between some categories, seed instances for all predicates, and seed patterns for the categories. ==== Candidate extraction ==== CPL finds new candidate instances by using newly promoted patterns to extract the noun phrases that co-occur with those patterns in the text corpus. CPL extracts, Category Instances Category Patterns Relation Instances Relation Patterns ==== Candidate filtering ==== Candidate instances and patterns are filtered to maintain high precision, and to avoid extremely specific patterns. An instance is only considered for assessment if it co-occurs with at least two promoted patterns in the text corpus, and if its co-occurrence count with all promoted patterns is at least three times greater than its co-occurrence count with negative patterns. ==== Candidate ranking ==== CPL ranks candidate instances using the number of promoted patterns that they co-occur with so that candidates that occur with more patterns are ranked higher. Patterns are ranked using an estimate of the precision of each pattern. ==== Candidate promotion ==== CPL ranks the candidates according to their assessment scores and promotes at most 100 instances and 5 patterns for each predicate. Instances and patterns are only promoted if they co-occur with at least two promoted patterns or instances, respectively. == Meta-Bootstrap Learner == Meta-Bootstrap Learner (MBL) was also proposed by the authors of CPL. Meta-Bootstrap learner couples the training of multiple extraction techniques with a multi-view constraint, which requires the extractors to agree. It makes addition of coupling constraints on top of existing extraction algorithms, while treating them as black boxes, feasible. MBL assumes that the errors made by different extraction techniques are independent. Following is a quick summary of MBL. Input: An ontology O, a set of extractors ε Output: Trusted instances for each predicate for i=1,2,...,∞ do foreach predicate p in O do foreach extractor e in ε do Extract new candidates for p using e with recently promoted instances; end FILTER candidates that violate mutual-exclusion or type-checking constraints; PROMOTE candidates that were extracted by all extractors; end end Subordinate algorithms used with MBL do not promote any instance on their own, they report the evidence about each candidate to MBL and MBL is responsible for promoting instances. == Applications == In their paper authors have presented results showing the potential of CPL to contribute new facts to existing repository of semantic knowledge, Freebase

Hidden layer

In artificial neural networks, a hidden layer is a layer of artificial neurons that is neither an input layer nor an output layer. The simplest examples appear in multilayer perceptrons (MLP), as illustrated in the diagram. An MLP without any hidden layer is essentially just a linear model. With hidden layers and activation functions, however, nonlinearity is introduced into the model. In typical machine learning practice, the weights and biases are initialized, then iteratively updated during training via backpropagation.

Event condition action

Event condition action (ECA) is a short-cut for referring to the structure of active rules in event-driven architecture and active database systems. Such a rule traditionally consisted of three parts: The event part specifies the signal that triggers the invocation of the rule The condition part is a logical test that, if satisfied or evaluates to true, causes the action to be carried out The action part consists of updates or invocations on the local data This structure was used by the early research in active databases which started to use the term ECA. Current state of the art ECA rule engines use many variations on rule structure. Also other features not considered by the early research is introduced, such as strategies for event selection into the event part. In a memory-based rule engine, the condition could be some tests on local data and actions could be updates to object attributes. In a database system, the condition could simply be a query to the database, with the result set (if not null) being passed to the action part for changes to the database. In either case, actions could also be calls to external programs or remote procedures. Note that for database usage, updates to the database are regarded as internal events. As a consequence, the execution of the action part of an active rule can match the event part of the same or another active rule, thus triggering it. The equivalent in a memory-based rule engine would be to invoke an external method that caused an external event to trigger another ECA rule. ECA rules can also be used in rule engines that use variants of the Rete algorithm for rule processing. == ECA rule engines == Rulecore Concurrent Rules Apart Database Detect Invocation Rules ConceptBase ECArules

Model compression

Model compression is a machine learning technique for reducing the size of trained models. Large models can achieve high accuracy, but often at the cost of significant resource requirements. Compression techniques aim to compress models without significant performance reduction. Smaller models require less storage space, and consume less memory and compute during inference. Compressed models enable deployment on resource-constrained devices such as smartphones, embedded systems, edge computing devices, and consumer electronics computers. Efficient inference is also valuable for large corporations that serve large model inference over an API, allowing them to reduce computational costs and improve response times for users. Model compression is not to be confused with knowledge distillation, in which a smaller "student" model is trained to imitate the input-output behavior of a larger "teacher" model (as opposed to using the "teacher"'s trained parameters or the "teacher"'s training targets). == Techniques == Several techniques are employed for model compression. === Pruning === Pruning sparsifies a large model by setting some parameters to exactly zero. This effectively reduces the number of parameters. This allows the use of sparse matrix operations, which are faster than dense matrix operations. Pruning criteria can be based on magnitudes of parameters, the statistical pattern of neural activations, Hessian values, etc. === Quantization === Quantization reduces the numerical precision of weights and activations. For example, instead of storing weights as 32-bit floating-point numbers, they can be represented using 8-bit integers. Low-precision parameters take up less space, and takes less compute to perform arithmetic with. It is also possible to quantize some parameters more aggressively than others, so for example, a less important parameter can have 8-bit precision while another, more important parameter, can have 16-bit precision. Inference with such models requires mixed-precision arithmetic. Quantized models can also be used during training (rather than after training). PyTorch implements automatic mixed-precision (AMP), which performs autocasting, gradient scaling, and loss scaling. === Low-rank factorization === Weight matrices can be approximated by low-rank matrices. Let W {\displaystyle W} be a weight matrix of shape m × n {\displaystyle m\times n} . A low-rank approximation is W ≈ U V T {\displaystyle W\approx UV^{T}} , where U {\displaystyle U} and V {\displaystyle V} are matrices of shapes m × k , n × k {\displaystyle m\times k,n\times k} . When k {\displaystyle k} is small, this both reduces the number of parameters needed to represent W {\displaystyle W} approximately, and accelerates matrix multiplication by W {\displaystyle W} . Low-rank approximations can be found by singular value decomposition (SVD). The choice of rank for each weight matrix is a hyperparameter, and jointly optimized as a mixed discrete-continuous optimization problem. The rank of weight matrices may also be pruned after training, taking into account the effect of activation functions like ReLU on the implicit rank of the weight matrices. == Training == Model compression may be decoupled from training, that is, a model is first trained without regard for how it might be compressed, then it is compressed. However, it may also be combined with training. The "train big, then compress" method trains a large model for a small number of training steps (less than it would be if it were trained to convergence), then heavily compress the model. It is found that at the same compute budget, this method results in a better model than lightly compressed, small models. In Deep Compression, the compression has three steps. First loop (pruning): prune all weights lower than a threshold, then finetune the network, then prune again, etc. Second loop (quantization): cluster weights, then enforce weight sharing among all weights in each cluster, then finetune the network, then cluster again, etc. Third step: Use Huffman coding to losslessly compress the model. The SqueezeNet paper reported that Deep Compression achieved a compression ratio of 35 on AlexNet, and a ratio of ~10 on SqueezeNets.