Language model benchmark

Language model benchmark

A language model benchmark is a standardized test designed to evaluate the performance of language models on various natural language processing tasks. These tests are intended for comparing different models' capabilities in areas such as language understanding, generation, and reasoning. Benchmarks generally consist of a dataset and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like answering questions, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field. In addition to accuracy, the metrics can include throughput, energy efficiency, bias, trust, and sustainability. == Overview == === Types === Benchmarks may be described by the following adjectives, not mutually exclusive: Classical: These tasks are studied in natural language processing, even before the advent of deep learning. Examples include the Penn Treebank for testing syntactic and semantic parsing, as well as bilingual translation benchmarked by BLEU scores. Question answering: These tasks have a text question and a text answer, often multiple-choice. They can be open-book or closed-book. Open-book QA resembles reading comprehension questions, with relevant passages included as annotation in the question, in which the answer appears. Closed-book QA includes no relevant passages. Closed-book QA is also called open-domain question-answering. Before the era of large language models, open-book QA was more common, and understood as testing information retrieval methods. Closed-book QA became common since GPT-2 as a method to measure knowledge stored within model parameters. Omnibus: An omnibus benchmark combines many benchmarks, often previously published. It is intended as an all-in-one benchmarking solution. Reasoning: These tasks are usually in the question-answering format, but are intended to be more difficult than standard question answering. Multimodal: These tasks require processing not only text, but also other modalities, such as images and sound. Examples include OCR and transcription. Agency: These tasks are for a language-model–based software agent that operates a computer for a user, such as editing images, browsing the web, etc. Adversarial: A benchmark is "adversarial" if the items in the benchmark are picked specifically so that certain models do badly on them. Adversarial benchmarks are often constructed after state of the art (SOTA) models have saturated (achieved 100% performance) a benchmark, to renew the benchmark. A benchmark is "adversarial" only at a certain moment in time, since what is adversarial may cease to be adversarial as newer SOTA models appear. Public/Private: A benchmark might be partly or entirely private, meaning that some or all of the questions are not publicly available. The idea is that if a question is publicly available, then it might be used for training, which would be "training on the test set" and invalidate the result of the benchmark. Usually, only the guardians of the benchmark have access to the private subsets, and to score a model on such a benchmark, one must send the model weights, or provide API access, to the guardians. The boundary between a benchmark and a dataset is not sharp. Generally, a dataset contains three "splits": training, test, and validation. Both the test and validation splits are essentially benchmarks. In general, a benchmark is distinguished from a test/validation dataset in that a benchmark is typically intended to be used to measure the performance of many different models that are not trained specifically for doing well on the benchmark, while a test/validation set is intended to be used to measure the performance of models trained specifically on the corresponding training set. In other words, a benchmark may be thought of as a test/validation set without a corresponding training set. Conversely, certain benchmarks may be used as a training set, such as the English Gigaword or the One Billion Word Benchmark, which in modern language is just the negative log-likelihood loss on a pretraining set with 1 billion words. Indeed, the distinction between benchmark and dataset in language models became sharper after the rise of the pretraining paradigm, whereby a model is first trained on massive, unlabeled datasets to learn general language patterns, syntax, and knowledge (pretraining), and the base model is then adapted to specific, downstream tasks using smaller, labeled datasets (fine-tuning). === Lifecycle === Generally, the life cycle of a benchmark consists of the following steps: Inception: A benchmark is published. It can be simply given as a demonstration of the power of a new model (implicitly) that others then picked up as a benchmark, or as a benchmark that others are encouraged to use (explicitly). Growth: More papers and models use the benchmark, and the performance on the benchmark grows. Maturity, degeneration or deprecation: A benchmark may be saturated, after which researchers move on to other benchmarks. Progress on the benchmark may also be neglected as the field moves to focus on other benchmarks. Renewal: A saturated benchmark can be upgraded to make it no longer saturated, allowing further progress. === Construction === Like datasets, benchmarks are typically constructed by several methods, individually or in combination: Web scraping: Ready-made question-answer pairs may be scraped online, such as from websites that teach mathematics and programming. Conversion: Items may be constructed programmatically from scraped web content, such as by blanking out named entities from sentences, and asking the model to fill in the blank. This was used for making the CNN/Daily Mail Reading Comprehension Task. Crowd sourcing: Items may be constructed by paying people to write them, such as on Amazon Mechanical Turk. This was used for making the MCTest. === Evaluation === Generally, benchmarks are fully automated. This limits the questions that can be asked. For example, with mathematical questions, "proving a claim" would be difficult to automatically check, while "calculate an answer with a unique integer answer" would be automatically checkable. With programming tasks, the answer can generally be checked by running unit tests, with an upper limit on runtime. The benchmark scores are of the following kinds: For multiple choice or cloze questions, common scores are accuracy (frequency of correct answer), precision, recall, F1 score, etc. pass@n: The model is given n {\displaystyle n} attempts to solve each problem. If any attempt is correct, the model earns a point. The pass@n score is the model's average score over all problems. k@n: The model makes n {\displaystyle n} attempts to solve each problem, but only k {\displaystyle k} attempts out of them are selected for submission. If any submission is correct, the model earns a point. The k@n score is the model's average score over all problems. cons@n: The model is given n {\displaystyle n} attempts to solve each problem. If the most common answer is correct, the model earns a point. The cons@n score is the model's average score over all problems. Here "cons" stands for "consensus" or "majority voting". The pass@n score can be estimated more accurately by making N > n {\displaystyle N>n} attempts, and use the unbiased estimator 1 − ( N − c n ) ( N n ) {\displaystyle 1-{\frac {\binom {N-c}{n}}{\binom {N}{n}}}} , where c {\displaystyle c} is the number of correct attempts. For less well-formed tasks, where the output can be any sentence, there are the following commonly used scores including BLEU ROUGE, METEOR, NIST, word error rate, LEPOR, CIDEr, and SPICE. === Issues === error: Some benchmark answers may be wrong. ambiguity: Some benchmark questions may be ambiguously worded. subjective: Some benchmark questions may not have an objective answer at all. This problem generally prevents creative writing benchmarks. Similarly, this prevents benchmarking writing proofs in natural language, though benchmarking proofs in a formal language is possible. open-ended: Some benchmark questions may not have a single answer of a fixed size. This problem generally prevents programming benchmarks from using more natural tasks such as "write a program for X", and instead uses tasks such as "write a function that implements specification X". inter-annotator agreement: Some benchmark questions may be not fully objective, such that even people would not agree with 100% on what the answer should be. This is common in natural language processing tasks, such as syntactic annotation. shortcut: Some benchmark questions may be easily solved by an "unintended" shortcut. For example, in the SNLI benchmark, having a negative word like "not" in the second sentence is a strong signal for the "Contradiction" category, regardless of what the se

Corel VideoStudio

Corel VideoStudio (formerly Ulead VideoStudio) is a video editing software package for Microsoft Windows. == Features == === Basic editing === The software allows storyboard and timeline-oriented editing. Various formats are supported for source clips, and the resulting video can be exported to a video file. DVD and AVCHD DVD authoring capabilities are included, and Blu-ray authoring is available via a plug-in. VideoStudio supports direct DV and HDV capture and burning. === Overlay === Users can overlay videos, images, and text. Using the overlay track, up to 50 clips can be displayed simultaneously. It can handle videos in MOV and AVI formats, including alpha channel, and images in PSP, PSD, PNG, and GIF formats. Clips that do not contain an alpha channel can have specific colours removed from the overlay video so that the required background or image is displayed in the foreground. === Proxy video files === VideoStudio supports high-definition video. Proxy files are smaller versions of the video source that stand in for the full-resolution source during editing to improve performance. === Plug-ins/bundles === VideoStudio supports VFX-type plug-ins from providers, including NewBlue and proDAD. proDAD plug-ins Roto-Pen, Script, Vitascene, and Mercalli-Stabilizer are bundled with X4 and later Ultimate Editions. == Version history == Ulead VideoStudio 4 (1999) Ulead VideoStudio 5 (2001) Ulead VideoStudio 6 (2002) Ulead VideoStudio 7 (2003) Ulead VideoStudio 8 (2004) Ulead VideoStudio 9 (2005) Ulead VideoStudio 10 plus. (2006) Corel Ulead VideoStudio 11 plus. (2007) Corel VideoStudio Pro X2 (v12, 2008) Corel VideoStudio Pro X3 (v13, 2010) 2011: Corel VideoStudio Pro X4 (v14, 2011) Adds support for stop motion animation, time-lapse mode photography, 3D movies, and 2nd generation Intel Core. Corel VideoStudio Pro X5 (v15, March 9, 2012): Adds HTML5 export (Comparison of HTML5 and Flash). Corel VideoStudio Pro X6 (v16, April 25, 2013): Windows 8 compatible. Adds UHD 4K support. Corel VideoStudio Pro X7 (v17, March 5, 2014): Software becomes 64-bit. Corel VideoStudio Pro X8 (v18, May 8, 2015): Several improvements. Corel VideoStudio Pro X9 (v19, February 16, 2016): Windows 10 compatible. Adds H.265 support, Multi-Camera Editor, and Match moving. Corel VideoStudio Pro X10 (v20, February 15, 2017): Adds Mask Creator, Track Transparency, and 360-degree video support. Corel VideoStudio Pro 2018 (v21, February 13, 2018): Adds split screen Video, Lens Correction, and 3D Title Editor. Corel VideoStudio Pro 2019 (v22, February 12, 2019): Adds Color Grading, Morph Transitions, and MultiCam Capture Lite. Corel VideoStudio Pro 2020 (v23, February 25, 2020). Corel VideoStudio Pro 2021 (v24, March 26, 2021): Adds Instant Project Templates, AR Stickers, and performance improvements (particularly regarding hardware acceleration). Corel VideoStudio Pro 2022 (v25, March 6, 2022): Adds face effects, GIF Creator, transitions for Camera Movements, a speech to text converter, and ProRes Smart Proxy.

Probabilistic latent semantic analysis

Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles) is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low-dimensional representation of the observed variables in terms of their affinity to certain hidden variables, just as in latent semantic analysis, from which PLSA evolved. Compared to standard latent semantic analysis which stems from linear algebra and downsizes the occurrence tables (usually via a singular value decomposition), probabilistic latent semantic analysis is based on a mixture decomposition derived from a latent class model. == Model == Considering observations in the form of co-occurrences ( w , d ) {\displaystyle (w,d)} of words and documents, PLSA models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions: P ( w , d ) = ∑ c P ( d ) P ( c | d ) P ( w | c ) = P ( d ) ∑ c P ( c | d ) P ( w | c ) {\displaystyle P(w,d)=\sum _{c}P(d)P(c|d)P(w|c)=P(d)\sum _{c}P(c|d)P(w|c)} with c {\displaystyle c} being the words' topic. Note that the number of topics is a hyperparameter that must be chosen in advance and is not estimated from the data. The first formulation is the symmetric formulation, where w {\displaystyle w} and d {\displaystyle d} are both generated from the latent class c {\displaystyle c} in similar ways (using the conditional probabilities P ( d | c ) {\displaystyle P(d|c)} and P ( w | c ) {\displaystyle P(w|c)} ), whereas the second formulation is the asymmetric formulation, where, for each document d {\displaystyle d} , a latent class is chosen conditionally to the document according to P ( c | d ) {\displaystyle P(c|d)} , and a word is then generated from that class according to P ( w | c ) {\displaystyle P(w|c)} . Although we have used words and documents in this example, the co-occurrence of any couple of discrete variables may be modelled in exactly the same way. So, the number of parameters is equal to c d + w c {\displaystyle cd+wc} . The number of parameters grows linearly with the number of documents. In addition, although PLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents. Their parameters are learned using the EM algorithm. == Application == PLSA may be used in a discriminative setting, via Fisher kernels. PLSA has applications in information retrieval and filtering, natural language processing, machine learning from text, bioinformatics, and related areas. It is reported that the aspect model used in the probabilistic latent semantic analysis has severe overfitting problems. == Extensions == Hierarchical extensions: Asymmetric: MASHA ("Multinomial ASymmetric Hierarchical Analysis") Symmetric: HPLSA ("Hierarchical Probabilistic Latent Semantic Analysis") Generative models: The following models have been developed to address an often-criticized shortcoming of PLSA, namely that it is not a proper generative model for new documents. Latent Dirichlet allocation – adds a Dirichlet prior on the per-document topic distribution Higher-order data: Although this is rarely discussed in the scientific literature, PLSA extends naturally to higher order data (three modes and higher), i.e. it can model co-occurrences over three or more variables. In the symmetric formulation above, this is done simply by adding conditional probability distributions for these additional variables. This is the probabilistic analogue to non-negative tensor factorisation. == History == This is an example of a latent class model (see references therein), and it is related to non-negative matrix factorization. The present terminology was coined in 1999 by Thomas Hofmann.

Santa Fe Trail problem

The Santa Fe Trail problem is a genetic programming exercise in which artificial ants search for food pellets according to a programmed set of instructions. The layout of food pellets in the Santa Fe Trail problem has become a standard for comparing different genetic programming algorithms and solutions. One method for programming and testing algorithms on the Santa Fe Trail problem is by using the NetLogo application. There is at least one case of a student creating a Lego robotic ant to solve the problem.

Hinge loss

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as ℓ ( y ) = max ( 0 , 1 − t ⋅ y ) {\displaystyle \ell (y)=\max(0,1-t\cdot y)} Note that y {\displaystyle y} should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, y = w ⋅ x + b {\displaystyle y=\mathbf {w} \cdot \mathbf {x} +b} , where ( w , b ) {\displaystyle (\mathbf {w} ,b)} are the parameters of the hyperplane and x {\displaystyle \mathbf {x} } is the input variable(s). When t and y have the same sign (meaning y predicts the right class) and | y | ≥ 1 {\displaystyle |y|\geq 1} , the hinge loss ℓ ( y ) = 0 {\displaystyle \ell (y)=0} . When they have opposite signs, ℓ ( y ) {\displaystyle \ell (y)} increases linearly with y, and similarly if | y | < 1 {\displaystyle |y|<1} , even if it has the same sign (correct prediction, but not by enough margin). The Hinge loss is not a proper scoring rule. == Extensions == While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion, it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed. For example, Crammer and Singer defined it for a linear classifier as ℓ ( y ) = max ( 0 , 1 + max y ≠ t w y x − w t x ) {\displaystyle \ell (y)=\max(0,1+\max _{y\neq t}\mathbf {w} _{y}\mathbf {x} -\mathbf {w} _{t}\mathbf {x} )} , where t {\displaystyle t} is the target label, w t {\displaystyle \mathbf {w} _{t}} and w y {\displaystyle \mathbf {w} _{y}} are the model parameters. Weston and Watkins provided a similar definition, but with a sum rather than a max: ℓ ( y ) = ∑ y ≠ t max ( 0 , 1 + w y x − w t x ) {\displaystyle \ell (y)=\sum _{y\neq t}\max(0,1+\mathbf {w} _{y}\mathbf {x} -\mathbf {w} _{t}\mathbf {x} )} . In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where w denotes the SVM's parameters, y the SVM's predictions, φ the joint feature function, and Δ the Hamming loss: ℓ ( y ) = max ( 0 , Δ ( y , t ) + ⟨ w , ϕ ( x , y ) ⟩ − ⟨ w , ϕ ( x , t ) ⟩ ) = max ( 0 , max y ∈ Y ( Δ ( y , t ) + ⟨ w , ϕ ( x , y ) ⟩ ) − ⟨ w , ϕ ( x , t ) ⟩ ) {\displaystyle {\begin{aligned}\ell (\mathbf {y} )&=\max(0,\Delta (\mathbf {y} ,\mathbf {t} )+\langle \mathbf {w} ,\phi (\mathbf {x} ,\mathbf {y} )\rangle -\langle \mathbf {w} ,\phi (\mathbf {x} ,\mathbf {t} )\rangle )\\&=\max(0,\max _{y\in {\mathcal {Y}}}\left(\Delta (\mathbf {y} ,\mathbf {t} )+\langle \mathbf {w} ,\phi (\mathbf {x} ,\mathbf {y} )\rangle \right)-\langle \mathbf {w} ,\phi (\mathbf {x} ,\mathbf {t} )\rangle )\end{aligned}}} . == Optimization == The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to model parameters w of a linear SVM with score function y = w ⋅ x {\displaystyle y=\mathbf {w} \cdot \mathbf {x} } that is given by ∂ ℓ ∂ w i = { − t ⋅ x i if t ⋅ y < 1 , 0 otherwise . {\displaystyle {\frac {\partial \ell }{\partial w_{i}}}={\begin{cases}-t\cdot x_{i}&{\text{if }}t\cdot y<1,\\0&{\text{otherwise}}.\end{cases}}} However, since the derivative of the hinge loss at t y = 1 {\displaystyle ty=1} is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's ℓ ( y ) = { 1 2 − t y if t y ≤ 0 , 1 2 ( 1 − t y ) 2 if 0 < t y < 1 , 0 if 1 ≤ t y {\displaystyle \ell (y)={\begin{cases}{\frac {1}{2}}-ty&{\text{if}}~~ty\leq 0,\\{\frac {1}{2}}(1-ty)^{2}&{\text{if}}~~0

Device-independent pixel

A device-independent pixel (also: density-independent pixel, dip, dp) is a unit of length. A typical use is to allow mobile device software to scale the display of information and user interaction to different screen sizes. The abstraction allows an application to work in pixels as a measurement, while the underlying graphics system converts the abstract pixel measurements of the application into real pixel measurements appropriate to the particular device. For example, on the Android operating system a device-independent pixel is equivalent to one physical pixel on a 160 dpi screen, while the Windows Presentation Foundation specifies one device-independent pixel as equivalent to 1/96th of an inch. As dp is a physical unit it has an absolute value which can be measured in traditional units, e.g. for Android devices 1 dp equals 1/160 of inch or 0.15875 mm. While traditional pixels only refer to the display of information, device-independent pixels may also be used to measure user input such as input on a touch screen device.

Information gain ratio

In decision tree learning, information gain ratio is a ratio of information gain to the intrinsic information. It was proposed by Ross Quinlan, to reduce a bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute. Information gain is also known as mutual information. == Information gain calculation == Information gain is the reduction in entropy produced from partitioning a set with attributes a {\displaystyle a} and finding the optimal candidate that produces the highest value: IG ( T , a ) = H ( T ) − H ( T | a ) , {\displaystyle {\text{IG}}(T,a)=\mathrm {H} {(T)}-\mathrm {H} {(T|a)},} where T {\displaystyle T} is a random variable and H ( T | a ) {\displaystyle \mathrm {H} {(T|a)}} is the entropy of T {\displaystyle T} given the value of attribute a {\displaystyle a} . The information gain is equal to the total entropy for an attribute if for each of the attribute values a unique classification can be made for the result attribute. In this case the relative entropies subtracted from the total entropy are 0. == Split information calculation == The split information value for a test is defined as follows: SplitInformation ( X ) = − ∑ i = 1 n N ( x i ) N ( x ) ∗ log ⁡ 2 N ( x i ) N ( x ) {\displaystyle {\text{SplitInformation}}(X)=-\sum _{i=1}^{n}{{\frac {\mathrm {N} (x_{i})}{\mathrm {N} (x)}}\log {_{2}}{\frac {\mathrm {N} (x_{i})}{\mathrm {N} (x)}}}} where X {\displaystyle X} is a discrete random variable with possible values x 1 , x 2 , . . . , x i {\displaystyle {x_{1},x_{2},...,x_{i}}} and N ( x i ) {\displaystyle N(x_{i})} being the number of times that x i {\displaystyle x_{i}} occurs divided by the total count of events N ( x ) {\displaystyle N(x)} where x {\displaystyle x} is the set of events. The split information value is a positive number that describes the potential worth of splitting a branch from a node. This in turn is the intrinsic value that the random variable possesses and will be used to remove the bias in the information gain ratio calculation. == Information gain ratio calculation == The information gain ratio is the ratio between the information gain and the split information value: IGR ( T , a ) = IG ( T , a ) / SplitInformation ( T ) {\displaystyle {\text{IGR}}(T,a)={\text{IG}}(T,a)/{\text{SplitInformation}}(T)} IGR ( T , a ) = − ∑ i = 1 n P ( T ) log ⁡ P ( T ) − ( − ∑ i = 1 n P ( T | a ) log ⁡ P ( T | a ) ) − ∑ i = 1 n N ( t i ) N ( t ) ∗ log ⁡ 2 N ( t i ) N ( t ) {\displaystyle {\text{IGR}}(T,a)={\frac {-\sum _{i=1}^{n}{\mathrm {P} (T)\log \mathrm {P} (T)}-(-\sum _{i=1}^{n}{\mathrm {P} (T|a)\log \mathrm {P} (T|a)})}{-\sum _{i=1}^{n}{{\frac {\mathrm {N} (t_{i})}{\mathrm {N} (t)}}\log {_{2}}{\frac {\mathrm {N} (t_{i})}{\mathrm {N} (t)}}}}}} == Example == Using weather data published by Fordham University, the table was created below: Using the table above, one can find the entropy, information gain, split information, and information gain ratio for each variable (outlook, temperature, humidity, and wind). These calculations are shown in the tables below: Using the above tables, one can deduce that Outlook has the highest information gain ratio. Next, one must find the statistics for the sub-groups of the Outlook variable (sunny, overcast, and rainy), for this example one will only build the sunny branch (as shown in the table below): One can find the following statistics for the other variables (temperature, humidity, and wind) to see which have the greatest effect on the sunny element of the outlook variable: Humidity was found to have the highest information gain ratio. One will repeat the same steps as before and find the statistics for the events of the Humidity variable (high and normal): Since the play values are either all "No" or "Yes", the information gain ratio value will be equal to 1. Also, now that one has reached the end of the variable chain with Wind being the last variable left, they can build an entire root to leaf node branch line of a decision tree. Once finished with reaching this leaf node, one would follow the same procedure for the rest of the elements that have yet to be split in the decision tree. This set of data was relatively small, however, if a larger set was used, the advantages of using the information gain ratio as the splitting factor of a decision tree can be seen more. == Advantages == Information gain ratio biases the decision tree against considering attributes with a large number of distinct values. For example, suppose that we are building a decision tree for some data describing a business's customers. Information gain ratio is used to decide which of the attributes are the most relevant. These will be tested near the root of the tree. One of the input attributes might be the customer's telephone number. This attribute has a high information gain, because it uniquely identifies each customer. Due to its high amount of distinct values, this will not be chosen to be tested near the root. == Disadvantages == Although information gain ratio solves the key problem of information gain, it creates another problem. If one is considering an amount of attributes that have a high number of distinct values, these will never be above one that has a lower number of distinct values. == Difference from information gain == Information gain's shortcoming is created by not providing a numerical difference between attributes with high distinct values from those that have less. Example: Suppose that we are building a decision tree for some data describing a business's customers. Information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. This attribute has a high information gain, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before. Information gain ratio's strength is that it has a bias towards the attributes with the lower number of distinct values. Below is a table describing the differences of information gain and information gain ratio when put in certain scenarios.