HTK (Hidden Markov Model Toolkit) is a proprietary software toolkit for handling HMMs. It is mainly intended for speech recognition, but has been used in many other pattern recognition applications that employ HMMs, including speech synthesis, character recognition and DNA sequencing. Originally developed at the Machine Intelligence Laboratory (formerly known as the Speech Vision and Robotics Group) of the Cambridge University Engineering Department (CUED), HTK is now being widely used among researchers who are working on HMMs.
List of large language models
A large language model (LLM) is a type of machine learning model designed for natural language processing tasks such as language generation. LLMs are language models with many parameters, and are trained with self-supervised learning on a vast amount of text. == List == For the training cost column, 1 petaFLOP-day equals 1 petaFLOP/sec × 1 day, or 8.64×1019 FLOP (floating point operations). Only the cost of the largest model is shown. The number of parameters is measured in billions, and the training cost is measured in petaFLOP-days. === 2018 === === 2019 === === 2020 === === 2021 === === 2022 === === 2023 === === 2024 === === 2025 === === 2026 ===
List of datasets for machine-learning research
These datasets are used in machine learning (ML) research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine-learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality unlabeled datasets for unsupervised learning can also be difficult and costly to produce. Many organizations, including governments, publish and share their datasets, often using common metadata formats (such as Croissant). The datasets are classified, based on the licenses, into two groups: open data and non-open data. The datasets from various governmental-bodies are presented in List of open government data sites. The datasets are ported on open data portals. They are made available for searching, depositing and accessing through interfaces like Open API. The datasets are made available as various sorted types and subtypes. == List of sorting used for datasets == The data portal is classified based on its type of license. The open source license based data portals are known as open data portals which are used by many government organizations and academic institutions. == List of open data portals == == List of portals suitable for multiple types of applications == The data portal sometimes lists a wide variety of subtypes of datasets pertaining to many machine learning applications. == List of portals suitable for a specific subtype of applications == The data portals which are suitable for a specific subtype of machine learning application are listed in the subsequent sections. == Image data == == Text data == These datasets consist primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis. === Reviews === === News articles === === Messages === === Twitter and tweets === === Dialogues === === Legal === === Other text === == Sound data == These datasets consist of sounds and sound features used for tasks such as speech recognition and speech synthesis. === Speech === === Music === === Other sounds === == Signal data == Datasets containing electric signal information requiring some sort of signal processing for further analysis. === Electrical === === Motion-tracking === === Other signals === == Chemical data == Datasets from physical systems. === Chemical Reactions with transition states (TS) === === OpenReACT-CHON-EFH === OpenReACT-CHON-EFH (Open Reaction Dataset of Atomic ConfiguraTions comprising C, H, O and N with Energies, Forces and Hessians) is a 2025 open-access benchmark for machine-learning interatomic potentials. RTP set – 35,087 stationary-point geometries (reactant, transition state and product) drawn from 11,961 elementary reactions, each labeled with density-functional energies, atomic forces and full Hessian matrices at the ωB97X-D/6-31G(d) level. IRC set – 34,248 structures along 600 minimum-energy reaction paths, used to test extrapolation beyond trained stationary points. NMS set – 62,527 off-equilibrium geometries generated by normal-mode sampling to probe model robustness under thermal perturbations. The collection underpins the study Does Hessian Data Improve the Performance of Machine Learning Potentials? and was used to train and benchmark the machine-learning interatomic potentials reported therein. The dataset itself is distributed under a CC licence via Figshare. == Physical data == Datasets from physical systems. === High-energy physics === === Systems === === Astronomy === === Earth science === === Other physical === == Biological data == Datasets from biological systems. === Human === === Animal === === Fungi === === Plant === === Microbe === === Drug discovery === == Anomaly data == == Question answering data == This section includes datasets that deals with structured data. == Dialog or instruction prompted data == This section includes datasets that contains multi-turn text with at least two actors, a "user" and an "agent". The user makes requests for the agent, which performs the request. == Cybersecurity == == Climate and sustainability == == Code data == == Multivariate data == === Financial === === Weather === === Census === === Transit === === Internet === === Games === === Other multivariate === == Curated repositories of datasets == As datasets come in myriad formats and can sometimes be difficult to use, there has been considerable work put into curating and standardizing the format of datasets to make them easier to use for machine learning research. OpenML: Web platform with Python, R, Java, and other APIs for downloading hundreds of machine learning datasets, evaluating algorithms on datasets, and benchmarking algorithm performance against dozens of other algorithms. PMLB: A large, curated repository of benchmark datasets for evaluating supervised machine learning algorithms. Provides classification and regression datasets in a standardized format that are accessible through a Python API. Metatext NLP: https://metatext.io/datasets web repository maintained by community, containing nearly 1000 benchmark datasets, and counting. Provides many tasks from classification to QA, and various languages from English, Portuguese to Arabic. Appen: Off The Shelf and Open Source Datasets hosted and maintained by the company. These biological, image, physical, question answering, signal, sound, text, and video resources number over 250 and can be applied to over 25 different use cases.
Genotypic and phenotypic repair
Genotypic and phenotypic repair are optional components of an evolutionary algorithm (EA). An EA reproduces essential elements of biological evolution as a computer algorithm in order to solve demanding optimization or planning tasks, at least approximately. A candidate solution is represented by a - usually linear - data structure that plays the role of an individual's chromosome. New solution candidates are generated by mutation and crossover operators following the example of biology. These offspring may be defective, which is corrected or compensated for by genotypic or phenotypic repair. == Description == Genotypic repair, also known as genetic repair, is the removal or correction of impermissible entries in the chromosome that violate restrictions. In phenotypic repair, the corrections are only made in the genotype-phenotype mapping and the chromosome remains unchanged. Michalewicz wrote about the importance of restrictions in real-world applications: "In general, constraints are an integral part of the formulation of any problem". Restriction violations are application-specific and therefore it depends on the current problem whether and which type of repair is useful. They can usually also be treated by a correspondingly extended evaluation and it depends on the problem which measures are possible and which is the most suitable. If a phenotypic repair is feasible, then it is usually the most efficient compared to the other measures. A survey on repair methods used as constraint handling techniques can be found in. Violations of the range limits of genes should be avoided as far as possible by the formulation of the genome. If this is not possible or if restrictions within the search space defined by the genome are involved, their violations are usually handled by the evaluation. This can be done, for example, by penalty functions that lower the fitness. Repair is often also required for combinatorial tasks. The application of a 1- or n-point crossover operator can, for example, lead to genes being missing in one of the child genomes that are present in duplicate in the other. In this case, a suitable genotypic repair measure is to move the surplus genes to the other genome in a positional manner. The use of the aforementioned operators in combinatorial tasks has also proven to be useful in combination with crossover types specially developed for permutations, at least for certain problems. Particularly in combinatorial problems, it has been observed that genotypic repair can promote premature convergence to a suboptimum, but can also significantly accelerate a successful search. Studies on various tasks have shown that this is application-dependent. An effective measure to avoid premature convergence is generally the use of structured populations instead of the usual panmictic ones. Sequence restrictions play a role in many scheduling tasks, for example when it comes to planning workflows. If, for example, it is specified that step A must be carried out before step B and the gene of step B is located before the gene of A in the chromosome, then there is an impermissible gene sequence. This is because the scheduling operation of step B requires the planned end of step A for correct scheduling, but this is not yet scheduled at the time gene B is processed. The problem can be solved in two ways: The scheduling operation of step B is postponed until the gene from step A has been processed. The genome remains unchanged and the repair only influences the genotype-phenotype mapping. Since only the phenotype is changed, this is referred to as phenotypic repair. If, on the other hand, the gene of step B is moved behind the gene of step A, this is a genotypic repair. The same applies to the alternative shift of gene A in front of gene B. In this case, genotypic repair has the disadvantage that it prevents a meaningful restructuring of the gene sequence in the chromosome if this requires several intermediate steps (mutations) that at least partially violate restrictions.
Teaching dimension
In computational learning theory, the teaching dimension of a concept class C is defined to be max c ∈ C { w C ( c ) } {\displaystyle \max _{c\in C}\{w_{C}(c)\}} , where w C ( c ) {\displaystyle {w_{C}(c)}} is the minimum size of a witness set for c in C. Intuitively, this measures the number of instances that are needed to identify a concept in the class, using supervised learning with examples provided by a helpful teacher who is trying to convey the concept as succinctly as possible. This definition was formulated in 1995 by Sally Goldman and Michael Kearns, based on earlier work by Goldman, Ron Rivest, and Robert Schapire. The teaching dimension of a finite concept class can be used to give a lower and an upper bound on the membership query cost of the concept class. In Stasys Jukna's book "Extremal Combinatorics", a lower bound is given for the teaching dimension in general: Let C be a concept class over a finite domain X. If the size of C is greater than 2 k ( | X | k ) , {\displaystyle 2^{k}{|X| \choose k},} then the teaching dimension of C is greater than k. However, there are more specific teaching models that make assumptions about teacher or learner, and can get lower values for the teaching dimension. For instance, several models are the classical teaching (CT) model, the optimal teacher (OT) model, recursive teaching (RT), preference-based teaching (PBT), and non-clashing teaching (NCT).
Puck App
Puck App is a mobile application that allows hockey players to quickly find and rent a hockey goalie. Founded in 2015 in Toronto, the application primarily operates throughout Canada. It is available on Apple's App Store and Google Play. == History == Puck App was founded in 2016 by Niki Sawni. Users can rate the goalies, message with available goalies, and coordinate skill levels. In 2017, Puck App expanded to Western Canada and has over 1,000 goalies registered. In 2018, Puck App charged approximately $40 CDN to rent a goalie with more than 2 hours notice. Previously, Puck App was a competitor to a similar application called GoalieUp. As of 2024, both companies have agreed to a merger deal.
Randomized weighted majority algorithm
The randomized weighted majority algorithm is an algorithm in machine learning theory for aggregating expert predictions to a series of decision problems. It is a simple and effective method based on weighted voting which improves on the mistake bound of the deterministic weighted majority algorithm. In fact, in the limit, its prediction rate can be arbitrarily close to that of the best-predicting expert. == Example == Imagine that every morning before the stock market opens, we get a prediction from each of our "experts" about whether the stock market will go up or down. Our goal is to somehow combine this set of predictions into a single prediction that we then use to make a buy or sell decision for the day. The principal challenge is that we do not know which experts will give better or worse predictions. The RWMA gives us a way to do this combination such that our prediction record will be nearly as good as that of the single expert which, in hindsight, gave the most accurate predictions. == Motivation == In machine learning, the weighted majority algorithm (WMA) is a deterministic meta-learning algorithm for aggregating expert predictions. In pseudocode, the WMA is as follows: initialize all experts to weight 1 for each round: add each expert's weight to the option they predicted predict the option with the largest weighted sum multiply the weights of all experts who predicted wrongly by 1 2 {\displaystyle {\frac {1}{2}}} Suppose there are n {\displaystyle n} experts and the best expert makes m {\displaystyle m} mistakes. Then, the weighted majority algorithm (WMA) makes at most 2.4 ( log 2 n + m ) {\displaystyle 2.4(\log _{2}n+m)} mistakes. This bound is highly problematic in the case of highly error-prone experts. Suppose, for example, the best expert makes a mistake 20% of the time; that is, in N = 100 {\displaystyle N=100} rounds using n = 10 {\displaystyle n=10} experts, the best expert makes m = 20 {\displaystyle m=20} mistakes. Then, the weighted majority algorithm only guarantees an upper bound of 2.4 ( log 2 10 + 20 ) ≈ 56 {\displaystyle 2.4(\log _{2}10+20)\approx 56} mistakes. As this is a known limitation of the weighted majority algorithm, various strategies have been explored in order to improve the dependence on m {\displaystyle m} . In particular, we can do better by introducing randomization. Drawing inspiration from the Multiplicative Weights Update Method algorithm, we will probabilistically make predictions based on how the experts have performed in the past. Similarly to the WMA, every time an expert makes a wrong prediction, we will decrement their weight. Mirroring the MWUM, we will then use the weights to make a probability distribution over the actions and draw our action from this distribution (instead of deterministically picking the majority vote as the WMA does). == Randomized weighted majority algorithm (RWMA) == The randomized weighted majority algorithm is an attempt to improve the dependence of the mistake bound of the WMA on m {\displaystyle m} . Instead of predicting based on majority vote, the weights, are used as probabilities for choosing the experts in each round and are updated over time (hence the name randomized weighted majority). Precisely, if w i {\displaystyle w_{i}} is the weight of expert i {\displaystyle i} , let W = ∑ i w i {\displaystyle W=\sum _{i}w_{i}} . We will follow expert i {\displaystyle i} with probability w i W {\displaystyle {\frac {w_{i}}{W}}} . This results in the following algorithm: initialize all experts to weight 1. for each round: add all experts' weights together to obtain the total weight W {\displaystyle W} choose expert i {\displaystyle i} randomly with probability w i W {\displaystyle {\frac {w_{i}}{W}}} predict as the chosen expert predicts multiply the weights of all experts who predicted wrongly by β {\displaystyle \beta } The goal is to bound the worst-case expected number of mistakes, assuming that the adversary has to select one of the answers as correct before we make our coin toss. This is a reasonable assumption in, for instance, the stock market example provided above: the variance of a stock price should not depend on the opinions of experts that influence private buy or sell decisions, so we can treat the price change as if it was decided before the experts gave their recommendations for the day. The randomized algorithm is better in the worst case than the deterministic algorithm (weighted majority algorithm): in the latter, the worst case was when the weights were split 50/50. But in the randomized version, since the weights are used as probabilities, there would still be a 50/50 chance of getting it right. In addition, generalizing to multiplying the weights of the incorrect experts by β < 1 {\displaystyle \beta <1} instead of strictly 1 2 {\displaystyle {\frac {1}{2}}} allows us to trade off between dependence on m {\displaystyle m} and log 2 n {\displaystyle \log _{2}n} . This trade-off will be quantified in the analysis section. == Analysis == Let W t {\displaystyle W_{t}} denote the total weight of all experts at round t {\displaystyle t} . Also let F t {\displaystyle F_{t}} denote the fraction of weight placed on experts which predict the wrong answer at round t {\displaystyle t} . Finally, let N {\displaystyle N} be the total number of rounds in the process. By definition, F t {\displaystyle F_{t}} is the probability that the algorithm makes a mistake on round t {\displaystyle t} . It follows from the linearity of expectation that if M {\displaystyle M} denotes the total number of mistakes made during the entire process, E [ M ] = ∑ t = 1 N F t {\displaystyle E[M]=\sum _{t=1}^{N}F_{t}} . After round t {\displaystyle t} , the total weight is decreased by ( 1 − β ) F t W t {\displaystyle \ (1-\beta )F_{t}W_{t}} , since all weights corresponding to a wrong answer are multiplied by β < 1 {\displaystyle \ \beta <1} . It then follows that W t + 1 = W t ( 1 − ( 1 − β ) F t ) {\displaystyle W_{t+1}=W_{t}(1-(1-\beta )F_{t})} . By telescoping, since W 1 = n {\displaystyle W_{1}=n} , it follows that the total weight after the process concludes is On the other hand, suppose that m {\displaystyle \ m} is the number of mistakes made by the best-performing expert. At the end, this expert has weight β m {\displaystyle \ \beta ^{m}} . It follows, then, that the total weight is at least this much; in other words, W ≥ β m {\displaystyle \ W\geq \beta ^{m}} . This inequality and the above result imply Taking the natural logarithm of both sides yields Now, the Taylor series of the natural logarithm is In particular, it follows that ln ( 1 − ( 1 − β ) F t ) < − ( 1 − β ) F t {\displaystyle \ \ln(1-(1-\beta )F_{t})<-(1-\beta )F_{t}} . Thus, Recalling that E [ M ] = ∑ t = 1 N F t {\displaystyle E[M]=\sum _{t=1}^{N}F_{t}} and rearranging, it follows that Now, as β → 1 {\displaystyle \beta \to 1} from below, the first constant tends to 1 {\displaystyle 1} ; however, the second constant tends to + ∞ {\displaystyle +\infty } . To quantify this tradeoff, define ε = 1 − β {\displaystyle \varepsilon =1-\beta } to be the penalty associated with getting a prediction wrong. Then, again applying the Taylor series of the natural logarithm, It then follows that the mistake bound, for small ε {\displaystyle \varepsilon } , can be written in the form ( 1 + ϵ 2 + O ( ε 2 ) ) m + ϵ − 1 ln ( n ) {\displaystyle \ \left(1+{\frac {\epsilon }{2}}+O(\varepsilon ^{2})\right)m+\epsilon ^{-1}\ln(n)} . In English, the less that we penalize experts for their mistakes, the more that additional experts will lead to initial mistakes but the closer we get to capturing the predictive accuracy of the best expert as time goes on. In particular, given a sufficiently low value of ε {\displaystyle \varepsilon } and enough rounds, the randomized weighted majority algorithm can get arbitrarily close to the correct prediction rate of the best expert. In particular, as long as m {\displaystyle m} is sufficiently large compared to ln ( n ) {\displaystyle \ln(n)} (so that their ratio is sufficiently small), we can assign we can obtain an upper bound on the number of mistakes equal to This implies that the "regret bound" on the algorithm (that is, how much worse it performs than the best expert) is sublinear, at O ( m ln ( n ) ) {\displaystyle O({\sqrt {m\ln(n)}})} . == Revisiting the motivation == Recall that the motivation for the randomized weighted majority algorithm was given by an example where the best expert makes a mistake 20% of the time. Precisely, in N = 100 {\displaystyle N=100} rounds, with n = 10 {\displaystyle n=10} experts, where the best expert makes m = 20 {\displaystyle m=20} mistakes, the deterministic weighted majority algorithm only guarantees an upper bound of 2.4 ( log 2 10 + 20 ) ≈ 56 {\displaystyle 2.4(\log _{2}10+20)\approx 56} . By the analysis above, it follows that minimizing the number of worst-case expected mistakes is equivalent to minimizing the fun