Randomized weighted majority algorithm

Randomized weighted majority algorithm

The randomized weighted majority algorithm is an algorithm in machine learning theory for aggregating expert predictions to a series of decision problems. It is a simple and effective method based on weighted voting which improves on the mistake bound of the deterministic weighted majority algorithm. In fact, in the limit, its prediction rate can be arbitrarily close to that of the best-predicting expert. == Example == Imagine that every morning before the stock market opens, we get a prediction from each of our "experts" about whether the stock market will go up or down. Our goal is to somehow combine this set of predictions into a single prediction that we then use to make a buy or sell decision for the day. The principal challenge is that we do not know which experts will give better or worse predictions. The RWMA gives us a way to do this combination such that our prediction record will be nearly as good as that of the single expert which, in hindsight, gave the most accurate predictions. == Motivation == In machine learning, the weighted majority algorithm (WMA) is a deterministic meta-learning algorithm for aggregating expert predictions. In pseudocode, the WMA is as follows: initialize all experts to weight 1 for each round: add each expert's weight to the option they predicted predict the option with the largest weighted sum multiply the weights of all experts who predicted wrongly by 1 2 {\displaystyle {\frac {1}{2}}} Suppose there are n {\displaystyle n} experts and the best expert makes m {\displaystyle m} mistakes. Then, the weighted majority algorithm (WMA) makes at most 2.4 ( log 2 ⁡ n + m ) {\displaystyle 2.4(\log _{2}n+m)} mistakes. This bound is highly problematic in the case of highly error-prone experts. Suppose, for example, the best expert makes a mistake 20% of the time; that is, in N = 100 {\displaystyle N=100} rounds using n = 10 {\displaystyle n=10} experts, the best expert makes m = 20 {\displaystyle m=20} mistakes. Then, the weighted majority algorithm only guarantees an upper bound of 2.4 ( log 2 ⁡ 10 + 20 ) ≈ 56 {\displaystyle 2.4(\log _{2}10+20)\approx 56} mistakes. As this is a known limitation of the weighted majority algorithm, various strategies have been explored in order to improve the dependence on m {\displaystyle m} . In particular, we can do better by introducing randomization. Drawing inspiration from the Multiplicative Weights Update Method algorithm, we will probabilistically make predictions based on how the experts have performed in the past. Similarly to the WMA, every time an expert makes a wrong prediction, we will decrement their weight. Mirroring the MWUM, we will then use the weights to make a probability distribution over the actions and draw our action from this distribution (instead of deterministically picking the majority vote as the WMA does). == Randomized weighted majority algorithm (RWMA) == The randomized weighted majority algorithm is an attempt to improve the dependence of the mistake bound of the WMA on m {\displaystyle m} . Instead of predicting based on majority vote, the weights, are used as probabilities for choosing the experts in each round and are updated over time (hence the name randomized weighted majority). Precisely, if w i {\displaystyle w_{i}} is the weight of expert i {\displaystyle i} , let W = ∑ i w i {\displaystyle W=\sum _{i}w_{i}} . We will follow expert i {\displaystyle i} with probability w i W {\displaystyle {\frac {w_{i}}{W}}} . This results in the following algorithm: initialize all experts to weight 1. for each round: add all experts' weights together to obtain the total weight W {\displaystyle W} choose expert i {\displaystyle i} randomly with probability w i W {\displaystyle {\frac {w_{i}}{W}}} predict as the chosen expert predicts multiply the weights of all experts who predicted wrongly by β {\displaystyle \beta } The goal is to bound the worst-case expected number of mistakes, assuming that the adversary has to select one of the answers as correct before we make our coin toss. This is a reasonable assumption in, for instance, the stock market example provided above: the variance of a stock price should not depend on the opinions of experts that influence private buy or sell decisions, so we can treat the price change as if it was decided before the experts gave their recommendations for the day. The randomized algorithm is better in the worst case than the deterministic algorithm (weighted majority algorithm): in the latter, the worst case was when the weights were split 50/50. But in the randomized version, since the weights are used as probabilities, there would still be a 50/50 chance of getting it right. In addition, generalizing to multiplying the weights of the incorrect experts by β < 1 {\displaystyle \beta <1} instead of strictly 1 2 {\displaystyle {\frac {1}{2}}} allows us to trade off between dependence on m {\displaystyle m} and log 2 ⁡ n {\displaystyle \log _{2}n} . This trade-off will be quantified in the analysis section. == Analysis == Let W t {\displaystyle W_{t}} denote the total weight of all experts at round t {\displaystyle t} . Also let F t {\displaystyle F_{t}} denote the fraction of weight placed on experts which predict the wrong answer at round t {\displaystyle t} . Finally, let N {\displaystyle N} be the total number of rounds in the process. By definition, F t {\displaystyle F_{t}} is the probability that the algorithm makes a mistake on round t {\displaystyle t} . It follows from the linearity of expectation that if M {\displaystyle M} denotes the total number of mistakes made during the entire process, E [ M ] = ∑ t = 1 N F t {\displaystyle E[M]=\sum _{t=1}^{N}F_{t}} . After round t {\displaystyle t} , the total weight is decreased by ( 1 − β ) F t W t {\displaystyle \ (1-\beta )F_{t}W_{t}} , since all weights corresponding to a wrong answer are multiplied by β < 1 {\displaystyle \ \beta <1} . It then follows that W t + 1 = W t ( 1 − ( 1 − β ) F t ) {\displaystyle W_{t+1}=W_{t}(1-(1-\beta )F_{t})} . By telescoping, since W 1 = n {\displaystyle W_{1}=n} , it follows that the total weight after the process concludes is On the other hand, suppose that m {\displaystyle \ m} is the number of mistakes made by the best-performing expert. At the end, this expert has weight β m {\displaystyle \ \beta ^{m}} . It follows, then, that the total weight is at least this much; in other words, W ≥ β m {\displaystyle \ W\geq \beta ^{m}} . This inequality and the above result imply Taking the natural logarithm of both sides yields Now, the Taylor series of the natural logarithm is In particular, it follows that ln ⁡ ( 1 − ( 1 − β ) F t ) < − ( 1 − β ) F t {\displaystyle \ \ln(1-(1-\beta )F_{t})<-(1-\beta )F_{t}} . Thus, Recalling that E [ M ] = ∑ t = 1 N F t {\displaystyle E[M]=\sum _{t=1}^{N}F_{t}} and rearranging, it follows that Now, as β → 1 {\displaystyle \beta \to 1} from below, the first constant tends to 1 {\displaystyle 1} ; however, the second constant tends to + ∞ {\displaystyle +\infty } . To quantify this tradeoff, define ε = 1 − β {\displaystyle \varepsilon =1-\beta } to be the penalty associated with getting a prediction wrong. Then, again applying the Taylor series of the natural logarithm, It then follows that the mistake bound, for small ε {\displaystyle \varepsilon } , can be written in the form ( 1 + ϵ 2 + O ( ε 2 ) ) m + ϵ − 1 ln ⁡ ( n ) {\displaystyle \ \left(1+{\frac {\epsilon }{2}}+O(\varepsilon ^{2})\right)m+\epsilon ^{-1}\ln(n)} . In English, the less that we penalize experts for their mistakes, the more that additional experts will lead to initial mistakes but the closer we get to capturing the predictive accuracy of the best expert as time goes on. In particular, given a sufficiently low value of ε {\displaystyle \varepsilon } and enough rounds, the randomized weighted majority algorithm can get arbitrarily close to the correct prediction rate of the best expert. In particular, as long as m {\displaystyle m} is sufficiently large compared to ln ⁡ ( n ) {\displaystyle \ln(n)} (so that their ratio is sufficiently small), we can assign we can obtain an upper bound on the number of mistakes equal to This implies that the "regret bound" on the algorithm (that is, how much worse it performs than the best expert) is sublinear, at O ( m ln ⁡ ( n ) ) {\displaystyle O({\sqrt {m\ln(n)}})} . == Revisiting the motivation == Recall that the motivation for the randomized weighted majority algorithm was given by an example where the best expert makes a mistake 20% of the time. Precisely, in N = 100 {\displaystyle N=100} rounds, with n = 10 {\displaystyle n=10} experts, where the best expert makes m = 20 {\displaystyle m=20} mistakes, the deterministic weighted majority algorithm only guarantees an upper bound of 2.4 ( log 2 ⁡ 10 + 20 ) ≈ 56 {\displaystyle 2.4(\log _{2}10+20)\approx 56} . By the analysis above, it follows that minimizing the number of worst-case expected mistakes is equivalent to minimizing the fun

Version space learning

Version space learning is a logical approach to machine learning, specifically binary classification. Version space learning algorithms search a predefined space of hypotheses, viewed as a set of logical sentences. Formally, the hypothesis space is a disjunction H 1 ∨ H 2 ∨ . . . ∨ H n {\displaystyle H_{1}\lor H_{2}\lor ...\lor H_{n}} (i.e., one or more of hypotheses 1 through n are true). A version space learning algorithm is presented with examples, which it will use to restrict its hypothesis space; for each example x, the hypotheses that are inconsistent with x are removed from the space. This iterative refining of the hypothesis space is called the candidate elimination algorithm, the hypothesis space maintained inside the algorithm, its version space. == The version space algorithm == In settings where there is a generality-ordering on hypotheses, it is possible to represent the version space by two sets of hypotheses: (1) the most specific consistent hypotheses, and (2) the most general consistent hypotheses, where "consistent" indicates agreement with observed data. The most specific hypotheses (i.e., the specific boundary SB) cover the observed positive training examples, and as little of the remaining feature space as possible. These hypotheses, if reduced any further, exclude a positive training example, and hence become inconsistent. These minimal hypotheses essentially constitute a (pessimistic) claim that the true concept is defined just by the positive data already observed: Thus, if a novel (never-before-seen) data point is observed, it should be assumed to be negative. (I.e., if data has not previously been ruled in, then it's ruled out.) The most general hypotheses (i.e., the general boundary GB) cover the observed positive training examples, but also cover as much of the remaining feature space without including any negative training examples. These, if enlarged any further, include a negative training example, and hence become inconsistent. These maximal hypotheses essentially constitute a (optimistic) claim that the true concept is defined just by the negative data already observed: Thus, if a novel (never-before-seen) data point is observed, it should be assumed to be positive. (I.e., if data has not previously been ruled out, then it's ruled in.) Thus, during learning, the version space (which itself is a set – possibly infinite – containing all consistent hypotheses) can be represented by just its lower and upper bounds (maximally general and maximally specific hypothesis sets), and learning operations can be performed just on these representative sets. After learning, classification can be performed on unseen examples by testing the hypothesis learned by the algorithm. If the example is consistent with multiple hypotheses, a majority vote rule can be applied. == Historical background == The notion of version spaces was introduced by Mitchell in the early 1980s as a framework for understanding the basic problem of supervised learning within the context of solution search. Although the basic "candidate elimination" search method that accompanies the version space framework is not a popular learning algorithm, there are some practical implementations that have been developed (e.g., Sverdlik & Reynolds 1992, Hong & Tsang 1997, Dubois & Quafafou 2002). A major drawback of version space learning is its inability to deal with noise: any pair of inconsistent examples can cause the version space to collapse, i.e., become empty, so that classification becomes impossible. One solution of this problem is proposed by Dubois and Quafafou that proposed the Rough Version Space, where rough sets based approximations are used to learn certain and possible hypothesis in the presence of inconsistent data.

Training, validation, and test data sets

In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms function by making data-driven predictions or decisions, through building a mathematical model from input data. These input data used to build the model are usually divided into multiple data sets. In particular, three data sets are commonly used in different stages of the creation of the model: training, validation, and testing sets. The model is initially fit on a training data set, which is a set of examples used to fit the parameters (e.g. weights of connections between neurons in artificial neural networks) of the model. The model (e.g. a naive Bayes classifier) is trained on the training data set using a supervised learning method, for example using optimization methods such as gradient descent or stochastic gradient descent. In practice, the training data set often consists of pairs of an input vector (or scalar) and the corresponding output vector (or scalar), where the answer key is commonly denoted as the target (or label). The current model is run with the training data set and produces a result, which is then compared with the target, for each input vector in the training data set. Based on the result of the comparison and the specific learning algorithm being used, the parameters of the model are adjusted. The model fitting can include both variable selection and parameter estimation. Successively, the fitted model is used to predict the responses for the observations in a second data set called the validation data set. The validation data set provides an unbiased evaluation of a model fit on the training data set while tuning the model's hyperparameters (e.g. the number of hidden units—layers and layer widths—in a neural network). Validation data sets can be used for regularization by early stopping (stopping training when the error on the validation data set increases, as this is a sign of over-fitting to the training data set). This simple procedure is complicated in practice by the fact that the validation data set's error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad-hoc rules for deciding when over-fitting has truly begun. Finally, the test data set is a data set used to provide an unbiased evaluation of a model fit on the training data set. When the data in the test data set has never been used (for example in cross-validation), the test data set is called a holdout data set. The term "validation set" is sometimes used instead of "test set" in some literature (e.g., if the original data set was partitioned into only two subsets, the test set might be referred to as the validation set). Deciding the sizes and strategies for data set division in training, test and validation sets is very dependent on the problem and data available. == Training data set == A training data set is a data set of examples used during the learning process and is used to fit the parameters (e.g., weights) of, for example, a classifier. For classification tasks, a supervised learning algorithm looks at the training data set to determine, or learn, the optimal combinations of variables that will generate a good predictive model. The goal is to produce a trained (fitted) model that generalizes well to new, unknown data. The fitted model is evaluated using “new” examples from the held-out data sets (validation and test data sets) to estimate the model’s accuracy in classifying new data. To reduce the risk of issues such as over-fitting, the examples in the validation and test data sets should not be used to train the model. Most approaches that search through training data for empirical relationships tend to overfit the data, meaning that they can identify and exploit apparent relationships in the training data that do not hold in general. When a training set is continuously expanded with new data, then this is incremental learning. == Validation data set == A validation data set is a data set of examples used to tune the hyperparameters (i.e. the architecture) of a model. It is sometimes also called the development set or the "dev set". An example of a hyperparameter for artificial neural networks includes the number of hidden units in each layer. It, as well as the testing set (as mentioned below), should follow the same probability distribution as the training data set. In order to avoid overfitting, when any classification parameter needs to be adjusted, it is necessary to have a validation data set in addition to the training and test data sets. For example, if the most suitable classifier for the problem is sought, the training data set is used to train the different candidate classifiers, the validation data set is used to compare their performances and decide which one to take and, finally, the test data set is used to obtain the performance characteristics such as accuracy, sensitivity, specificity, F-measure, and so on. The validation data set functions as a hybrid: it is training data used for testing, but neither as part of the low-level training nor as part of the final testing. The basic process of using a validation data set for model selection (as part of training data set, validation data set, and test data set) is: Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set. An application of this process is in early stopping, where the candidate models are successive iterations of the same network, and training stops when the error on the validation set grows, choosing the previous model (the one with minimum error). == Test data set == A test data set is a data set that is independent of the training data set, but that follows the same probability distribution as the training data set. A test set is therefore a set of examples used only to assess the performance (i.e. generalization) of a specified classifier on unseen data. To do this, the model is used to predict classifications of examples in the test set. Those predictions are compared to the examples' true classifications to assess the model's accuracy. If a model fit to the training and validation data set also fits the test data set well, minimal overfitting has taken place (see figure below). A better fitting of the training or validation data sets as opposed to the test data set usually points to overfitting. In the scenario where a data set has a low number of samples, it is usually partitioned into a training set and a validation data set, where the model is trained on the training set and refined using the validation set to improve accuracy, but this approach will lead to overfitting. The holdout method can also be employed, where the test set is used at the end, after training on the training set. Other techniques, such as cross-validation and bootstrapping, are used on small data sets. The bootstrap method generates numerous simulated data sets of the same size by randomly sampling with replacement from the original data, allowing the random data points to serve as test sets for evaluating model performance. Cross-validation splits the data set into multiple folds, with a single sub-fold used as test data; the model is trained on the remaining folds, and all folds are cross-validated (with results averaged and models consolidated) to estimate final model performance. Note that some sources advise against using a single split, as it can lead to overfitting as well as biased model performance estimates. For this reason, data sets are split into three partitions: training, validation and test data sets. The standard machine learning practice is to train on the training set and tune hyperparameters using the validation set, where the validation process selects the model with the lowest validation loss, which is then tested on the test data set (normally held out) to assess the final model. The holdout method for the test set reduces computation by avoiding using the test set after each epoch. The test data set should never be used for validating the training model or fine-tuning hyperparameters, as it provides an accurate and honest evaluation of the model's final performance on unseen dat

Nearest centroid classifier

In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation. When applied to text classification using word vectors containing tfidf weights to represent documents, the nearest centroid classifier is known as the Rocchio classifier because of its similarity to the Rocchio algorithm for relevance feedback. An extended version of the nearest centroid classifier has found applications in the medical domain, specifically classification of tumors. == Algorithm == === Training === Given labeled training samples { ( x → 1 , y 1 ) , … , ( x → n , y n ) } {\displaystyle \textstyle \{({\vec {x}}_{1},y_{1}),\dots ,({\vec {x}}_{n},y_{n})\}} with class labels y i ∈ Y {\displaystyle y_{i}\in \mathbf {Y} } , compute the per-class centroids μ → ℓ = 1 | C ℓ | ∑ i ∈ C ℓ x → i {\displaystyle \textstyle {\vec {\mu }}_{\ell }={\frac {1}{|C_{\ell }|}}{\underset {i\in C_{\ell }}{\sum }}{\vec {x}}_{i}} where C ℓ {\displaystyle C_{\ell }} is the set of indices of samples belonging to class ℓ ∈ Y {\displaystyle \ell \in \mathbf {Y} } . === Prediction === The class assigned to an observation x → {\displaystyle {\vec {x}}} is y ^ = arg ⁡ min ℓ ∈ Y ‖ μ → ℓ − x → ‖ {\displaystyle {\hat {y}}={\arg \min }_{\ell \in \mathbf {Y} }\|{\vec {\mu }}_{\ell }-{\vec {x}}\|} .

Sum of absolute transformed differences

The sum of absolute transformed differences (SATD) is a block matching criterion widely used in fractional motion estimation for video compression. It works by taking a frequency transform, usually a Hadamard transform, of the differences between the pixels in the original block and the corresponding pixels in the block being used for comparison. The transform itself is often of a small block rather than the entire macroblock. For example, in x264, a series of 4×4 blocks are transformed rather than doing the more processor-intensive 16×16 transform. == Comparison to other metrics == SATD is slower than the sum of absolute differences (SAD), both due to its increased complexity and the fact that SAD-specific MMX and SSE2 instructions exist, while there are no such instructions for SATD. However, SATD can still be optimized considerably with SIMD instructions on most modern CPUs. The benefit of SATD is that it more accurately models the number of bits required to transmit the residual error signal. As such, it is often used in video compressors, either as a way to drive and estimate rate explicitly, such as in the Theora encoder (since 1.1 alpha2), as an optional metric used in wide motion searches, such as in the Microsoft VC-1 encoder, or as a metric used in sub-pixel refinement, such as in x264.

Automatic acquisition of sense-tagged corpora

The knowledge acquisition bottleneck is perhaps the major impediment to solving the word-sense disambiguation (WSD) problem. Unsupervised learning methods rely on knowledge about word senses, which is barely formulated in dictionaries and lexical databases. Supervised learning methods depend heavily on the existence of manually annotated examples for every word sense, a requisite that can so far be met only for a handful of words for testing purposes, as it is done in the Senseval exercises. == Existing methods == Therefore, one of the most promising trends in WSD research is using the largest corpus ever accessible, the World Wide Web, to acquire lexical information automatically. WSD has been traditionally understood as an intermediate language engineering technology which could improve applications such as information retrieval (IR). In this case, however, the reverse is also true: Web search engines implement simple and robust IR techniques that can be successfully used when mining the Web for information to be employed in WSD. The most direct way of using the Web (and other corpora) to enhance WSD performance is the automatic acquisition of sense-tagged corpora, the fundamental resource to feed supervised WSD algorithms. Although this is far from being commonplace in the WSD literature, a number of different and effective strategies to achieve this goal have already been proposed. Some of these strategies are: acquisition by direct Web searching (searches for monosemous synonyms, hypernyms, hyponyms, parsed gloss' words, etc.), Yarowsky algorithm (bootstrapping), acquisition via Web directories, and acquisition via cross-language meaning evidences. == Summary == === Optimistic results === The automatic extraction of examples to train supervised learning algorithms reviewed has been, by far, the best explored approach to mine the web for word-sense disambiguation. Some results are certainly encouraging: In some experiments, the quality of the Web data for WSD equals that of human-tagged examples. This is the case of the monosemous relatives plus bootstrapping with Semcor seeds technique and the examples taken from the ODP Web directories. In the first case, however, Semcor-size example seeds are necessary (and only available for English), and it has only been tested with a very limited set of nouns; in the second case, the coverage is quite limited, and it is not yet clear whether it can be grown without compromising the quality of the examples retrieved. It has been shown that a mainstream supervised learning technique trained exclusively with web data can obtain better results than all unsupervised WSD systems which participated at Senseval-2. Web examples made a significant contribution to the best Senseval-2 English all-words system. === Difficulties === There are, however, several open research issues related to the use of Web examples in WSD: High precision in the retrieved examples (i.e., correct sense assignments for the examples) does not necessarily lead to good supervised WSD results (i.e., the examples are possibly not useful for training). The most complete evaluation of Web examples for supervised WSD indicates that learning with Web data improves over unsupervised techniques, but the results are nevertheless far from those obtained with hand-tagged data, and do not even beat the most-frequent-sense baseline. Results are not always reproducible; the same or similar techniques may lead to different results in different experiments. Compare, for instance, Mihalcea (2002) with Agirre and Martínez (2004), or Agirre and Martínez (2000) with Mihalcea and Moldovan (1999). Results with Web data seem to be very sensitive to small differences in the learning algorithm, to when the corpus was extracted (search engines change continuously), and on small heuristic issues (e.g., differences in filters to discard part of the retrieved examples). Results are strongly dependent on bias (i.e., on the relative frequencies of examples per word sense). It is unclear whether this is simply a problem of Web data, or an intrinsic problem of supervised learning techniques, or just a problem of how WSD systems are evaluated (indeed, testing with rather small Senseval data may overemphasize sense distributions compared to sense distributions obtained from the full Web as corpus). In any case, Web data has an intrinsic bias, because queries to search engines directly constrain the context of the examples retrieved. There are approaches that alleviate this problem, such as using several different seeds/queries per sense or assigning senses to Web directories and then scanning directories for examples; but this problem is nevertheless far from being solved. Once a Web corpus of examples is built, it is not entirely clear whether its distribution is safe from a legal perspective. === Future === Besides automatic acquisition of examples from the Web, there are some other WSD experiments that have profited from the Web: The Web as a social network has been successfully used for cooperative annotation of a corpus (OMWE, Open Mind Word Expert project), which has already been used in three Senseval-3 tasks (English, Romanian and Multilingual). The Web has been used to enrich WordNet senses with domain information: topic signatures and Web directories, which have in turn been successfully used for WSD. Also, some research benefited from the semantic information that the Wikipedia maintains on its disambiguation pages. It is clear, however, that most research opportunities remain largely unexplored. For instance, little is known about how to use lexical information extracted from the Web in knowledge-based WSD systems; and it is also hard to find systems that use Web-mined parallel corpora for WSD, even though there are already efficient algorithms that use parallel corpora in WSD.

Common Voice

Common Voice is a crowdsourcing project started by Mozilla to create a free and open speech corpus. The project is supported by volunteers who record sample sentences with a microphone and review recordings of other users. The transcribed sentences are collected in a voice database available under the public domain license CC0. This license ensures that developers can use the database for voice-to-text and text-to-voice applications without restrictions or costs. == Aims == Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents. == Voice database == The first dataset was released in November 2017. More than 20,000 users worldwide had recorded 500 hours of English sentences. In February 2019, the first batch of languages was released for use. This included 18 languages such as English, French, German and Mandarin Chinese, but also less prevalent languages like Welsh and Kabyle. In total, this included almost 1,400 hours of recorded voice data from more than 42,000 contributors. By July 2020 the database had amassed 7,226 hours of voice recordings in 54 languages, 5,591 hours of which had been verified by volunteers. In May 2021, following the work to add Kinyarwanda, the project received a grant to add Kiswahili. At the beginning of 2022, Bengali.AI partnered with Common Voice to launch the "Bangla Speech Recognition" project that aims to make machines understand the Bangla language. 2000 hours of voice was collected. In September 2022, it was announced that the Twi language of Ghana was the 100th language to be added to the database. As of December 2025, Mozilla Common Voice collects voice data for over 250 languages, with the most hours having been collected in English, Catalan, Kinyarwanda, Belarusian and Esperanto.