The Santa Fe Trail problem is a genetic programming exercise in which artificial ants search for food pellets according to a programmed set of instructions. The layout of food pellets in the Santa Fe Trail problem has become a standard for comparing different genetic programming algorithms and solutions. One method for programming and testing algorithms on the Santa Fe Trail problem is by using the NetLogo application. There is at least one case of a student creating a Lego robotic ant to solve the problem.
Computational humor
Computational humor is a branch of computational linguistics and artificial intelligence which uses computers in humor research. It is a relatively new area, with the first dedicated conference organized in 1996. The first "computer model of a sense of humor" was suggested by Suslov as early as 1992. Investigation of the general scheme of the information processing show a possibility of a specific malfunction, conditioned by the necessity of a quick deletion from consciousness of a false version. This specific malfunction can be identified with a humorous effect on the psychological grounds; however, an essentially new ingredient, a role of timing, is added to a well known role of ambiguity. In biological systems, a sense of humour inevitably develops in the course of evolution, because its biological function consists in quickening the transmission of processed information into consciousness and in a more effective use of brain resources. A realization of this algorithm in neural networks explains naturally the mechanism of laughter: deletion of a false version corresponds to zeroing of some part of the neural network and excessive energy of neurons is thrown out to the motor cortex, arousing muscular contractions. Unfortunately, a practical realization of this algorithm needs extensive databases, whose creation in the automatic regime was suggested only recently . As a result, this magistral direction was not developed properly and subsequent investigations (see below) accepted somewhat specialized colouring. == Joke generators == === Pun generation === An approach to analysis of humor is classification of jokes. A further step is an attempt to generate jokes basing on the rules that underlie classification. Simple prototypes for computer pun generation were reported in the early 1990s, based on a natural language generator program, VINCI. Graeme Ritchie and Kim Binsted in their 1994 research paper described a computer program, JAPE, designed to generate question-answer-type puns from a general, i.e., non-humorous, lexicon. (The program name is an acronym for "Joke Analysis and Production Engine".) Some examples produced by JAPE are: Q: What is the difference between leaves and a car? A: One you brush and rake, the other you rush and brake. Q: What do you call a strange market? A: A bizarre bazaar. Since then the approach has been improved, and the latest report, dated 2007, describes the STANDUP joke generator, implemented in the Java programming language. The STANDUP generator was tested on children within the framework of analyzing its usability for language skills development for children with communication disabilities, e.g., because of cerebral palsy. (The project name is an acronym for "System To Augment Non-speakers' Dialog Using Puns" and an allusion to standup comedy.) Children responded to this "language playground" with enthusiasm, and showed marked improvement on certain types of language tests. The two young people, who used the system over a ten-week period, regaled their peers, staff, family and neighbors with jokes such as: "What do you call a spicy missile? A hot shot!" Their joy and enthusiasm at entertaining others was inspirational. === Other === Stock and Strapparava described a program to generate funny acronyms. == Joke recognition == A statistical machine learning algorithm to detect whether a sentence contained a "That's what she said" double entendre was developed by Kiddon and Brun (2011). There is an open-source Python implementation of Kiddon & Brun's TWSS system. A program to recognize knock-knock jokes was reported by Taylor and Mazlack. This kind of research is important in analysis of human–computer interaction. An application of machine learning techniques for the distinguishing of joke texts from non-jokes was described by Mihalcea and Strapparava (2006). Takizawa et al. (1996) reported on a heuristic program for detecting puns in the Japanese language. == Applications == A possible application for assistance in language acquisition is described in the section "Pun generation". Another envisioned use of joke generators is in cases of a steady supply of jokes where quantity is more important than quality. Another obvious, yet remote, direction is automated joke appreciation. It is known that humans interact with computers in ways similar to interacting with other humans that may be described in terms of personality, politeness, flattery, and in-group favoritism. Therefore, the role of humor in human–computer interaction is being investigated. In particular, humor generation in user interface to ease communications with computers was suggested. Craig McDonough implemented the Mnemonic Sentence Generator, which converts passwords into humorous sentences. Based on the incongruity theory of humor, it is suggested that the resulting meaningless but funny sentences are easier to remember. For example, the password AjQA3Jtv is converted into "Arafat joined Quayle's Ant, while TARAR Jeopardized thurmond's vase," an example chosen by combining politicians names with verbs and common nouns. == Related research == John Allen Paulos is known for his interest in mathematical foundations of humor. His book Mathematics and Humor: A Study of the Logic of Humor demonstrates structures common to humor and formal sciences (mathematics, linguistics) and develops a mathematical model of jokes based on catastrophe theory. Conversational systems which have been designed to take part in Turing test competitions generally have the ability to learn humorous anecdotes and jokes. Because many people regard humor as something particular to humans, its appearance in conversation can be quite useful in convincing a human interrogator that a hidden entity, which could be a machine or a human, is in fact a human.
TabPFN
TabPFN (Tabular Prior-data Fitted Network) is a machine learning model for tabular datasets proposed in 2022. It uses a transformer architecture. It is intended for supervised classification and regression analysis on tabular datasets, particularly focusing on small- to medium-sized datasets. The latest version, TabPFN-3, was released in May 2026 and supports datasets with up to one million rows and 200 features. == History == TabPFN was first introduced in a 2022 pre-print and presented at ICLR 2023. TabPFN v2 was published in 2025 in Nature by Hollmann and co-authors. The source code is published on GitHub under a modified Apache License and on PyPi. Writing for ICLR blogs, McCarter states that the model has attracted attention due to its performance on small dataset benchmarks. TabPFN v2.5 was released on November 6, 2025. TabPFN-3 was released on May 12, 2026. Prior Labs, founded in 2024, aims to commercialize TabPFN. As of April 2026, the open-source TabPFN repository had more than 6,000 stars on GitHub. == Overview and pre-training == TabPFN supports classification, regression and generative tasks. It leverages "Prior-Data Fitted Networks" models to model tabular data. By using a transformer pre-trained on synthetic tabular datasets, TabPFN avoids benchmark contamination and costs of curating real-world data. TabPFN v2 was pre-trained on approximately 130 million such datasets. Synthetic datasets are generated using causal models or Bayesian neural networks; this can include simulating missing values, imbalanced data, and noise. Random inputs are passed through these models to generate outputs, with a bias towards simpler causal structures. During pre-training, TabPFN predicts the masked target values of new data points given training data points and their known targets, effectively learning a generic learning algorithm that is executed by running a neural network forward pass. The new dataset is then processed in a single forward pass without retraining. The model's transformer encoder processes features and labels by alternating attention across rows and columns. TabPFN v2 handles numerical and categorical features, missing values, and supports tasks like regression and synthetic data generation, while TabPFN-2.5 scales this approach to datasets with up to 50,000 rows and 2,000 features. TabPFN-3 introduced a redesigned architecture with row-compression, an attention-based many-class decoder, native missing-value handling, and inference optimizations such as row chunking and a reduced key-value cache, with benchmark-validated regimes of up to 1 million rows with 200 features, 100,000 rows with 2,000 features, or 1,000 rows with 20,000 features. Since TabPFN is pre-trained, in contrast to other deep learning methods, it does not require costly hyperparameter optimization. == Research == TabPFN is the subject of on-going research. Applications for TabPFN have been investigated for domains such as chemoproteomics, insurance risk classification, and metagenomics. In clinical research, TabPFN was used in a study on the early detection of pancreatic cancer from blood samples, where it was combined with metabolomic data and reported high diagnostic performance. == Applications == TabPFN has been used in industrial and biomedical contexts. Hitachi Ltd. has been reported to use the model for predictive maintenance in rail networks, with its use described as helping to identify track issues earlier and reduce manual inspections. In the biomedical domain, Oxford Cancer Analytics has used TabPFN in the analysis of proteomic data in lung disease research. A 2025 ML Contests report noted that the winners of DrivenData's PREPARE challenge used TabPFN to generate features for gradient-boosted decision tree models. == Limitations == TabPFN has been criticized for its "one large neural network is all you need" approach to modeling problems. Further, its performance is limited in high-dimensional and large-scale datasets. == Scaling Mode == In late November 2025, Prior Labs introduced ‘‘Scaling Mode’’, an operating mode for TabPFN designed to remove the fixed upper bound on training set size. Earlier versions of TabPFN had been optimized and validated primarily for datasets of up to 100,000 rows, whereas Scaling Mode was reported to extend support to substantially larger datasets, with benchmarked experiments on datasets containing up to 10 million rows. According to Prior Labs, Scaling Mode preserves the existing TabPFN architecture, including its alternating row-attention and column-attention design, as well as the same feature-count limits as prior releases. Inference remains based on a single forward pass rather than dataset-specific gradient-descent training, while scalability is described as being constrained primarily by available compute and memory resources. Prior Labs reported benchmark results on an internal collection of datasets ranging from 1 million to 10 million rows across industry and scientific applications. In these benchmarks, Scaling Mode was compared with CatBoost, XGBoost, LightGBM, and TabPFN 2.5 using 50,000-row subsampling. The company stated that predictive performance improved monotonically with increasing training set size and that no diminishing returns from scaling were observed within the tested range. Prior Labs also announced the release through company and executive social media channels. TabPFN-3 later incorporated related scaling improvements, including row chunking and a reduced key-value cache, into the model architecture and inference pipeline.
Naive Bayes classifier
In statistics, naive (sometimes simple or idiot's) Bayes classifiers are a family of "probabilistic classifiers" which assume that the features are conditionally independent, given the target class. In other words, a naive Bayes model assumes the information about the class provided by each variable is unrelated to the information from the others, with no information shared between the predictors. The highly unrealistic nature of this assumption, called the naive independence assumption, is what gives the classifier its name. These classifiers are some of the simplest Bayesian network models. Naive Bayes classifiers generally perform worse than more advanced models like logistic regressions, especially at quantifying uncertainty (with naive Bayes models often producing wildly overconfident probabilities). However, they are highly scalable, requiring only one parameter for each feature or predictor in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression (simply by counting observations in each group), rather than the expensive iterative approximation algorithms required by most other models. Despite the use of Bayes' theorem in the classifier's decision rule, naive Bayes is not (necessarily) a Bayesian method, and naive Bayes models can be fit to data using either Bayesian or frequentist methods. == Introduction == Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches, such as boosted trees or random forests. An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification. == Probabilistic model == Abstractly, naive Bayes is a conditional probability model: it assigns probabilities p ( C k ∣ x 1 , … , x n ) {\displaystyle p(C_{k}\mid x_{1},\ldots ,x_{n})} for each of the K possible outcomes or classes C k {\displaystyle C_{k}} given a problem instance to be classified, represented by a vector x = ( x 1 , … , x n ) {\displaystyle \mathbf {x} =(x_{1},\ldots ,x_{n})} encoding some n features (independent variables). The problem with the above formulation is that if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. The model must therefore be reformulated to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as: p ( C k ∣ x ) = p ( C k ) p ( x ∣ C k ) p ( x ) {\displaystyle p(C_{k}\mid \mathbf {x} )={\frac {p(C_{k})\ p(\mathbf {x} \mid C_{k})}{p(\mathbf {x} )}}\,} In plain English, using Bayesian probability terminology, the above equation can be written as posterior = prior × likelihood evidence {\displaystyle {\text{posterior}}={\frac {{\text{prior}}\times {\text{likelihood}}}{\text{evidence}}}\,} In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on C {\displaystyle C} and the values of the features x i {\displaystyle x_{i}} are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model p ( C k , x 1 , … , x n ) {\displaystyle p(C_{k},x_{1},\ldots ,x_{n})\,} which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability: p ( C k , x 1 , … , x n ) = p ( x 1 , … , x n , C k ) = p ( x 1 ∣ x 2 , … , x n , C k ) p ( x 2 , … , x n , C k ) = p ( x 1 ∣ x 2 , … , x n , C k ) p ( x 2 ∣ x 3 , … , x n , C k ) p ( x 3 , … , x n , C k ) = ⋯ = p ( x 1 ∣ x 2 , … , x n , C k ) p ( x 2 ∣ x 3 , … , x n , C k ) ⋯ p ( x n − 1 ∣ x n , C k ) p ( x n ∣ C k ) p ( C k ) {\displaystyle {\begin{aligned}p(C_{k},x_{1},\ldots ,x_{n})&=p(x_{1},\ldots ,x_{n},C_{k})\\&=p(x_{1}\mid x_{2},\ldots ,x_{n},C_{k})\ p(x_{2},\ldots ,x_{n},C_{k})\\&=p(x_{1}\mid x_{2},\ldots ,x_{n},C_{k})\ p(x_{2}\mid x_{3},\ldots ,x_{n},C_{k})\ p(x_{3},\ldots ,x_{n},C_{k})\\&=\cdots \\&=p(x_{1}\mid x_{2},\ldots ,x_{n},C_{k})\ p(x_{2}\mid x_{3},\ldots ,x_{n},C_{k})\cdots p(x_{n-1}\mid x_{n},C_{k})\ p(x_{n}\mid C_{k})\ p(C_{k})\\\end{aligned}}} Now the "naive" conditional independence assumptions come into play: assume that all features in x {\displaystyle \mathbf {x} } are mutually independent, conditional on the category C k {\displaystyle C_{k}} . Under this assumption, p ( x i ∣ x i + 1 , … , x n , C k ) = p ( x i ∣ C k ) . {\displaystyle p(x_{i}\mid x_{i+1},\ldots ,x_{n},C_{k})=p(x_{i}\mid C_{k})\,.} Thus, the joint model can be expressed as p ( C k ∣ x 1 , … , x n ) ∝ p ( C k , x 1 , … , x n ) = p ( C k ) p ( x 1 ∣ C k ) p ( x 2 ∣ C k ) p ( x 3 ∣ C k ) ⋯ = p ( C k ) ∏ i = 1 n p ( x i ∣ C k ) , {\displaystyle {\begin{aligned}p(C_{k}\mid x_{1},\ldots ,x_{n})\varpropto \ &p(C_{k},x_{1},\ldots ,x_{n})\\&=p(C_{k})\ p(x_{1}\mid C_{k})\ p(x_{2}\mid C_{k})\ p(x_{3}\mid C_{k})\ \cdots \\&=p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})\,,\end{aligned}}} where ∝ {\displaystyle \varpropto } denotes proportionality since the denominator p ( x ) {\displaystyle p(\mathbf {x} )} is omitted. This means that under the above independence assumptions, the conditional distribution over the class variable C {\displaystyle C} is: p ( C k ∣ x 1 , … , x n ) = 1 Z p ( C k ) ∏ i = 1 n p ( x i ∣ C k ) {\displaystyle p(C_{k}\mid x_{1},\ldots ,x_{n})={\frac {1}{Z}}\ p(C_{k})\prod _{i=1}^{n}p(x_{i}\mid C_{k})} where the evidence Z = p ( x ) = ∑ k p ( C k ) p ( x ∣ C k ) {\displaystyle Z=p(\mathbf {x} )=\sum _{k}p(C_{k})\ p(\mathbf {x} \mid C_{k})} is a scaling factor dependent only on x 1 , … , x n {\displaystyle x_{1},\ldots ,x_{n}} , that is, a constant if the values of the feature variables are known. Often, it is only necessary to discriminate between classes. In that case, the scaling factor is irrelevant, and it is sufficient to calculate the log-probability up to a factor: ln p ( C k ∣ x 1 , … , x n ) = ln p ( C k ) + ∑ i = 1 n ln p ( x i ∣ C k ) − ln Z ⏟ irrelevant {\displaystyle \ln p(C_{k}\mid x_{1},\ldots ,x_{n})=\ln p(C_{k})+\sum _{i=1}^{n}\ln p(x_{i}\mid C_{k})\underbrace {-\ln Z} _{\text{irrelevant}}} The scaling factor is irrelevant, since discrimination subtracts it away: ln p ( C k ∣ x 1 , … , x n ) p ( C l ∣ x 1 , … , x n ) = ( ln p ( C k ) + ∑ i = 1 n ln p ( x i ∣ C k ) ) − ( ln p ( C l ) + ∑ i = 1 n ln p ( x i ∣ C l ) ) {\displaystyle \ln {\frac {p(C_{k}\mid x_{1},\ldots ,x_{n})}{p(C_{l}\mid x_{1},\ldots ,x_{n})}}=\left(\ln p(C_{k})+\sum _{i=1}^{n}\ln p(x_{i}\mid C_{k})\right)-\left(\ln p(C_{l})+\sum _{i=1}^{n}\ln p(x_{i}\mid C_{l})\right)} There are two benefits of using log-probability. One is that it allows an interpretation in information theory, where log-probabilities are units of information in nats. Another is that it avoids arithmetic underflow. === Constructing a classifier from the probability model === The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable so as to minimize the probability of misclassification; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label y ^ = C k {\displaystyle {\hat {y}}=C_{k}} for some k as follows: y ^ = argmax k ∈ { 1 , … , K } p ( C k ) ∏ i = 1 n p ( x i ∣ C k ) . {\displaystyle {\hat {y}}={\underset {k\in \{1,\ldots ,K\}}{\operatorname {argmax} }}\ p(C_{k})\displays
Bootstrap aggregating
Bootstrap aggregating, also called bagging (from bootstrap aggregating) or bootstrapping, is a machine learning (ML) ensemble meta-algorithm designed to improve the stability and accuracy of ML classification and regression algorithms. It also reduces variance and overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the ensemble averaging approach. == Description of the technique == Given a standard training set D {\displaystyle D} of size n {\displaystyle n} , bagging generates m {\displaystyle m} new training sets D i {\displaystyle D_{i}} , each of size n ′ {\displaystyle n'} , by sampling from D {\displaystyle D} uniformly and with replacement. By sampling with replacement, some observations may be repeated in each D i {\displaystyle D_{i}} . If n ′ = n {\displaystyle n'=n} , then for large n {\displaystyle n} the set D i {\displaystyle D_{i}} is expected to have the fraction (1 - 1/e) (~63.2%) of the unique samples of D {\displaystyle D} , the rest being duplicates. This kind of sample is known as a bootstrap sample. Sampling with replacement ensures each bootstrap is independent from its peers, as it does not depend on previous chosen samples when sampling. Then, m {\displaystyle m} models are fitted using the above bootstrap samples and combined by averaging the output (for regression) or voting (for classification). Bagging leads to "improvements for unstable procedures", which include, for example, artificial neural networks, classification and regression trees, and subset selection in linear regression. Bagging was shown to improve preimage learning. On the other hand, it can mildly degrade the performance of stable methods such as k-nearest neighbors. == Process of the algorithm == === Key Terms === There are three types of datasets in bootstrap aggregating. These are the original, bootstrap, and out-of-bag datasets. Each section below will explain how each dataset is made except for the original dataset. The original dataset is whatever information is given. === Creating the bootstrap dataset === The bootstrap dataset is made by randomly picking objects from the original dataset. Also, it must be the same size as the original dataset. However, the difference is that the bootstrap dataset can have duplicate objects. Here is a simple example to demonstrate how it works along with the illustration below: Suppose the original dataset is a group of 12 people. Their names are Emily, Jessie, George, Constantine, Lexi, Theodore, John, James, Rachel, Anthony, Ellie, and Jamal. By randomly picking a group of names, let us say our bootstrap dataset had James, Ellie, Constantine, Lexi, John, Constantine, Theodore, Constantine, Anthony, Lexi, Constantine, and Theodore. In this case, the bootstrap sample contained four duplicates for Constantine, and two duplicates for Lexi, and Theodore. === Creating the out-of-bag dataset === The out-of-bag dataset represents the remaining people who were not in the bootstrap dataset. It can be calculated by taking the difference between the original and the bootstrap datasets. In this case, the remaining samples who were not selected are Emily, Jessie, George, Rachel, and Jamal. Keep in mind that since both datasets are sets, when taking the difference the duplicate names are ignored in the bootstrap dataset. The illustration below shows how the math is done: === Application === Creating the bootstrap and out-of-bag datasets is crucial since it is used to test the accuracy of ensemble learning algorithms like random forest. For example, a model that produces 50 trees using the bootstrap/out-of-bag datasets will have a better accuracy than if it produced 10 trees. Since the algorithm generates multiple trees and therefore multiple datasets the chance that an object is left out of the bootstrap dataset is low. The next few sections talk about how the random forest algorithm works in more detail. === Creation of Decision Trees === The next step of the algorithm involves the generation of decision trees from the bootstrapped dataset. To achieve this, the process examines each gene/feature and determines for how many samples the feature's presence or absence yields a positive or negative result. This information is then used to compute a confusion matrix, which lists the true positives, false positives, true negatives, and false negatives of the feature when used as a classifier. These features are then ranked according to various classification metrics based on their confusion matrices. Some common metrics include estimate of positive correctness (calculated by subtracting false positives from true positives), measure of "goodness", and information gain. These features are then used to partition the samples into two sets: those that possess the top feature, and those that do not. The diagram below shows a decision tree of depth two being used to classify data. For example, a data point that exhibits Feature 1, but not Feature 2, will be given a "No". Another point that does not exhibit Feature 1, but does exhibit Feature 3, will be given a "Yes". This process is repeated recursively for successive levels of the tree until the desired depth is reached. At the very bottom of the tree, samples that test positive for the final feature are generally classified as positive, while those that lack the feature are classified as negative. These trees are then used as predictors to classify new data. === Random Forests === The next part of the algorithm involves introducing yet another element of variability amongst the bootstrapped trees. In addition to each tree only examining a bootstrapped set of samples, only a small but consistent number of unique features are considered when ranking them as classifiers. This means that each tree only knows about the data pertaining to a small constant number of features, and a variable number of samples that is less than or equal to that of the original dataset. Consequently, the trees are more likely to return a wider array of answers, derived from more diverse knowledge. This results in a random forest, which possesses numerous benefits over a single decision tree generated without randomness. In a random forest, each tree "votes" on whether or not to classify a sample as positive based on its features. The sample is then classified based on majority vote. An example of this is given in the diagram below, where the four trees in a random forest vote on whether or not a patient with mutations A, B, F, and G has cancer. Since three out of four trees vote yes, the patient is then classified as cancer positive. Because of their properties, random forests are considered one of the most accurate data mining algorithms, are less likely to overfit their data, and run quickly and efficiently even for large datasets. They are primarily useful for classification as opposed to regression, which attempts to draw observed connections between statistical variables in a dataset. This makes random forests particularly useful in such fields as banking, healthcare, the stock market, and e-commerce where it is important to be able to predict future results based on past data. One of their applications would be as a useful tool for predicting cancer based on genetic factors, as seen in the above example. There are several important factors to consider when designing a random forest. If the trees in the random forests are too deep, overfitting can still occur due to over-specificity. If the forest is too large, the algorithm may become less efficient due to an increased runtime. Random forests also do not generally perform well when given sparse data with little variability. However, they still have numerous advantages over similar data classification algorithms such as neural networks, as they are much easier to interpret and generally require less data for training. As an integral component of random forests, bootstrap aggregating is very important to classification algorithms, and provides a critical element of variability that allows for increased accuracy when analyzing new data, as discussed below. == Improving Random Forests and Bagging == While the techniques described above utilize random forests and bagging (otherwise known as bootstrapping), there are certain techniques that can be used in order to improve their execution and voting time, their prediction accuracy, and their overall performance. The following are key steps in creating an efficient random forest: Specify the maximum depth of trees: Instead of allowing the random forest to continue until all nodes are pure, it is better to cut it off at a certain point in order to further decrease chances of overfitting. Prune the dataset: Using an extremely large dataset may create results that are less indicative of the data provided than a smaller set that more accurately represents what is being focused on. Continue pruning the data at each
Rhetorical structure theory
Rhetorical structure theory (RST) is a theory of text organization that describes relations that hold between parts of text. It was originally developed by William Mann, Sandra Thompson, Christian M. I. M. Matthiessen and others at the University of Southern California's Information Sciences Institute (ISI) and defined in a 1988 paper. The theory was developed as part of studies of computer-based text generation. Natural language processing researchers later began using RST in automatic summarization and other applications. It explains coherence by postulating a hierarchical, connected structure of texts, which are labeled using a small, predefined inventory of relation types - for example, one part of a text may provide an elaboration on another part, provide background or specify a cause for another. In the 2000s, following the release of the first large-scale dataset implementing the theory, the RST Discourse Treebank (RST-DT), Daniel Marcu demonstrated the feasibility of practical applications of RST to discourse parsing and summarization at ISI. Originally limited to written text, subsequent work in the 2010s expanded RST to spoken language analysis, and the framework has been applied to a variety of languages including Farsi, German, Mandarin Chinese, Russian and Spanish. Following the introduction of Transformers, LLMs have been applied to automatic RST parsing, with results approaching human performance on parsing text in English. == Rhetorical relations == Rhetorical relations, also called coherence or discourse relations, are paratactic (coordinate) or hypotactic (subordinate) relations that hold across two or more text spans. The logical arrangement of relations in a text contributes to its coherence by connecting different propositions in a relational structure. RST using rhetorical relations provides a systematic way for an analyst to analyze the underlying intention of a text. The analysis is usually built by reading the text and constructing a tree using the relations. The following example is a title and summary, appearing at the top of an article in Scientific American magazine (adapted from Ramachandran and Anstis, 1986). The original text, broken into numbered units, is: [Title:] The Perception of Apparent Motion [Abstract:] When the motion of an intermittently seen object is ambiguous the visual system resolves confusion by applying some tricks that reflect a built-in knowledge of properties of the physical world. In the figure, the numbers 1-5 show the corresponding units from the text above. Unit 5 provides an "elaboration" on unit 4, and therefore constitutes a less prominent satellite of unit 4, which acts as a nucleus for the relation. Units 4-5 form a relation "Means", explaining the means by which the visual system resolves confusion. Unit 3 is the Central Discourse Unit (CDU) of the text, since all units point to it directly or indirectly. Similarly units 1 and 2 form "preparation" and "circumstance" relations relative to their nuclei. Groups of units which serve as a satellite or nucleus together are called complex discourse units, and always span a set of adjacent EDUs. == Nuclearity in discourse == RST establishes two different types of units. Nuclei are considered as the most important parts of text whereas satellites contribute to the nuclei and are secondary. Nucleus contains basic information and satellite contains additional information about nucleus. The satellite is often incomprehensible without nucleus, whereas a text where satellites have been deleted can be understood to a certain extent. == Hierarchy in the analysis == RST relations are applied recursively in a text, until all units in that text are constituents in an RST relation. The result of such analyses is that RST structure are typically represented as trees, with one top level relation that encompasses other relations at lower levels. == Why RST? == From linguistic point of view, RST proposes a different view of text organization than most linguistic theories. RST points to a tight relation between relations and coherence in text From a computational point of view, it provides a characterization of text relations that has been implemented in different systems and for applications as text generation and summarization. == In design rationale == Computer scientists Ana Cristina Bicharra Garcia and Clarisse Sieckenius de Souz have used RST as the basis of a design rationale system called ADD+. In ADD+, RST is used as the basis for the rhetorical organization of a knowledge base, in a way comparable to other knowledge representation systems such as issue-based information system (IBIS). Similarly, RST has been used in representation schemes for argumentation.
Bioz
Bioz is a search engine for life science experimentation. == History == Bioz was founded by Karin Lachmi and Daniel Levitt. Lachmi is a scientist who completed her postdoc in molecular and cellular biology at the Stanford University School of Medicine. During her lab work she found little available data regarding preferable lab tools, reagents and related products for experimentation. There are 50,000 vendors selling 300 million scientific products. She decided to start the company in order to provide researchers with adequate information for that purpose. Co-founder Daniel Levitt is an entrepreneur who sold his company WebAppoint to Microsoft in the year 2000. He also co-founded the company StemRad. At Bioz, Lachmi serves as the Chief Scientific Officer and Levitt serves as the chief executive officer. Bioz claims to have over a million researcher-users from 196 countries. Among the investors are Esther Dyson and the Stanford-StartX Fund. The company's advisory board includes Nobel Laureates in Chemistry Michael Levitt, Roger Kornberg, and Ada Yonath. == Technology == The company uses artificial intelligence, machine learning and natural language processing in order to extract experimentation data from scientific articles, such as the products that researchers used, the companies that supply the products, the protocol conditions that researchers selected, and the types of experiments and techniques. The algorithm ranks products based on how frequently they were used by researchers in their experiments, how recently a product was used, and the impact factor of the journal. The algorithm's output is a Bioz stars score for each product that was mentioned in an article. Bioz is a data-driven platform for product recommendations, which is contrary to platforms such as TripAdvisor and OpenTable that are based on user-generated reviews and ratings. The recommendations and scoring system that the company has developed are meant to assist researchers with the process of developing future medications and finding cures for diseases. They are guided towards products and techniques that were previously used by other researchers when planning and performing experiments. The company's revenue is based on selling SaaS subscriptions to researchers in biopharma companies. They also charge product suppliers for content syndication.