AI Data Room

AI Data Room — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • KidDesk

    KidDesk

    KidDesk is an alternative desktop software application. The early childhood learning company Hatch Early Childhood created KidDesk; it subsequently went to Edmark, which was bought by IBM then sold to Riverdeep (now Houghton Mifflin Harcourt Learning Technology). KidDesk is compatible with Microsoft Windows 95 and newer, as well as Apple System 7 and newer. KidDesk can be set to start when the computer starts up, and can only be exited through password entry. Adults choose what programs are included for the child to use, what icon represented the desk, and customize the software programs available for use. == History == Edmark first started shipping KidDesk in 1992. In 1993, Edmark updated KidDesk with KidDesk Family Edition for Macintosh and DOS, adding more desk accessories and desk styles (Sometimes included as a free exclusive offer with the Early Learning House and Thinkin' Things Series). In 1995, KidDesk Family Edition was enhanced for Windows 95, and released one month after the new operating system shipped. In 1998, Edmark developed KidDesk Internet Safe. The Internet Safe edition was written for Windows 95, Windows 98, and Macintosh (including OS8). In 2008, HMH ported KidDesk Family Edition was to run on Windows Vista and in 2011 version 3.07 of KidDesk Family Edition was released as part of the 'Young Explorer' suite which is fully supported on Windows XP, Windows Vista and Windows 7. == Features == A picture editor incorporated into the desk. Used both in the Adult settings menu and in the desk itself. KidDesk users can edit their user logo with a pixel grid paint program. A calendar incorporated into the desk. This allows the user to set dates that the user finds important, and allows the date to be marked with a picture or text. A password exit feature. For security reasons, the adult can set a password so that KidDesk can only be exited if it is entered. As an extra security measure, the password exit function could only be accessed if the user pressed the ctrl + alt + A keyboard buttons simultaneously. A skin changer with several themes - farm, princess, sports, ocean, etc. These themes can be changed. The e-mail and voicemail features are customizable depending on the KidDesk installation. The ability to add websites that can be accessed on KidDesk, and the ability to block hyperlinks, JavaScript, data entry, etc., on said sites was an added for the 'Internet Safe' edition released in 1998. KidDesk Internet Safe edition is available in Spanish and Brazilian-Portuguese versions. == Reception == KidDesk was given a platinum award at the 1994 Oppenheim Toy Portfolio Awards. The judges praised the program's security features allowing "configur[ation] so that kids never have access to the possibly destructive DOS prompt", and concluded that "[i]f you and your kids share a computer, you need to install Kiddesk immediately!" === Awards === Since 1992, KidDesk has won 15 major awards.

    Read more →
  • Tucker decomposition

    Tucker decomposition

    In mathematics, Tucker decomposition decomposes a tensor into a set of matrices and one small core tensor. It is named after Ledyard R. Tucker although it goes back to Hitchcock in 1927. Initially described as a three-mode extension of factor analysis and principal component analysis it may actually be generalized to higher mode analysis, which is also called higher-order singular value decomposition (HOSVD) or the M-mode SVD. The algorithm to which the literature typically refers when discussing the Tucker decomposition or the HOSVD is the M-mode SVD algorithm introduced by Vasilescu and Terzopoulos, but misattributed to Tucker or De Lathauwer etal. It may be regarded as a more flexible PARAFAC (parallel factor analysis) model. In PARAFAC the core tensor is restricted to be "diagonal". In practice, Tucker decomposition is used as a modelling tool. For instance, it is used to model three-way (or higher way) data by means of relatively small numbers of components for each of the three or more modes, and the components are linked to each other by a three- (or higher-) way core array. The model parameters are estimated in such a way that, given fixed numbers of components, the modelled data optimally resemble the actual data in the least squares sense. The model gives a summary of the information in the data, in the same way as principal components analysis does for two-way data. For a 3rd-order tensor T ∈ F n 1 × n 2 × n 3 {\displaystyle T\in F^{n_{1}\times n_{2}\times n_{3}}} , where F {\displaystyle F} is either R {\displaystyle \mathbb {R} } or C {\displaystyle \mathbb {C} } , Tucker Decomposition can be denoted as follows, T = T × 1 U ( 1 ) × 2 U ( 2 ) × 3 U ( 3 ) {\displaystyle T={\mathcal {T}}\times _{1}U^{(1)}\times _{2}U^{(2)}\times _{3}U^{(3)}} where T ∈ F d 1 × d 2 × d 3 {\displaystyle {\mathcal {T}}\in F^{d_{1}\times d_{2}\times d_{3}}} is the core tensor, a 3rd-order tensor that contains the 1-mode, 2-mode and 3-mode singular values of T {\displaystyle T} , which are defined as the Frobenius norm of the 1-mode, 2-mode and 3-mode slices of tensor T {\displaystyle {\mathcal {T}}} respectively. U ( 1 ) , U ( 2 ) , U ( 3 ) {\displaystyle U^{(1)},U^{(2)},U^{(3)}} are unitary matrices in F d 1 × n 1 , F d 2 × n 2 , F d 3 × n 3 {\displaystyle F^{d_{1}\times n_{1}},F^{d_{2}\times n_{2}},F^{d_{3}\times n_{3}}} respectively. The k-mode product (k = 1, 2, 3) of T {\displaystyle {\mathcal {T}}} by U ( k ) {\displaystyle U^{(k)}} is denoted as T × U ( k ) {\displaystyle {\mathcal {T}}\times U^{(k)}} with entries as ( T × 1 U ( 1 ) ) ( i 1 , j 2 , j 3 ) = ∑ j 1 = 1 d 1 T ( j 1 , j 2 , j 3 ) U ( 1 ) ( j 1 , i 1 ) ( T × 2 U ( 2 ) ) ( j 1 , i 2 , j 3 ) = ∑ j 2 = 1 d 2 T ( j 1 , j 2 , j 3 ) U ( 2 ) ( j 2 , i 2 ) ( T × 3 U ( 3 ) ) ( j 1 , j 2 , i 3 ) = ∑ j 3 = 1 d 3 T ( j 1 , j 2 , j 3 ) U ( 3 ) ( j 3 , i 3 ) {\displaystyle {\begin{aligned}({\mathcal {T}}\times _{1}U^{(1)})(i_{1},j_{2},j_{3})&=\sum _{j_{1}=1}^{d_{1}}{\mathcal {T}}(j_{1},j_{2},j_{3})U^{(1)}(j_{1},i_{1})\\({\mathcal {T}}\times _{2}U^{(2)})(j_{1},i_{2},j_{3})&=\sum _{j_{2}=1}^{d_{2}}{\mathcal {T}}(j_{1},j_{2},j_{3})U^{(2)}(j_{2},i_{2})\\({\mathcal {T}}\times _{3}U^{(3)})(j_{1},j_{2},i_{3})&=\sum _{j_{3}=1}^{d_{3}}{\mathcal {T}}(j_{1},j_{2},j_{3})U^{(3)}(j_{3},i_{3})\end{aligned}}} Altogether, the decomposition may also be written more directly as T ( i 1 , i 2 , i 3 ) = ∑ j 1 = 1 d 1 ∑ j 2 = 1 d 2 ∑ j 3 = 1 d 3 T ( j 1 , j 2 , j 3 ) U ( 1 ) ( j 1 , i 1 ) U ( 2 ) ( j 2 , i 2 ) U ( 3 ) ( j 3 , i 3 ) {\displaystyle T(i_{1},i_{2},i_{3})=\sum _{j_{1}=1}^{d_{1}}\sum _{j_{2}=1}^{d_{2}}\sum _{j_{3}=1}^{d_{3}}{\mathcal {T}}(j_{1},j_{2},j_{3})U^{(1)}(j_{1},i_{1})U^{(2)}(j_{2},i_{2})U^{(3)}(j_{3},i_{3})} Taking d i = n i {\displaystyle d_{i}=n_{i}} for all i {\displaystyle i} is always sufficient to represent T {\displaystyle T} exactly, but often T {\displaystyle T} can be compressed or efficiently approximately by choosing d i < n i {\displaystyle d_{i} Read more →

  • Crossover (evolutionary algorithm)

    Crossover (evolutionary algorithm)

    Crossover in evolutionary algorithms and evolutionary computation, also called recombination, is a genetic operator used to combine the genetic information of two parents to generate new offspring. It is one way to stochastically generate new solutions from an existing population, and is analogous to the crossover that happens during sexual reproduction in biology. New solutions can also be generated by cloning an existing solution, which is analogous to asexual reproduction. Newly generated solutions may be mutated before being added to the population. The aim of recombination is to transfer good characteristics from two different parents to one child. Different algorithms in evolutionary computation may use different data structures to store genetic information, and each genetic representation can be recombined with different crossover operators. Typical data structures that can be recombined with crossover are bit arrays, vectors of real numbers, or trees. The list of operators presented below is by no means complete and serves mainly as an exemplary illustration of this dyadic genetic operator type. More operators and more details can be found in the literature. == Crossover for binary arrays == Traditional genetic algorithms store genetic information in a chromosome represented by a bit array. Crossover methods for bit arrays are popular and an illustrative example of genetic recombination. === One-point crossover === A point on both parents' chromosomes is picked randomly, and designated a 'crossover point'. Bits to the right of that point are swapped between the two parent chromosomes. This results in two offspring, each carrying some genetic information from both parents. === Two-point and k-point crossover === In two-point crossover, two crossover points are picked randomly from the parent chromosomes. The bits in between the two points are swapped between the parent organisms. Two-point crossover is equivalent to performing two single-point crossovers with different crossover points. This strategy can be generalized to k-point crossover for any positive integer k, picking k crossover points. === Uniform crossover === In uniform crossover, typically, each bit is chosen from either parent with equal probability. Other mixing ratios are sometimes used, resulting in offspring which inherit more genetic information from one parent than the other. In a uniform crossover, we don’t divide the chromosome into segments, rather we treat each gene separately. In this, we essentially flip a coin for each chromosome to decide whether or not it will be included in the off-spring. == Crossover for integer or real-valued genomes == For the crossover operators presented above and for most other crossover operators for bit strings, it holds that they can also be applied accordingly to integer or real-valued genomes whose genes each consist of an integer or real-valued number. Instead of individual bits, integer or real-valued numbers are then simply copied into the child genome. The offspring lie on the remaining corners of the hyperbody spanned by the two parents P 1 = ( 1.5 , 6 , 8 ) {\displaystyle P_{1}=(1.5,6,8)} and P 2 = ( 7 , 2 , 1 ) {\displaystyle P_{2}=(7,2,1)} , as exemplified in the accompanying image for the three-dimensional case. === Discrete recombination === If the rules of the uniform crossover for bit strings are applied during the generation of the offspring, this is also called discrete recombination. === Intermediate recombination === In this recombination operator, the allele values of the child genome a i {\displaystyle a_{i}} are generated by mixing the alleles of the two parent genomes a i , P 1 {\displaystyle a_{i,P_{1}}} and a i , P 2 {\displaystyle a_{i,P_{2}}} : α i = α i , P 1 ⋅ β i + α i , P 2 ⋅ ( 1 − β i ) w i t h β i ∈ [ − d , 1 + d ] {\displaystyle \alpha _{i}=\alpha _{i,P_{1}}\cdot \beta _{i}+\alpha _{i,P_{2}}\cdot \left(1-\beta _{i}\right)\quad {\mathsf {with}}\quad \beta _{i}\in \left[-d,1+d\right]} randomly equally distributed per gene i {\displaystyle i} The choice of the interval [ − d , 1 + d ] {\displaystyle [-d,1+d]} causes that besides the interior of the hyperbody spanned by the allele values of the parent genes additionally a certain environment for the range of values of the offspring is in question. A value of 0.25 {\displaystyle 0.25} is recommended for d {\displaystyle d} to counteract the tendency to reduce the allele values that otherwise exists at d = 0 {\displaystyle d=0} . The adjacent figure shows for the two-dimensional case the range of possible new alleles of the two exemplary parents P 1 = ( 3 , 6 ) {\displaystyle P_{1}=(3,6)} and P 2 = ( 9 , 2 ) {\displaystyle P_{2}=(9,2)} in intermediate recombination. The offspring of discrete recombination C 1 {\displaystyle C_{1}} and C 2 {\displaystyle C_{2}} are also plotted. Intermediate recombination satisfies the arithmetic calculation of the allele values of the child genome required by virtual alphabet theory. Discrete and intermediate recombination are used as a standard in the evolution strategy. == Crossover for permutations == For combinatorial tasks, permutations are usually used that are specifically designed for genomes that are themselves permutations of a set. The underlying set is usually a subset of N {\displaystyle \mathbb {N} } or N 0 {\displaystyle \mathbb {N} _{0}} . If 1- or n-point or uniform crossover for integer genomes is used for such genomes, a child genome may contain some values twice and others may be missing. This can be remedied by genetic repair, e.g. by replacing the redundant genes in positional fidelity for missing ones from the other child genome. In order to avoid the generation of invalid offspring, special crossover operators for permutations have been developed which fulfill the basic requirements of such operators for permutations, namely that all elements of the initial permutation are also present in the new one and only the order is changed. It can be distinguished between combinatorial tasks, where all sequences are admissible, and those where there are constraints in the form of inadmissible partial sequences. A well-known representative of the first task type is the traveling salesman problem (TSP), where the goal is to visit a set of cities exactly once on the shortest tour. An example of the constrained task type is the scheduling of multiple workflows. Workflows involve sequence constraints on some of the individual work steps. For example, a thread cannot be cut until the corresponding hole has been drilled in a workpiece. Such problems are also called order-based permutations. In the following, two crossover operators are presented as examples, the partially mapped crossover (PMX) motivated by the TSP and the order crossover (OX1) designed for order-based permutations. A second offspring can be produced in each case by exchanging the parent chromosomes. === Partially mapped crossover (PMX) === The PMX operator was designed as a recombination operator for TSP like Problems. The explanation of the procedure is illustrated by an example: === Order crossover (OX1) === The order crossover goes back to Davis in its original form and is presented here in a slightly generalized version with more than two crossover points. It transfers information about the relative order from the second parent to the offspring. First, the number and position of the crossover points are determined randomly. The resulting gene sequences are then processed as described below: Among other things, order crossover is well suited for scheduling multiple workflows, when used in conjunction with 1- and n-point crossover. === Further crossover operators for permutations === Over time, a large number of crossover operators for permutations have been proposed, so the following list is only a small selection. For more information, the reader is referred to the literature. cycle crossover (CX) order-based crossover (OX2) position-based crossover (POS) edge recombination voting recombination (VR) alternating-positions crossover (AP) maximal preservative crossover (MPX) merge crossover (MX) sequential constructive crossover operator (SCX) The usual approach to solving TSP-like problems by genetic or, more generally, evolutionary algorithms, presented earlier, is either to repair illegal descendants or to adjust the operators appropriately so that illegal offspring do not arise in the first place. Alternatively, Riazi suggests the use of a double chromosome representation, which avoids illegal offspring.

    Read more →
  • Frequent pattern discovery

    Frequent pattern discovery

    Frequent pattern discovery (or FP discovery, FP mining, or Frequent itemset mining) is part of knowledge discovery in databases, Massive Online Analysis, and data mining; it describes the task of finding the most frequent and relevant patterns in large datasets. The concept was first introduced for mining transaction databases. Frequent patterns are defined as subsets (itemsets, subsequences, or substructures) that appear in a data set with frequency no less than a user-specified or auto-determined threshold. == Techniques == Techniques for FP mining include: market basket analysis cross-marketing catalog design clustering classification recommendation systems For the most part, FP discovery can be done using association rule learning with particular algorithms Eclat, FP-growth and the Apriori algorithm. Other strategies include: Frequent subtree mining Structure mining Sequential pattern mining and respective specific techniques. Implementations exist for various machine learning systems or modules like MLlib for Apache Spark.

    Read more →
  • Tiki Wiki CMS Groupware

    Tiki Wiki CMS Groupware

    Tiki Wiki CMS Groupware or simply Tiki, originally known as TikiWiki, is a free and open source Wiki-based content management system and online office suite written primarily in PHP and distributed under the GNU Lesser General Public License (LGPL-2.1-only) license. In addition to enabling websites and portals on the internet and on intranets and extranets, Tiki contains a number of collaboration features allowing it to operate as a Geospatial Content Management System (GeoCMS) and Groupware web application. Tiki includes all the basic features common to most CMSs such as the ability to register and maintain individual user accounts within a flexible and rich permission / privilege system, create and manage menus, RSS-feeds, customize page layout, perform logging, and administer the system. All administration tasks are accomplished through a browser-based user interface. Tiki features an all-in-one design, as opposed to a core+extensions model followed by other CMSs. This allows for future-proof upgrades (since all features are released together), but has the drawback of an extremely large codebase (more than 1,000,000 lines). Tiki can run on any computing platform that supports both a web server capable of running PHP 5 (including Apache HTTP Server, IIS, Lighttpd, Hiawatha, Cherokee, and nginx) and a MySQL/MariaDB database to store content and settings. == Major components == Tiki has four major categories of components: content creation and management tools, content organization tools and navigation aids, communication tools, and configuration and administration tools. These components enable administrators and users to create and manage content, as well as letting them communicate to others and configure sites. In addition, Tiki allows each user to choose from various visual themes. These themes are implemented using CSS and the open source Smarty template engine. Additional themes can be created by a Tiki administrator for branding or customization as well. == Internationalization == Tiki is an international project, supporting many languages. The default interface language in Tiki is English, but any language that can be encoded and displayed using the UTF-8 encoding can be supported. Translated strings can be included via an external language file, or by translating interface strings directly, through the database. As of 29 September 2005, Tiki had been fully translated into eight languages and reportedly 90% or more translated into another five languages, as well as partial translations for nine additional languages. Tiki also supports interactive translation of actual wiki pages and was the initial wiki engine used in the Cross Lingual Wiki Engine Project. This allows Tiki-based web sites to have translated content — not just the user interface. == Implementation == Tiki is developed primarily in PHP with some JavaScript code. It uses MySQL/MariaDB as a database. It will run on any server that provides PHP 5, including Apache and Microsoft's IIS. Tiki components make extensive use of other open source projects, including Zend Framework, Smarty, jQuery, HTML Purifier, FCKeditor, Raphaël, phpCAS, and Morcego. When used with Mapserver Tiki can become a Geospatial Content Management System. == Project team == Tiki is under active development by a large international community of over 300 developers and translators, and is one of the largest open-source teams in the world. Project members have donated the resources and bandwidth required to host the tiki.org website and various subdomains. The project members refer to this dependence on their own product as "eating their own dogfood", which they have been doing since the early days of the project. Tiki community members also participate in various related events such as WikiSym and the Libre Software Meeting. == History == Tiki has been hosted on SourceForge.net since its initial release (Release 0.9, named Spica) in October 2002. It was primarily the development of Luis Argerich (Buenos Aires, Argentina), Eduardo Polidor (São Paulo, Brazil), and Garland Foster (Green Bay, WI, United States). In July 2003, Tiki was named the SourceForge.net July 2003 Project of the Month. In late 2003, a fork of Tiki was used to create Bitweaver. In 2006, Tiki was named to CMS Report's Top 30 Web Applications. In 2008, Tiki was named to EContent magazine's Top 100 In 2009, Tiki adopted a six-month release cycle and announced the selection of a Long Term Support (LTS) version and the Tiki Software Community Association was formed as the legal steward for Tiki. The Tiki Software Association is a not-for-profit entity established in Canada. Previously, the entire project was run entirely by volunteers. In 2010, Tiki received Best of Open Source Software Applications Award (BOSSIE) from InfoWorld, in the Applications category. In 2011, Tiki was named to CMS Report's Top 30 Web Applications. In 2012, Tiki was named "Best Web Tool" by WebHostingSearch.com, and "People's Choice: Best Free CMS" by CMS Critic. In 2016, Tiki was named as one of the "10 Best Open Source Collaboration Software Tools" by Small Business Computing. == Name == The name TikiWiki is written in CamelCase, a common Wiki syntax indicating a hyperlink within the Wiki. It is most likely a compound word combining two Polynesian terms, Tiki and Wiki, to create a self-rhyming name similar to wikiwiki, a common variant of wiki. A backronym has also been formed for Tiki: Tightly Integrated Knowledge Infrastructure. == Release Information and History == In general, the Tiki Software Community Association releases a new major version of Tiki Wiki every 8 months where prior, non-LTS, major versions are supported until the first minor version release of the next major version (i.e., 16.0 ⇒ 17.1). Starting with version 12.x, Tiki Wiki LTS is supported for 5 years where it enters a security/maintenance release cycle upon the release of the next LTS version. Tiki Wiki's release history is outlined below.

    Read more →
  • Latent Dirichlet allocation

    Latent Dirichlet allocation

    In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that explains how a collection of text documents can be described by a set of unobserved "topics." For example, given a set of news articles, LDA might discover that one topic is characterized by words like "president", "government", and "election", while another is characterized by "team", "game", and "score". It is one of the most common topic models. The LDA model was first presented as a graphical model for population genetics by J. K. Pritchard, M. Stephens and P. Donnelly in 2000. The model was subsequently applied to machine learning by David Blei, Andrew Ng, and Michael I. Jordan in 2003. Although its most frequent application is in modeling text corpora, it has also been used for other problems, such as in clinical psychology, social science, and computational musicology. The core assumption of LDA is that documents are represented as a random mixture of latent topics, and each topic is characterized by a probability distribution over words. The model is a generalization of probabilistic latent semantic analysis (pLSA), differing primarily in that LDA treats the topic mixture as a Dirichlet prior, leading to more reasonable mixtures and less susceptibility to overfitting. Learning the latent topics and their associated probabilities from a corpus is typically done using Bayesian inference, often with methods like Gibbs sampling or variational Bayes. == History == In the context of population genetics, LDA was proposed by J. K. Pritchard, M. Stephens and P. Donnelly in 2000. LDA was applied in machine learning by David Blei, Andrew Ng and Michael I. Jordan in 2003. == Overview == === Population genetics === In population genetics, the model is used to detect the presence of structured genetic variation in a group of individuals. The model assumes that alleles carried by individuals under study have origin in various extant or past populations. The model and various inference algorithms allow scientists to estimate the allele frequencies in those source populations and the origin of alleles carried by individuals under study. The source populations can be interpreted ex-post in terms of various evolutionary scenarios. In association studies, detecting the presence of genetic structure is considered a necessary preliminary step to avoid confounding. === Clinical psychology, mental health, and social science === In clinical psychology research, LDA has been used to identify common themes of self-images experienced by young people in social situations. Other social scientists have used LDA to examine large sets of topical data from discussions on social media (e.g., tweets about prescription drugs). Additionally, supervised Latent Dirichlet Allocation with covariates (SLDAX) has been specifically developed to combine latent topics identified in texts with other manifest variables. This approach allows for the integration of text data as predictors in statistical regression analyses, improving the accuracy of mental health predictions. One of the main advantages of SLDAX over traditional two-stage approaches is its ability to avoid biased estimates and incorrect standard errors, allowing for a more accurate analysis of psychological texts. In the field of social sciences, LDA has proven to be useful for analyzing large datasets, such as social media discussions. For instance, researchers have used LDA to investigate tweets discussing socially relevant topics, like the use of prescription drugs and cultural differences in China. By analyzing these large text corpora, it is possible to uncover patterns and themes that might otherwise go unnoticed, offering valuable insights into public discourse and perception in real time. === Musicology === In the context of computational musicology, LDA has been used to discover tonal structures in different corpora. === Machine learning === One application of LDA in machine learning – specifically, topic discovery, a subproblem in natural language processing – is to discover topics in a collection of documents, and then automatically classify any individual document within the collection in terms of how "relevant" it is to each of the discovered topics. A topic is considered to be a set of terms (i.e., individual words or phrases) that, taken together, suggest a shared theme. For example, in a document collection related to pet animals, the terms dog, spaniel, beagle, golden retriever, puppy, bark, and woof would suggest a DOG_related theme, while the terms cat, siamese, Maine coon, tabby, manx, meow, purr, and kitten would suggest a CAT_related theme. There may be many more topics in the collection – e.g., related to diet, grooming, healthcare, behavior, etc. that we do not discuss for simplicity's sake. (Very common, so called stop words in a language – e.g., "the", "an", "that", "are", "is", etc., – would not discriminate between topics and are usually filtered out by pre-processing before LDA is performed. Pre-processing also converts terms to their "root" lexical forms – e.g., "barks", "barking", and "barked" would be converted to "bark".) If the document collection is sufficiently large, LDA will discover such sets of terms (i.e., topics) based upon the co-occurrence of individual terms, though the task of assigning a meaningful label to an individual topic (i.e., that all the terms are DOG_related) is up to the user, and often requires specialized knowledge (e.g., for collection of technical documents). The LDA approach assumes that: The semantic content of a document is composed by combining one or more terms from one or more topics. Certain terms are ambiguous, belonging to more than one topic, with different probability. (For example, the term training can apply to both dogs and cats, but are more likely to refer to dogs, which are used as work animals or participate in obedience or skill competitions.) However, in a document, the accompanying presence of specific neighboring terms (which belong to only one topic) will disambiguate their usage. Most documents will contain only a relatively small number of topics. In the collection, e.g., individual topics will occur with differing frequencies. That is, they have a probability distribution, so that a given document is more likely to contain some topics than others. Within a topic, certain terms will be used much more frequently than others. In other words, the terms within a topic will also have their own probability distribution. When LDA machine learning is employed, both sets of probabilities are computed during the training phase, using Bayesian methods and an expectation–maximization algorithm. LDA is a generalization of older approach of probabilistic latent semantic analysis (pLSA), The pLSA model is equivalent to LDA under a uniform Dirichlet prior distribution. pLSA relies on only the first two assumptions above and does not care about the remainder. While both methods are similar in principle and require the user to specify the number of topics to be discovered before the start of training (as with k-means clustering) LDA has the following advantages over pLSA: LDA yields better disambiguation of words and a more precise assignment of documents to topics. Computing probabilities allows a "generative" process by which a collection of new "synthetic documents" can be generated that would closely reflect the statistical characteristics of the original collection. Unlike LDA, pLSA is vulnerable to overfitting especially when the size of corpus increases. The LDA algorithm is more readily amenable to scaling up for large data sets using the MapReduce approach on a computing cluster. == Model == With plate notation, which is often used to represent probabilistic graphical models (PGMs), the dependencies among the many variables can be captured concisely. The boxes are "plates" representing replicates, which are repeated entities. The outer plate represents documents, while the inner plate represents the repeated word positions in a given document; each position is associated with a choice of topic and word. The variable names are defined as follows: M denotes the number of documents N is number of words in a given document (document i has N i {\displaystyle N_{i}} words) α is the parameter of the Dirichlet prior on the per-document topic distributions β is the parameter of the Dirichlet prior on the per-topic word distribution θ i {\displaystyle \theta _{i}} is the topic distribution for document i φ k {\displaystyle \varphi _{k}} is the word distribution for topic k z i j {\displaystyle z_{ij}} is the topic for the j-th word in document i w i j {\displaystyle w_{ij}} is the specific word. The fact that W is grayed out means that words w i j {\displaystyle w_{ij}} are the only observable variables, and the other variables are latent variables. As proposed in the original paper, a sparse Dirichlet prior can be used to model the to

    Read more →
  • Sharpness aware minimization

    Sharpness aware minimization

    Sharpness Aware Minimization (SAM) is an optimization algorithm used in machine learning that aims to improve model generalization. The method seeks to find model parameters that are located in regions of the loss landscape with uniformly low loss values, rather than parameters that only achieve a minimal loss value at a single point. This approach is described as finding "flat" minima instead of "sharp" ones. The rationale is that models trained this way are less sensitive to variations between training and test data, which can lead to better performance on unseen data. The algorithm was introduced in a 2020 paper by a team of researchers including Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. == Underlying Principle == SAM modifies the standard training objective by minimizing a "sharpness-aware" loss. This is formulated as a minimax problem where the inner objective seeks to find the highest loss value in the immediate neighborhood of the current model weights, and the outer objective minimizes this value: min w max ‖ ϵ ‖ p ≤ ρ L train ( w + ϵ ) + λ ‖ w ‖ 2 2 {\displaystyle \min _{w}\max _{\|\epsilon \|_{p}\leq \rho }L_{\text{train}}(w+\epsilon )+\lambda \|w\|_{2}^{2}} In this formulation: w {\displaystyle w} represents the model's parameters (weights). L train {\displaystyle L_{\text{train}}} is the loss calculated on the training data. ϵ {\displaystyle \epsilon } is a perturbation applied to the weights. ρ {\displaystyle \rho } is a hyperparameter that defines the radius of the neighborhood (an L p {\displaystyle L_{p}} ball) to search for the highest loss. An optional L2 regularization term, scaled by λ {\displaystyle \lambda } , can be included. A direct solution to the inner maximization problem is computationally expensive. SAM approximates it by taking a single gradient ascent step to find the perturbation ϵ {\displaystyle \epsilon } . This is calculated as: ϵ ( w ) = ρ ∇ L train ( w ) ‖ ∇ L train ( w ) ‖ 2 {\displaystyle \epsilon (w)=\rho {\frac {\nabla L_{\text{train}}(w)}{\|\nabla L_{\text{train}}(w)\|_{2}}}} The optimization process for each training step involves two stages. First, an "ascent step" computes a perturbed set of weights, w adv = w + ϵ ( w ) {\displaystyle w_{\text{adv}}=w+\epsilon (w)} , by moving towards the direction of the highest local loss. Second, a "descent step" updates the original weights w {\displaystyle w} using the gradient calculated at these perturbed weights, ∇ L train ( w adv ) {\displaystyle \nabla L_{\text{train}}(w_{\text{adv}})} . This update is typically performed using a standard optimizer like SGD or Adam. == Application and Performance == SAM has been applied in various machine learning contexts, primarily in computer vision. Research has shown it can improve generalization performance in models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) on image datasets including ImageNet, CIFAR-10, and CIFAR-100. The algorithm has also been found to be effective in training models with noisy labels, where it performs comparably to methods designed specifically for this problem. Some studies indicate that SAM and its variants can improve out-of-distribution (OOD) generalization, which is a model's ability to perform well on data from distributions not seen during training. Other areas where it has been applied include gradual domain adaptation and mitigating overfitting in scenarios with repeated exposure to training examples. == Limitations == A primary limitation of SAM is its computational cost. By requiring two gradient computations (one for the ascent and one for the descent) per optimization step, it approximately doubles the training time compared to standard optimizers. The theoretical convergence properties of SAM are still under investigation. Some research suggests that with a constant step size, SAM may not converge to a stationary point. The accuracy of the single gradient step approximation for finding the worst-case perturbation may also decrease during the training process. The effectiveness of SAM can also be domain-dependent. While it has shown benefits for computer vision tasks, its impact on other areas, such as GPT-style language models where each training example is seen only once, has been reported as limited in some studies. Furthermore, while SAM seeks flat minima, some research suggests that not all flat minima necessarily lead to good generalization. The algorithm also introduces the neighborhood size ρ {\displaystyle \rho } as a new hyperparameter, which requires tuning. == Research, Variants, and Enhancements == Active research on SAM focuses on reducing its computational overhead and improving its performance. Several variants have been proposed to make the algorithm more efficient. These include methods that attempt to parallelize the two gradient computations, apply the perturbation to only a subset of parameters, or reduce the number of computation steps required. Other approaches use historical gradient information or apply SAM steps intermittently to lower the computational burden. To improve performance and robustness, variants have been developed that adapt the neighborhood size based on model parameter scales (Adaptive SAM or ASAM) or incorporate information about the curvature of the loss landscape (Curvature Regularized SAM or CR-SAM). Other research explores refining the perturbation step by focusing on specific components of the gradient or combining SAM with techniques like random smoothing. Theoretical work continues to analyze the algorithm's behavior, including its implicit bias towards flatter minima and the development of broader frameworks for sharpness-aware optimization that use different measures of sharpness.

    Read more →
  • Softplus

    Softplus

    In mathematics and machine learning, the softplus function is f ( x ) = ln ⁡ ( 1 + e x ) . {\displaystyle f(x)=\ln(1+e^{x}).} It is a smooth approximation (in fact, an analytic function) to the ramp function, which is known as the rectifier or ReLU (rectified linear unit) in machine learning. For large negative x {\displaystyle x} it is ln ⁡ ( 1 + e x ) = ln ⁡ ( 1 + ϵ ) ⪆ ln ⁡ 1 = 0 {\displaystyle \ln(1+e^{x})=\ln(1+\epsilon )\gtrapprox \ln 1=0} , so just above 0, while for large positive x {\displaystyle x} it is ln ⁡ ( 1 + e x ) ⪆ ln ⁡ ( e x ) = x {\displaystyle \ln(1+e^{x})\gtrapprox \ln(e^{x})=x} , so just above x {\displaystyle x} . The names softplus and SmoothReLU are used in machine learning. The name "softplus" (2000), by analogy with the earlier softmax (1989) is presumably because it is a smooth (soft) approximation of the positive part of x, which is sometimes denoted with a superscript plus, x + := max ( 0 , x ) {\displaystyle x^{+}:=\max(0,x)} . == Alternative forms == This function can be approximated as: ln ⁡ ( 1 + e x ) ≈ { ln ⁡ 2 , x = 0 , x 1 − e − x / ln ⁡ 2 , x ≠ 0 {\displaystyle \ln \left(1+e^{x}\right)\approx {\begin{cases}\ln 2,&x=0,\\[6pt]{\frac {x}{1-e^{-x/\ln 2}}},&x\neq 0\end{cases}}} By making the change of variables x = y ln ⁡ ( 2 ) {\displaystyle x=y\ln(2)} , this is equivalent to log 2 ⁡ ( 1 + 2 y ) ≈ { 1 , y = 0 , y 1 − e − y , y ≠ 0. {\displaystyle \log _{2}(1+2^{y})\approx {\begin{cases}1,&y=0,\\[6pt]{\frac {y}{1-e^{-y}}},&y\neq 0.\end{cases}}} A sharpness parameter k {\displaystyle k} may be included: f ( x ) = ln ⁡ ( 1 + e k x ) k , f ′ ( x ) = e k x 1 + e k x = 1 1 + e − k x . {\displaystyle f(x)={\frac {\ln(1+e^{kx})}{k}},\qquad \qquad f'(x)={\frac {e^{kx}}{1+e^{kx}}}={\frac {1}{1+e^{-kx}}}.} Additionally, the softplus function is equivalent to the log of the sigmoid function in the following way: − ln ⁡ ( sigmoid ( − x ) ) = − ln ⁡ ( 1 1 + e x ) = ln ⁡ ( 1 + e x ) = softplus ( x ) {\displaystyle -\ln({\text{sigmoid}}(-x))=-\ln \left({\frac {1}{1+e^{x}}}\right)=\ln \left(1+e^{x}\right)={\text{softplus}}(x)} == Related functions == The derivative of softplus is the standard logistic function: f ′ ( x ) = e x 1 + e x = 1 1 + e − x {\displaystyle f'(x)={\frac {e^{x}}{1+e^{x}}}={\frac {1}{1+e^{-x}}}} The logistic function or the sigmoid function is a smooth approximation of the rectifier, the Heaviside step function. === LogSumExp === The multivariable generalization of single-variable softplus is the LogSumExp with the first argument set to zero: L S E 0 + ⁡ ( x 1 , … , x n ) := LSE ⁡ ( 0 , x 1 , … , x n ) = ln ⁡ ( 1 + e x 1 + ⋯ + e x n ) . {\displaystyle \operatorname {LSE_{0}} ^{+}(x_{1},\dots ,x_{n}):=\operatorname {LSE} (0,x_{1},\dots ,x_{n})=\ln(1+e^{x_{1}}+\cdots +e^{x_{n}}).} The LogSumExp function is LSE ⁡ ( x 1 , … , x n ) = ln ⁡ ( e x 1 + ⋯ + e x n ) , {\displaystyle \operatorname {LSE} (x_{1},\dots ,x_{n})=\ln(e^{x_{1}}+\cdots +e^{x_{n}}),} and its gradient is the softmax; the softmax with the first argument set to zero is the multivariable generalization of the logistic function. Both LogSumExp and softmax are used in machine learning. === Convex conjugate === The convex conjugate (specifically, the Legendre transformation) of the softplus function is the negative binary entropy function (with base e). This is because (following the definition of the Legendre transformation: the derivatives are inverse functions) the derivative of softplus is the logistic function, whose inverse function is the logit, which is the derivative of negative binary entropy. Softplus can be interpreted as logistic loss (as a positive number), so, by duality, minimizing logistic loss corresponds to maximizing entropy. This justifies the principle of maximum entropy as loss minimization.

    Read more →
  • Topincs

    Topincs

    Topincs is a software for rapid development of web databases and web applications. It is based on LAMP and the semantic technology Topic Maps. A Topincs web database makes information accessible through browsing very much like a Wiki. Editing a page on a subject is done through forms rather than markup editing. A web database can be tailored into a web application to provide specific user groups a contextualized approach to the data. All modeling and development tasks are performed in the web browser. No other development tools are necessary. The server requires Apache, MySQL and PHP. The client works on any standards-compliant web browser on desktops, laptops, tablets, and mobile phones. The layout is automatically adjusted to smaller screens. The programmatic access to data is done via a virtual object-oriented programming interface which is set up over the schema in a few minutes. It is interpreted rather than generated. Portions of the database can be pulled into memory to accelerate bulk access. == Features == Browseable data High-quality web forms Little to no programming Development done in the browser, no other tools required Client runs in any standard-compliant web browser Virtual object-oriented programming interface User interface adjusts to screen size Supports desktops, laptops, tablets, and mobile phones Flexible data modeling == Challenges == Requires rethinking the development process and dropping many hard learned habits Requires a familiarity with two ISO standards ISO 13259 and 19756 Forms cannot be easily adjusted in layout and behavior Server installation difficult and prone to error == License == Topincs can be used in a private network for any purpose for free. The use in a public network is restricted to non-commercial applications.

    Read more →
  • Clustering illusion

    Clustering illusion

    The clustering illusion is the tendency to erroneously consider the inevitable "streaks" or "clusters" arising in small samples from random distributions to be non-random. The illusion is caused by a human tendency to underpredict the amount of variability likely to appear in a small sample of random or pseudorandom data. Thomas Gilovich, an early author on the subject, argued that the effect occurs for different types of random dispersions. Some might perceive patterns in stock market price fluctuations over time, or clusters in two-dimensional data such as the locations of impact of World War II V-1 flying bombs on maps of London. Although Londoners developed specific theories about the pattern of impacts within London, a statistical analysis by R. D. Clarke originally published in 1946 showed that the impacts of V-2 rockets on London were a close fit to a random distribution. == Similar biases == Using this cognitive bias in causal reasoning may result in the Texas sharpshooter fallacy, in which differences in data are ignored and similarities are overemphasized. More general forms of erroneous pattern recognition are pareidolia and apophenia. Related biases are the illusion of control which the clustering illusion could contribute to, and insensitivity to sample size in which people don't expect greater variation in smaller samples. A different cognitive bias involving misunderstanding of chance streams is the gambler's fallacy. == Possible causes == Daniel Kahneman and Amos Tversky explained this kind of misprediction as being caused by the representativeness heuristic (which itself they also first proposed).

    Read more →
  • NETtalk (artificial neural network)

    NETtalk (artificial neural network)

    NETtalk is an artificial neural network that learns to pronounce written English text by supervised learning. It takes English text as input, and produces a matching phonetic transcriptions as output. It is the result of research carried out in the mid-1980s by Terrence Sejnowski and Charles Rosenberg. The intent behind NETtalk was to construct simplified models that might shed light on the complexity of learning human level cognitive tasks, and their implementation as a connectionist model that could also learn to perform a comparable task. The authors trained it by backpropagation. The network was trained on a large amount of English words and their corresponding pronunciations, and is able to generate pronunciations for unseen words with a high level of accuracy. The output of the network was a stream of phonemes, which fed into DECtalk to produce audible speech. It achieved popular success, appearing on the Today show. From the point of view of modeling human cognition, NETtalk does not specifically model the image processing stages and letter recognition of the visual cortex. Rather, it assumes that the letters have been pre-classified and recognized. It is NETtalk's task to learn proper associations between the correct pronunciation with a given sequence of letters based on the context in which the letters appear. A similar architecture was subsequently used for the opposite task, that of converting continuous speech signal to a phoneme sequence. == Training == The training dataset was a 20,008-word subset of the Brown Corpus, with manually annotated phoneme and stress for each letter. The development process was described in a 1993 interview. It took three months -- 250 person-hours -- to create the training dataset, but only a few days to train the network. After it was run successfully on this, the authors tried it on a phonological transcription of an interview with a young Latino boy from a barrio in Los Angeles. This resulted in a network that reproduced his Spanish accent. The original NETtalk was implemented on a Ridge 32, which took 0.275 seconds per learning step (one forward and one backward pass). Training NETtalk became a benchmark to test for the efficiency of backpropagation programs. For example, an implementation on Connection Machine-1 (with 16384 processors) ran at 52x speedup. An implementation on a 10-cell Warp ran at 340x speedup. The following table compiles the benchmark scores as of 1988. Speed is measured in "millions of connections per second" (MCPS). For example, the original NETtalk on Ridge 32 took 0.275 seconds per forward-backward pass, giving 18629 / 10 6 0.275 = 0.068 {\displaystyle {\frac {18629/10^{6}}{0.275}}=0.068} MCPS. Relative times are normalized to the MicroVax. == Architecture == The network had three layers and 18,629 adjustable weights, large by the standards of 1986. There were worries that it would overfit the dataset, but it was trained successfully. The input of the network has 203 units, divided into 7 groups of 29 units each. Each group is a one-hot encoding of one character. There are 29 possible characters: 26 letters, comma, period, and word boundary (whitespace). To produce the pronunciation of a single character, the network takes the character itself, as well as 3 characters before and 3 characters after it. The hidden layer has 80 units. The output has 26 units. 21 units encode for articulatory features (point of articulation, voicing, vowel height, etc.) of phonemes, and 5 units encode for stress and syllable boundaries. Sejnowski studied the learned representation in the network, and found that phonemes that sound similar are clustered together in representation space. The output of the network degrades, but remains understandable, when some hidden neurons are removed.

    Read more →
  • Bayesian network

    Bayesian network

    A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms can perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables (e.g. speech signals or protein sequences) are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams. == Graphical model == Formally, Bayesian networks are directed acyclic graphs (DAGs) whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Each edge represents a direct conditional dependency. Any pair of nodes that are not connected (i.e. no path connects one node to the other) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. For example, if m {\displaystyle m} parent nodes represent m {\displaystyle m} Boolean variables, then the probability function could be represented by a table of 2 m {\displaystyle 2^{m}} entries, one entry for each of the 2 m {\displaystyle 2^{m}} possible parent combinations. Similar ideas may be applied to undirected, and possibly cyclic, graphs such as Markov networks. == Example == Suppose we want to model the dependencies between three variables: the sprinkler (or more appropriately, its state - whether it is on or not), the presence or absence of rain and whether the grass is wet or not. Observe that two events can cause the grass to become wet: an active sprinkler or rain. Rain has a direct effect on the use of the sprinkler (namely that when it rains, the sprinkler usually is not active). This situation can be modeled with a Bayesian network (shown to the right). Each variable has two possible values, T (for true) and F (for false). The joint probability function is, by the chain rule of probability, Pr ( G , S , R ) = Pr ( G ∣ S , R ) Pr ( S ∣ R ) Pr ( R ) {\displaystyle \Pr(G,S,R)=\Pr(G\mid S,R)\Pr(S\mid R)\Pr(R)} where G = "Grass wet (true/false)", S = "Sprinkler turned on (true/false)", and R = "Raining (true/false)". The model can answer questions about the presence of a cause given the presence of an effect (so-called inverse probability) like "What is the probability that it is raining, given the grass is wet?" by using the conditional probability formula and summing over all nuisance variables: Pr ( R = T ∣ G = T ) = Pr ( G = T , R = T ) Pr ( G = T ) = ∑ x ∈ { T , F } Pr ( G = T , S = x , R = T ) ∑ x , y ∈ { T , F } Pr ( G = T , S = x , R = y ) {\displaystyle \Pr(R=T\mid G=T)={\frac {\Pr(G=T,R=T)}{\Pr(G=T)}}={\frac {\sum _{x\in \{T,F\}}\Pr(G=T,S=x,R=T)}{\sum _{x,y\in \{T,F\}}\Pr(G=T,S=x,R=y)}}} Using the expansion for the joint probability function Pr ( G , S , R ) {\displaystyle \Pr(G,S,R)} and the conditional probabilities from the conditional probability tables (CPTs) stated in the diagram, one can evaluate each term in the sums in the numerator and denominator. For example, Pr ( G = T , S = T , R = T ) = Pr ( G = T ∣ S = T , R = T ) Pr ( S = T ∣ R = T ) Pr ( R = T ) = 0.99 × 0.01 × 0.2 = 0.00198. {\displaystyle {\begin{aligned}\Pr(G=T,S=T,R=T)&=\Pr(G=T\mid S=T,R=T)\Pr(S=T\mid R=T)\Pr(R=T)\\&=0.99\times 0.01\times 0.2\\&=0.00198.\end{aligned}}} Then the numerical results (subscripted by the associated variable values) are Pr ( R = T ∣ G = T ) = 0.00198 T T T + 0.1584 T F T 0.00198 T T T + 0.288 T T F + 0.1584 T F T + 0.0 T F F = 891 2491 ≈ 35.77 % . {\displaystyle \Pr(R=T\mid G=T)={\frac {0.00198_{TTT}+0.1584_{TFT}}{0.00198_{TTT}+0.288_{TTF}+0.1584_{TFT}+0.0_{TFF}}}={\frac {891}{2491}}\approx 35.77\%.} To answer an interventional question, such as "What is the probability that it would rain, given that we wet the grass?" the answer is governed by the post-intervention joint distribution function Pr ( S , R ∣ do ( G = T ) ) = Pr ( S ∣ R ) Pr ( R ) {\displaystyle \Pr(S,R\mid {\text{do}}(G=T))=\Pr(S\mid R)\Pr(R)} obtained by removing the factor Pr ( G ∣ S , R ) {\displaystyle \Pr(G\mid S,R)} from the pre-intervention distribution. The do operator forces the value of G to be true. The probability of rain is unaffected by the action: Pr ( R ∣ do ( G = T ) ) = Pr ( R ) . {\displaystyle \Pr(R\mid {\text{do}}(G=T))=\Pr(R).} To predict the impact of turning the sprinkler on: Pr ( R , G ∣ do ( S = T ) ) = Pr ( R ) Pr ( G ∣ R , S = T ) {\displaystyle \Pr(R,G\mid {\text{do}}(S=T))=\Pr(R)\Pr(G\mid R,S=T)} with the term Pr ( S = T ∣ R ) {\displaystyle \Pr(S=T\mid R)} removed, showing that the action affects the grass but not the rain. These predictions may not be feasible given unobserved variables, as in most policy evaluation problems. The effect of the action do ( x ) {\displaystyle {\text{do}}(x)} can still be predicted, however, whenever the back-door criterion is satisfied. It states that, if a set Z of nodes can be observed that d-separates (or blocks) all back-door paths from X to Y then Pr ( Y , Z ∣ do ( x ) ) = Pr ( Y , Z , X = x ) Pr ( X = x ∣ Z ) . {\displaystyle \Pr(Y,Z\mid {\text{do}}(x))={\frac {\Pr(Y,Z,X=x)}{\Pr(X=x\mid Z)}}.} A back-door path is one that ends with an arrow into X. Sets that satisfy the back-door criterion are called "sufficient" or "admissible." For example, the set Z = R is admissible for predicting the effect of S = T on G, because R d-separates the (only) back-door path S ← R → G. However, if S is not observed, no other set d-separates this path and the effect of turning the sprinkler on (S = T) on the grass (G) cannot be predicted from passive observations. In that case P(G | do(S = T)) is not "identified". This reflects the fact that, lacking interventional data, the observed dependence between S and G is due to a causal connection or is spurious (apparent dependence arising from a common cause, R). (see Simpson's paradox) To determine whether a causal relation is identified from an arbitrary Bayesian network with unobserved variables, one can use the three rules of "do-calculus" and test whether all do terms can be removed from the expression of that relation, thus confirming that the desired quantity is estimable from frequency data. Using a Bayesian network can save considerable amounts of memory over exhaustive probability tables, if the dependencies in the joint distribution are sparse. For example, a naive way of storing the conditional probabilities of 10 two-valued variables as a table requires storage space for 2 10 = 1024 {\displaystyle 2^{10}=1024} values. If no variable's local distribution depends on more than three parent variables, the Bayesian network representation stores at most 10 ⋅ 2 3 = 80 {\displaystyle 10\cdot 2^{3}=80} values. One advantage of Bayesian networks is that it is intuitively easier for a human to understand (a sparse set of) direct dependencies and local distributions than complete joint distributions. == Inference and learning == Bayesian networks perform three main inference tasks: Inferring unobserved variables Parameter learning for the probability distributions of each node in the network Structure learning of the graphical network === Inferring unobserved variables === Because a Bayesian network is a complete model for its variables and their relationships, it can be used to answer probabilistic queries about them. For example, the network can be used to update knowledge of the state of a subset of variables when other variables (the evidence variables) are observed. This process of computing the posterior distribution of variables given evidence is called probabilistic inference. The posterior gives a universal sufficient statistic for detection applications, when choosing values for the variable subset that minimize some expected loss function, for instance the probability of decision error. A Bayesian network can thus be considered a mechanism for automatically applying Bayes' theorem to complex problems. The most common exact inference methods are: variable elimination, which eliminates (by integration or summation) the non-observed non-query variables one by one by distributing the sum over the prod

    Read more →
  • Zardoz (computer security)

    Zardoz (computer security)

    In computer security, the Security-Digest list, better known as the Zardoz list, was a semi-private full disclosure mailing list run by Neil Gorsuch from 1989 through 1991. It identified weaknesses in systems and gave directions on where to find them. It was a perennial target for computer hackers, who sought archives of the list for information on undisclosed software vulnerabilities. == Membership restrictions == Access to Zardoz was approved on a case-by-case basis by Gorsuch, principally by reference to the user account used to send subscription requests; requests were approved for root users, valid UUCP owners, or system administrators listed at the NIC. The openness of the list to users other than Unix system administrators was a regular topic of conversation, with participants expressing concern that vulnerabilities and exploitation details disclosed on the list were liable to spread to hackers. The circulation of Zardoz postings was an open secret among computer hackers, and mocked in a Phrack parody of an IRC channel populated by security experts. == Notable participants == Keith Bostic discussed BSD Sendmail vulnerabilities Chip Salzenberg discussed Peter Honeyman's posting of a UUCP worm, and shell script security Gene Spafford discussed VMS and Ultrix bugs, and relayed law enforcement enquiries about the Morris Worm Tom Christiansen discussed SUID shell scripts Chris Torek discussed devising exploits from general descriptions of vulnerabilities Henry Spencer discussed Unix security Brendan Kehoe discussed systems security Alec Muffett announced Crack, the Unix password cracker The majority of Zardoz participants were Unix systems administrators and C software developers. Neil Gorsuch and Gene Spafford were the most prolific contributors to the list.

    Read more →
  • Distribution learning theory

    Distribution learning theory

    The distributional learning theory or learning of probability distribution is a framework in computational learning theory. It has been proposed from Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 and it was inspired from the PAC-framework introduced by Leslie Valiant. In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples have been drawn. Because of its generality, this framework has been used in a large variety of different fields like machine learning, approximation algorithms, applied probability and statistics. This article explains the basic definitions, tools and results in this framework from the theory of computation point of view. == Definitions == Let X {\displaystyle \textstyle X} be the support of the distributions of interest. As in the original work of Kearns et al. if X {\displaystyle \textstyle X} is finite it can be assumed without loss of generality that X = { 0 , 1 } n {\displaystyle \textstyle X=\{0,1\}^{n}} where n {\displaystyle \textstyle n} is the number of bits that have to be used in order to represent any y ∈ X {\displaystyle \textstyle y\in X} . We focus in probability distributions over X {\displaystyle \textstyle X} . There are two possible representations of a probability distribution D {\displaystyle \textstyle D} over X {\displaystyle \textstyle X} . probability distribution function (or evaluator) an evaluator E D {\displaystyle \textstyle E_{D}} for D {\displaystyle \textstyle D} takes as input any y ∈ X {\displaystyle \textstyle y\in X} and outputs a real number E D [ y ] {\displaystyle \textstyle E_{D}[y]} which denotes the probability that of y {\displaystyle \textstyle y} according to D {\displaystyle \textstyle D} , i.e. E D [ y ] = Pr [ Y = y ] {\displaystyle \textstyle E_{D}[y]=\Pr[Y=y]} if Y ∼ D {\displaystyle \textstyle Y\sim D} . generator a generator G D {\displaystyle \textstyle G_{D}} for D {\displaystyle \textstyle D} takes as input a string of truly random bits y {\displaystyle \textstyle y} and outputs G D [ y ] ∈ X {\displaystyle \textstyle G_{D}[y]\in X} according to the distribution D {\displaystyle \textstyle D} . Generator can be interpreted as a routine that simulates sampling from the distribution D {\displaystyle \textstyle D} given a sequence of fair coin tosses. A distribution D {\displaystyle \textstyle D} is called to have a polynomial generator (respectively evaluator) if its generator (respectively evaluator) exists and can be computed in polynomial time. Let C X {\displaystyle \textstyle C_{X}} a class of distribution over X, that is C X {\displaystyle \textstyle C_{X}} is a set such that every D ∈ C X {\displaystyle \textstyle D\in C_{X}} is a probability distribution with support X {\displaystyle \textstyle X} . The C X {\displaystyle \textstyle C_{X}} can also be written as C {\displaystyle \textstyle C} for simplicity. In order to evaluate learnability, it is necessary to have a way to measure how well an approximated distribution D ′ {\displaystyle \textstyle D'} fits the sampled distribution D {\displaystyle \textstyle D} . There are several ways to measure the divergence between two distributions. Three common possibilities are Kullback–Leibler divergence Total variation distance of probability measures Kolmogorov distance Total variation and Kolmogorov distance are true metrics, while KL divergence is not (it lacks symmetry). These measures are ordered by convergence strength: closeness in KL divergence implies closeness in total variation (via Pinsker's inequality), which in turn implies closeness in Kolmogorov distance. Therefore, a learnability result proven under KL divergence automatically holds under the weaker measures, but not vice versa. Since certain measures may be more appropriate in specific applications, we will use d ( D , D ′ ) {\displaystyle \textstyle d(D,D')} to denote a selected divergence between the distribution D {\displaystyle \textstyle D} and the distribution D ′ {\displaystyle \textstyle D'} . The basic input that we use in order to learn a distribution is a number of samples drawn by this distribution. For the computational point of view the assumption is that such a sample is given in a constant amount of time. So it's like having access to an oracle G E N ( D ) {\displaystyle \textstyle GEN(D)} that returns a sample from the distribution D {\displaystyle \textstyle D} . Sometimes the interest is, apart from measuring the time complexity, to measure the number of samples that have to be used in order to learn a specific distribution D {\displaystyle \textstyle D} in class of distributions C {\displaystyle \textstyle C} . This quantity is called sample complexity of the learning algorithm. In order for the problem of distribution learning to be more clear consider the problem of supervised learning as defined in. In this framework of statistical learning theory a training set S = { ( x 1 , y 1 ) , … , ( x n , y n ) } {\displaystyle \textstyle S=\{(x_{1},y_{1}),\dots ,(x_{n},y_{n})\}} and the goal is to find a target function f : X → Y {\displaystyle \textstyle f:X\rightarrow Y} that minimizes some loss function, e.g. the square loss function. More formally f = arg ⁡ min g ∫ V ( y , g ( x ) ) d ρ ( x , y ) {\displaystyle f=\arg \min _{g}\int V(y,g(x))d\rho (x,y)} , where V ( ⋅ , ⋅ ) {\displaystyle V(\cdot ,\cdot )} is the loss function, e.g. V ( y , z ) = ( y − z ) 2 {\displaystyle V(y,z)=(y-z)^{2}} and ρ ( x , y ) {\displaystyle \rho (x,y)} the probability distribution according to which the elements of the training set are sampled. If the conditional probability distribution ρ x ( y ) {\displaystyle \rho _{x}(y)} is known then the target function has the closed form f ( x ) = ∫ y y d ρ x ( y ) {\displaystyle f(x)=\int _{y}yd\rho _{x}(y)} . So the set S {\displaystyle S} is a set of samples from the probability distribution ρ ( x , y ) {\displaystyle \rho (x,y)} . Now the goal of distributional learning theory if to find ρ {\displaystyle \rho } given S {\displaystyle S} which can be used to find the target function f {\displaystyle f} . Definition of learnability A class of distributions C {\displaystyle \textstyle C} is called efficiently learnable if for every ϵ > 0 {\displaystyle \textstyle \epsilon >0} and 0 < δ ≤ 1 {\displaystyle \textstyle 0<\delta \leq 1} given access to G E N ( D ) {\displaystyle \textstyle GEN(D)} for an unknown distribution D ∈ C {\displaystyle \textstyle D\in C} , there exists a polynomial time algorithm A {\displaystyle \textstyle A} , called learning algorithm of C {\displaystyle \textstyle C} , that outputs a generator or an evaluator of a distribution D ′ {\displaystyle \textstyle D'} such that Pr [ d ( D , D ′ ) ≤ ϵ ] ≥ 1 − δ {\displaystyle \Pr[d(D,D')\leq \epsilon ]\geq 1-\delta } If we know that D ′ ∈ C {\displaystyle \textstyle D'\in C} then A {\displaystyle \textstyle A} is called proper learning algorithm, otherwise is called improper learning algorithm. In some settings the class of distributions C {\displaystyle \textstyle C} is a class with well known distributions which can be described by a set of parameters. For instance C {\displaystyle \textstyle C} could be the class of all the Gaussian distributions N ( μ , σ 2 ) {\displaystyle \textstyle N(\mu ,\sigma ^{2})} . In this case the algorithm A {\displaystyle \textstyle A} should be able to estimate the parameters μ , σ {\displaystyle \textstyle \mu ,\sigma } . In this case A {\displaystyle \textstyle A} is called parameter learning algorithm. Obviously the parameter learning for simple distributions is a very well studied field that is called statistical estimation and there is a very long bibliography on different estimators for different kinds of simple known distributions. But distributions learning theory deals with learning class of distributions that have more complicated description. == First results == In their seminal work, Kearns et al. deal with the case where A {\displaystyle \textstyle A} is described in term of a finite polynomial sized circuit and they proved the following for some specific classes of distribution. O R {\displaystyle \textstyle OR} gate distributions for this kind of distributions there is no polynomial-sized evaluator, unless # P ⊆ P / poly {\displaystyle \textstyle \#P\subseteq P/{\text{poly}}} . On the other hand, this class is efficiently learnable with generator. Parity gate distributions this class is efficiently learnable with both generator and evaluator. Mixtures of Hamming Balls this class is efficiently learnable with both generator and evaluator. Probabilistic Finite Automata this class is not efficiently learnable with evaluator under the Noisy Parity Assumption which is an impossibility assumption in the PAC learning fram

    Read more →
  • Induction of regular languages

    Induction of regular languages

    In computational learning theory, induction of regular languages refers to the task of learning a formal description (e.g. grammar) of a regular language from a given set of example strings. Although E. Mark Gold has shown that not every regular language can be learned this way (see language identification in the limit), approaches have been investigated for a variety of subclasses. They are sketched in this article. For learning of more general grammars, see Grammar induction. == Definitions == A regular language is defined as a (finite or infinite) set of strings that can be described by one of the mathematical formalisms called "finite automaton", "regular grammar", or "regular expression", all of which have the same expressive power. Since the latter formalism leads to shortest notations, it shall be introduced and used here. Given a set Σ of symbols (a.k.a. alphabet), a regular expression can be any of ∅ (denoting the empty set of strings), ε (denoting the singleton set containing just the empty string), a (where a is any character in Σ; denoting the singleton set just containing the single-character string a), r + s (where r and s are, in turn, simpler regular expressions; denoting their set's union) r ⋅ s (denoting the set of all possible concatenations of strings from r's and s's set), r + (denoting the set of n-fold repetitions of strings from r's set, for any n ≥ 1), or r (similarly denoting the set of n-fold repetitions, but also including the empty string, seen as 0-fold repetition). For example, using Σ = {0,1}, the regular expression (0+1+ε)⋅(0+1) denotes the set of all binary numbers with one or two digits (leading zero allowed), while 1⋅(0+1)⋅0 denotes the (infinite) set of all even binary numbers (no leading zeroes). Given a set of strings (also called "positive examples"), the task of regular language induction is to come up with a regular expression that denotes a set containing all of them. As an example, given {1, 10, 100}, a "natural" description could be the regular expression 1⋅0, corresponding to the informal characterization "a 1 followed by arbitrarily many (maybe even none) 0's". However, (0+1) and 1+(1⋅0)+(1⋅0⋅0) is another regular expression, denoting the largest (assuming Σ = {0,1}) and the smallest set containing the given strings, and called the trivial overgeneralization and undergeneralization, respectively. Some approaches work in an extended setting where also a set of "negative example" strings is given; then, a regular expression is to be found that generates all of the positive, but none of the negative examples. == Lattice of automata == Dupont et al. have shown that the set of all structurally complete finite automata generating a given input set of example strings forms a lattice, with the trivial undergeneralized and the trivial overgeneralized automaton as bottom and top element, respectively. Each member of this lattice can be obtained by factoring the undergeneralized automaton by an appropriate equivalence relation. For the above example string set {1, 10, 100}, the picture shows at its bottom the undergeneralized automaton Aa,b,c,d in grey, consisting of states a, b, c, and d. On the state set {a,b,c,d}, a total of 15 equivalence relations exist, forming a lattice. Mapping each equivalence E to the corresponding quotient automaton language L(Aa,b,c,d / E) obtains the partially ordered set shown in the picture. Each node's language is denoted by a regular expression. The language may be recognized by quotient automata w.r.t. different equivalence relations, all of which are shown below the node. An arrow between two nodes indicates that the lower node's language is a proper subset of the higher node's. If both positive and negative example strings are given, Dupont et al. build the lattice from the positive examples, and then investigate the separation border between automata that generate some negative example and such that do not. Most interesting are those automata immediately below the border. In the picture, separation borders are shown for the negative example strings 11 (green), 1001 (blue), 101 (cyan), and 0 (red). Coste and Nicolas present an own search method within the lattice, which they relate to Mitchell's version space paradigm. To find the separation border, they use a graph coloring algorithm on the state inequality relation induced by the negative examples. Later, they investigate several ordering relations on the set of all possible state fusions. Kudo and Shimbo use the representation by automaton factorizations to give a unique framework for the following approaches (sketched below): k-reversible languages and the "tail clustering" follow-up approach, Successor automata and the predecessor-successor method, and pumping-based approaches (framework-integration challenged by Luzeaux, however). Each of these approaches is shown to correspond to a particular kind of equivalence relations used for factorization. == Approaches == === k-reversible languages === Angluin considers so-called "k-reversible" regular automata, that is, deterministic automata in which each state can be reached from at most one state by following a transition chain of length k. Formally, if Σ, Q, and δ denote the input alphabet, the state set, and the transition function of an automaton A, respectively, then A is called k-reversible if: ∀a0, ..., ak ∈ Σ ∀s1, s2 ∈ Q: δ(s1, a0...ak) = δ(s2, a0...ak) ⇒ s1 = s2, where δ means the homomorphic extension of δ to arbitrary words. Angluin gives a cubic algorithm for learning of the smallest k-reversible language from a given set of input words; for k = 0, the algorithm has even almost linear complexity. The required state uniqueness after k + 1 given symbols forces unifying automaton states, thus leading to a proper generalization different from the trivial undergeneralized automaton. This algorithm has been used to learn simple parts of English syntax; later, an incremental version has been provided. Another approach based on k-reversible automata is the tail clustering method. === Successor automata === From a given set of input strings, Vernadat and Richetin build a so-called successor automaton, consisting of one state for each distinct character and a transition between each two adjacent characters' states. For example, the singleton input set {aabbaabb} leads to an automaton corresponding to the regular expression (a+⋅b+). An extension of this approach is the predecessor-successor method which generalizes each character repetition immediately to a Kleene + and then includes for each character the set of its possible predecessors in its state. Successor automata can learn exactly the class of local languages. Since each regular language is the homomorphic image of a local language, grammars from the former class can be learned by lifting, if an appropriate (depending on the intended application) homomorphism is provided. In particular, there is such a homomorphism for the class of languages learnable by the predecessor-successor method. The learnability of local languages can be reduced to that of k-reversible languages. === Early approaches === Chomsky and Miller (1957) used the pumping lemma: they guess a part v of an input string uvw and try to build a corresponding cycle into the automaton to be learned; using membership queries they ask, for appropriate k, which of the strings uw, uvvw, uvvvw, ..., uvkw also belongs to the language to be learned, thereby refining the structure of their automaton. In 1959, Solomonoff generalized this approach to context-free languages, which also obey a pumping lemma. === Cover automata === Câmpeanu et al. learn a finite automaton as a compact representation of a large finite language. Given such a language F, they search a so-called cover automaton A such that its language L(A) covers F in the following sense: L(A) ∩ Σ≤ l = F, where l is the length of the longest string in F, and Σ≤ l denotes the set of all strings not longer than l. If such a cover automaton exists, F is uniquely determined by A and l. For example, F = {ad, read, reread } has l = 6 and a cover automaton corresponding to the regular expression (r⋅e)⋅a⋅d. For two strings x and y, Câmpeanu et al. define x ~ y if xz ∈ F ⇔ yz ∈ F for all strings z of a length such that both xz and yz are not longer than l. Based on this relation, whose lack of transitivity causes considerable technical problems, they give an O(n4) algorithm to construct from F a cover automaton A of minimal state count. Moreover, for union, intersection, and difference of two finite languages they provide corresponding operations on their cover automata. Păun et al. improve the time complexity to O(n2). === Residual automata === For a set S of strings and a string u, the Brzozowski derivative u−1S is defined as the set of all rest-strings obtainable from a string in S by cutting off its prefix u (if possible), formally: u−1S = {v ∈ Σ: uv ∈ S}, cf. picture. Denis et al. define a

    Read more →