AI Chat On Google

AI Chat On Google — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Clips (software)

    Clips (software)

    Clips is a discontinued mobile video editing software application created by Apple Inc. It was released onto the iOS App Store on April 6, 2017, for free. Initially, it was only available on 64-bit devices running iOS 10.3 or later; as of version 3.1.3, it requires iOS 16.0 or later. Apple describes it as an app for "making and sharing fun videos with text, effects, graphics, and more.". Its final release was on May 9, 2024 before was removed from the App Store on October 10, 2025. == Features == After launching of the app, the user sees the view of the front-facing camera. The app allows the user to create a new clip by tapping on a red record button, or use photos or videos from the device's photo library. Once a clip is recorded, it can be added to a project timeline shown at the bottom of the screen. The user can share their project on social media platforms. The user can also add filters and effects to the project. "Live Titles" (available in several styles) can also be created by dictating to the device.

    Read more →
  • Self-organizing map

    Self-organizing map

    A self-organizing map (SOM) or self-organizing feature map (SOFM) is an unsupervised machine learning technique used to produce a low-dimensional (typically two-dimensional) representation of a higher-dimensional data set while preserving the topological structure of the data. For example, a data set with p {\displaystyle p} variables measured in n {\displaystyle n} observations could be represented as clusters of observations with similar values for the variables. These clusters then could be visualized as a two-dimensional "map" such that observations in proximal clusters have more similar values than observations in distal clusters. This can make high-dimensional data easier to visualize and analyze. A SOM is a type of artificial neural network but is trained using competitive learning rather than the error-correction learning (e.g., backpropagation with gradient descent) used by other artificial neural networks. The SOM was introduced by the Finnish professor Teuvo Kohonen in the 1980s and therefore is sometimes called a Kohonen map or Kohonen network. The Kohonen map or network is a computationally convenient abstraction building on biological models of neural systems from the 1970s and morphogenesis models dating back to Alan Turing in the 1950s. SOMs create internal representations reminiscent of the cortical homunculus, a distorted representation of the human body, based on a neurological "map" of the areas and proportions of the human brain dedicated to processing sensory functions, for different parts of the body. == Overview == Self-organizing maps, like most artificial neural networks, operate in two modes: training and mapping. First, training uses an input data set (the "input space") to generate a lower-dimensional representation of the input data (the "map space"). Second, mapping classifies additional input data using the generated map. The goal of training is to represent an input space with p dimensions as a map space with n dimensions, where p > n. Specifically, an input space with p variables is said to have p dimensions. A map space consists of components called "nodes" or "neurons", which are arranged as a hexagonal or rectangular grid with two dimensions. The number of nodes and their arrangement are specified beforehand based on the larger goals of the analysis and exploration of the data. Each node in the map space is associated with a "weight" vector, which is the position of the node in the input space. While nodes in the map space stay fixed, training consists in moving weight vectors toward the input data (reducing a distance metric such as Euclidean distance) without spoiling the topology induced from the map space. After training, the map can be used to classify additional observations for the input space by finding the node with the closest weight vector (smallest distance metric) to the input space vector. == Learning algorithm == The goal of learning in the self-organizing map is to cause different parts of the network to respond similarly to certain input patterns. This is partly motivated by how visual, auditory or other sensory information is handled in separate parts of the cerebral cortex in the human brain. The weights of the neurons are initialized either to small random values or sampled evenly from the subspace spanned by the two largest principal component eigenvectors. With the latter alternative, learning is much faster because the initial weights already give a good approximation of SOM weights. The network must be fed a large number of example vectors that represent, as close as possible, the kinds of vectors expected during mapping. The examples are usually administered several times as iterations. The training utilizes competitive learning. When a training example is fed to the network, its Euclidean distance to all weight vectors is computed. The neuron whose weight vector is most similar to the input is called the best matching unit (BMU). The weights of the BMU and neurons close to it in the SOM grid are adjusted towards the input vector. The magnitude of the change decreases with time and with the grid-distance from the BMU. The update formula for a neuron v with weight vector Wv(s) is W v ( s + 1 ) = W v ( s ) + θ ( u , v , s ) ⋅ α ( s ) ⋅ ( D ( t ) − W v ( s ) ) {\displaystyle W_{v}(s+1)=W_{v}(s)+\theta (u,v,s)\cdot \alpha (s)\cdot (D(t)-W_{v}(s))} , where s is the step index, t is an index into the training sample, u is the index of the BMU for the input vector D(t), α(s) is a monotonically decreasing learning coefficient; θ(u, v, s) is the neighborhood function which gives the distance between the neuron u and the neuron v in step s. Depending on the implementations, t can scan the training data set systematically (t is 0, 1, 2...T-1, then repeat, T being the training sample's size), be randomly drawn from the data set (bootstrap sampling), or implement some other sampling method (such as jackknifing). The neighborhood function θ(u, v, s) (also called function of lateral interaction) depends on the grid-distance between the BMU (neuron u) and neuron v. In the simplest form, it is 1 for all neurons close enough to BMU and 0 for others, but the Gaussian and Mexican-hat functions are common choices, too. Regardless of the functional form, the neighborhood function shrinks with time. At the beginning when the neighborhood is broad, the self-organizing takes place on the global scale. When the neighborhood has shrunk to just a couple of neurons, the weights are converging to local estimates. In some implementations, the learning coefficient α and the neighborhood function θ decrease steadily with increasing s, in others (in particular those where t scans the training data set) they decrease in step-wise fashion, once every T steps. This process is repeated for each input vector for a (usually large) number of cycles λ. The network winds up associating output nodes with groups or patterns in the input data set. If these patterns can be named, the names can be attached to the associated nodes in the trained net. During mapping, there will be one single winning neuron: the neuron whose weight vector lies closest to the input vector. This can be simply determined by calculating the Euclidean distance between input vector and weight vector. While representing input data as vectors has been emphasized in this article, any kind of object which can be represented digitally, which has an appropriate distance measure associated with it, and in which the necessary operations for training are possible can be used to construct a self-organizing map. This includes matrices, continuous functions or even other self-organizing maps. === Algorithm === Randomize the node weight vectors in a map For s = 0 , 1 , 2 , . . . , λ {\displaystyle s=0,1,2,...,\lambda } Randomly pick an input vector D ( t ) {\displaystyle {D}(t)} Find the node in the map closest to the input vector. This node is the best matching unit (BMU). Denote it by u {\displaystyle u} For each node v {\displaystyle v} , update its vector by pulling it closer to the input vector: W v ( s + 1 ) = W v ( s ) + θ ( u , v , s ) ⋅ α ( s ) ⋅ ( D ( t ) − W v ( s ) ) {\displaystyle W_{v}(s+1)=W_{v}(s)+\theta (u,v,s)\cdot \alpha (s)\cdot (D(t)-W_{v}(s))} The variable names mean the following, with vectors in bold, s {\displaystyle s} is the current iteration λ {\displaystyle \lambda } is the iteration limit t {\displaystyle t} is the index of the target input data vector in the input data set D {\displaystyle \mathbf {D} } D ( t ) {\displaystyle {D}(t)} is a target input data vector v {\displaystyle v} is the index of the node in the map W v {\displaystyle \mathbf {W} _{v}} is the current weight vector of node v {\displaystyle v} u {\displaystyle u} is the index of the best matching unit (BMU) in the map θ ( u , v , s ) {\displaystyle \theta (u,v,s)} is the neighbourhood function, α ( s ) {\displaystyle \alpha (s)} is the learning rate schedule. The key design choices are the shape of the SOM, the neighbourhood function, and the learning rate schedule. The idea of the neighborhood function is to make it such that the BMU is updated the most, its immediate neighbors are updated a little less, and so on. The idea of the learning rate schedule is to make it so that the map updates are large at the start, and gradually stop updating. For example, if we want to learn a SOM using a square grid, we can index it using ( i , j ) {\displaystyle (i,j)} where both i , j ∈ 1 : N {\displaystyle i,j\in 1:N} . The neighborhood function can make it so that the BMU updates in full, the nearest neighbors update in half, and their neighbors update in half again, etc. θ ( ( i , j ) , ( i ′ , j ′ ) , s ) = 1 2 | i − i ′ | + | j − j ′ | = { 1 if i = i ′ , j = j ′ 1 / 2 if | i − i ′ | + | j − j ′ | = 1 1 / 4 if | i − i ′ | + | j − j ′ | = 2 ⋯ ⋯ {\displaystyle \theta ((i,j),(i',j'),s)={\frac {1}{2^{|i-i'|+|j-j'|}}}={\begin{cases}1&{\text{if }}i=i',j=j'\\1/2&{\text{if

    Read more →
  • Influence diagram

    Influence diagram

    An influence diagram (ID) (also called a relevance diagram, decision diagram or a decision network) is a compact graphical and mathematical representation of a decision situation. It is a generalization of a Bayesian network, in which not only probabilistic inference problems but also decision making problems (following the maximum expected utility criterion) can be modeled and solved. ID was first developed in the mid-1970s by decision analysts with an intuitive semantic that is easy to understand. It is now adopted widely and becoming an alternative to the decision tree which typically suffers from exponential growth in number of branches with each variable modeled. ID is directly applicable in team decision analysis, since it allows incomplete sharing of information among team members to be modeled and solved explicitly. Extensions of ID also find their use in game theory as an alternative representation of the game tree. == Semantics == An ID is a directed acyclic graph with three types (plus one subtype) of node and three types of arc (or arrow) between nodes. Nodes: Decision node (corresponding to each decision to be made) is drawn as a rectangle. Uncertainty node (corresponding to each uncertainty to be modeled) is drawn as an oval. Deterministic node (corresponding to special kind of uncertainty that its outcome is deterministically known whenever the outcome of some other uncertainties are also known) is drawn as a double oval. Value node (corresponding to each component of additively separable Von Neumann-Morgenstern utility function) is drawn as an octagon (or diamond). Arcs: Functional arcs (ending in value node) indicate that one of the components of additively separable utility function is a function of all the nodes at their tails. Conditional arcs (ending in uncertainty node) indicate that the uncertainty at their heads is probabilistically conditioned on all the nodes at their tails. Conditional arcs (ending in deterministic node) indicate that the uncertainty at their heads is deterministically conditioned on all the nodes at their tails. Informational arcs (ending in decision node) indicate that the decision at their heads is made with the outcome of all the nodes at their tails known beforehand. Given a properly structured ID: Decision nodes and incoming information arcs collectively state the alternatives (what can be done when the outcome of certain decisions and/or uncertainties are known beforehand) Uncertainty/deterministic nodes and incoming conditional arcs collectively model the information (what are known and their probabilistic/deterministic relationships) Value nodes and incoming functional arcs collectively quantify the preference (how things are preferred over one another). Alternative, information, and preference are termed decision basis in decision analysis, they represent three required components of any valid decision situation. Formally, the semantic of influence diagram is based on sequential construction of nodes and arcs, which implies a specification of all conditional independencies in the diagram. The specification is defined by the d {\displaystyle d} -separation criterion of Bayesian network. According to this semantic, every node is probabilistically independent on its non-successor nodes given the outcome of its immediate predecessor nodes. Likewise, a missing arc between non-value node X {\displaystyle X} and non-value node Y {\displaystyle Y} implies that there exists a set of non-value nodes Z {\displaystyle Z} , e.g., the parents of Y {\displaystyle Y} , that renders Y {\displaystyle Y} independent of X {\displaystyle X} given the outcome of the nodes in Z {\displaystyle Z} . == Example == Consider the simple influence diagram representing a situation where a decision-maker is planning their vacation. There is 1 decision node (Vacation Activity), 2 uncertainty nodes (Weather Condition, Weather Forecast), and 1 value node (Satisfaction). There are 2 functional arcs (ending in Satisfaction), 1 conditional arc (ending in Weather Forecast), and 1 informational arc (ending in Vacation Activity). Functional arcs ending in Satisfaction indicate that Satisfaction is a utility function of Weather Condition and Vacation Activity. In other words, their satisfaction can be quantified if they know what the weather is like and what their choice of activity is. (Note that they do not value Weather Forecast directly) Conditional arc ending in Weather Forecast indicates their belief that Weather Forecast and Weather Condition can be dependent. Informational arc ending in Vacation Activity indicates that they will only know Weather Forecast, not Weather Condition, when making their choice. In other words, actual weather will be known after they make their choice, and only forecast is what they can count on at this stage. It also follows semantically, for example, that Vacation Activity is independent on (irrelevant to) Weather Condition given Weather Forecast is known. == Applicability to value of information == The above example highlights the power of the influence diagram in representing an extremely important concept in decision analysis known as the value of information. Consider the following three scenarios; Scenario 1: The decision-maker could make their Vacation Activity decision while knowing what Weather Condition will be like. This corresponds to adding extra informational arc from Weather Condition to Vacation Activity in the above influence diagram. Scenario 2: The original influence diagram as shown above. Scenario 3: The decision-maker makes their decision without even knowing the Weather Forecast. This corresponds to removing informational arc from Weather Forecast to Vacation Activity in the above influence diagram. Scenario 1 is the best possible scenario for this decision situation since there is no longer any uncertainty on what they care about (Weather Condition) when making their decision. Scenario 3, however, is the worst possible scenario for this decision situation since they need to make their decision without any hint (Weather Forecast) on what they care about (Weather Condition) will turn out to be. The decision-maker is usually better off (definitely no worse off, on average) to move from scenario 3 to scenario 2 through the acquisition of new information. The most they should be willing to pay for such move is called the value of information on Weather Forecast, which is essentially the value of imperfect information on Weather Condition. The applicability of this simple ID and the value of information concept is tremendous, especially in medical decision making when most decisions have to be made with imperfect information about their patients, diseases, etc. == Related concepts == Influence diagrams are hierarchical and can be defined either in terms of their structure or in greater detail in terms of the functional and numerical relation between diagram elements. An ID that is consistently defined at all levels—structure, function, and number—is a well-defined mathematical representation and is referred to as a well-formed influence diagram (WFID). WFIDs can be evaluated using reversal and removal operations to yield answers to a large class of probabilistic, inferential, and decision questions. More recent techniques have been developed by artificial intelligence researchers concerning Bayesian network inference (belief propagation). An influence diagram having only uncertainty nodes (i.e., a Bayesian network) is also called a relevance diagram. An arc connecting node A to B implies not only that "A is relevant to B", but also that "B is relevant to A" (i.e., relevance is a symmetric relationship).

    Read more →
  • Homogeneity blockmodeling

    Homogeneity blockmodeling

    In mathematics applied to analysis of social structures, homogeneity blockmodeling is an approach in blockmodeling, which is best suited for a preliminary or main approach to valued networks, when a prior knowledge about these networks is not available. This is because homogeneity blockmodeling emphasizes the similarity of link (tie) strengths within the blocks over the pattern of links. In this approach, tie (link) values (or statistical data computed on them) are assumed to be equal (homogenous) within blocks. This approach to the generalized blockmodeling of valued networks was first proposed by Aleš Žiberna in 2007 with the basic idea, "that the inconsistency of an empirical block with its ideal block can be measured by within block variability of appropriate values". The newly–formed ideal blocks, which are appropriate for blockmodeling of valued networks, are then presented together with the definitions of their block inconsistencies. Similar approach to the homogeneity blockmodeling, dealing with direct approach for structural equivalence, was previously suggested by Stephen P. Borgatti and Martin G. Everett (1992).

    Read more →
  • Local coordinates

    Local coordinates

    Local coordinates are the ones used in a local coordinate system or a local coordinate space. Simple examples: Houses. In order to work in a house construction, the measurements are referred to a control arbitrary point that will allow to check it: stick/sticks on the ground, steel bar, nails... Addresses. Using house numbers to locate a house on a street; the street is a local coordinate system within a larger system composed of city townships, states, countries, postal codes, etc. Local systems exist for convenience. On ancient times, every work was made on relative bases as there was no conception of global systems. Practically, it is better to use local systems for small works as houses, buildings... For most of the applications, it is desired the position of one element relative to one building or location, and in a more local way, relative to one furniture or person. In a regular way, you will not give your position by geographical coordinates rather than "I am 15 meters away of the entry to the building". So it is a pretty common way to locate things. It is possible to bring latitude and longitude for all terrestrial locations, but unless one has a highly precise GPS device or you make astronomical observations, this is impractical. It is much simpler to use a tape, a rope, a chain... The position information (global) should be transformed into a location. Position refers to a numeric or symbolic description within a spatial reference system, whereas location refers to information about surrounding objects and their interrelationships. (Topological space) == Use == In computer graphics and computer animation, local coordinate spaces are also useful for their ability to model independently transformable aspects of geometrical scene graphs. When modeling a car, for example, it is desirable to describe the center of each wheel with respect to the car's coordinate system, but then specify the shape of each wheel in separate local spaces centered about these points. This way, the information describing each wheel can be simply duplicated four times, and independent transformations (e.g., steering rotation) can be similarly effected. Bounding volumes of objects may be described more accurately using extents in the local coordinates, (i.e. an object oriented bounding box, contrasted with the simpler axis aligned bounding box). The trade-off for this flexibility is additional computational cost: the rendering system must access the higher-level coordinate system of the car and combine it with the space of each wheel in order to draw everything in its proper place. Local coordinates also afford digital designers a means around the finite limits of numerical representation. The tread marks on a tire, for example, can be described using millimeters by allowing the whole tire to occupy the entire range of numeric precision available. The larger aspects of the car, such as its frame, might be described in centimeters, and the terrain that the car travels on could be specified in meters. In differential topology, local coordinates on a manifold are defined by means of an atlas of charts. The basic idea behind coordinate charts is that each small patch of a manifold can be endowed with a set of local coordinates. These are collected together into an atlas, and stitched together in such a way that they are self-consistent on the manifold. In Cartography and Maps, the traditional way of works are local datum. With a local datum the land can be mapped on relative small areas as a country. With the need of global systems, the transformations on between datum became a problem, so geodetic datum have been created. More than 150 local datum have been used in the world.

    Read more →
  • Generalized blockmodeling

    Generalized blockmodeling

    In generalized blockmodeling, the blockmodeling is done by "the translation of an equivalence type into a set of permitted block types", which differs from the conventional blockmodeling, which is using the indirect approach. It's a special instance of the direct blockmodeling approach. Generalized blockmodeling was introduced in 1994 by Patrick Doreian, Vladimir Batagelj and Anuška Ferligoj. == Definition == Generalized blockmodeling approach is a direct one, "where the optimal partition(s) is (are) identified based on minimal values of a compatible criterion function defined by the difference between empirical blocks and corresponding ideal blocks". At the same time, the much broader set of block types is introduced (while in conventional blockmodeling only certain types are used). The conventional blockmodeling is inductive due to nonspecification of neither the clusters or the location of block types, while in generalized blockmodeling the blockmodel is specified with more detail than just the permition of certain block types (e.g., prespecification). Further, it's possible to define departures from the permitted (ideal) blocktype, using criterion function. Using local optimization procedure, firstly the initial clustering (with specified number of clusters is done, based on random creation. How the clusters are neighboring to each other, is based on two transformations: 1) a vertex is moved from one to another cluster or 2) a pair of vertices is interchanged between two different clusters. This process of transformation steps is repeated many times, until only the best fitting partitions (with the minimized value of the criterion function) are kept as blockmodels for the future exploration of the network. Different types of generalized blockmodeling are: generalized binary blockmodeling, generalized valued blockmodeling and generalized homogeneity blockmodeling. == Benefits == According to Patrick Doreian, the benefits of generalized blockmodeling, are as follows: usage of explicit criterion function, compatible with a given type of equivalence, results to in-built measure of fit, which is integral to the establishment of the blockmodels (in conventional blockmodeling, there is no compelling and coherent measures of fit); partitions, based on generalized blockmodeling, regularly outperform and never perform less well than the partitions, based on conventional approach; with generalized blockmodeling it's possible to specify new types of blockmodels; this potentially unlimited set of new block types also results in permittion of inclusion of substantively driven blockmodels; in generalized blockmodeling, the specification of the block types and the location of some of them in the blockmodel is possible; researcher can speficy which (pair of) vertices must be (not) clustered together; this approach also allows the imposition of penalties, resulting into identification of empirical null blocks without inconsistencies with a corresponding ideal null block. == Problems == According to Doreian, the problems of generalized blockmodeling, are as follows: unknown sensitivity to particular data features, examination of boundary problems, computationally burdensome, which results in a constraint regarding practical network size (generalized blockmodeling is thus primarily used to analyse smaller networks (below 100 units)), identifying structure from incomplete network information, most of generalized blockmodeling is based on binary networks, but there is also development in the field of valued networks, criterion function is minimized for a specified blockmodel, with results in issues of evaluating statistically, based on the structural data alone, problems regarding three dimensional network data, problems regarding the evolution of fundamental network structure. == Book == The book with the same title, Generalized blockmodeling, written by Patrick Doreian, Vladimir Batagelj and Anuška Ferligoj, was in 2007 awarded the Harrison White Outstanding Book Award by the Mathematical Sociology Section of American Sociological Association.

    Read more →
  • PVLV

    PVLV

    The primary value learned value (PVLV) model is a possible explanation for the reward-predictive firing properties of dopamine (DA) neurons. It simulates behavioral and neural data on Pavlovian conditioning and the midbrain dopaminergic neurons that fire in proportion to unexpected rewards. It is an alternative to the temporal-differences (TD) algorithm. It is used as part of Leabra.

    Read more →
  • Induction of regular languages

    Induction of regular languages

    In computational learning theory, induction of regular languages refers to the task of learning a formal description (e.g. grammar) of a regular language from a given set of example strings. Although E. Mark Gold has shown that not every regular language can be learned this way (see language identification in the limit), approaches have been investigated for a variety of subclasses. They are sketched in this article. For learning of more general grammars, see Grammar induction. == Definitions == A regular language is defined as a (finite or infinite) set of strings that can be described by one of the mathematical formalisms called "finite automaton", "regular grammar", or "regular expression", all of which have the same expressive power. Since the latter formalism leads to shortest notations, it shall be introduced and used here. Given a set Σ of symbols (a.k.a. alphabet), a regular expression can be any of ∅ (denoting the empty set of strings), ε (denoting the singleton set containing just the empty string), a (where a is any character in Σ; denoting the singleton set just containing the single-character string a), r + s (where r and s are, in turn, simpler regular expressions; denoting their set's union) r ⋅ s (denoting the set of all possible concatenations of strings from r's and s's set), r + (denoting the set of n-fold repetitions of strings from r's set, for any n ≥ 1), or r (similarly denoting the set of n-fold repetitions, but also including the empty string, seen as 0-fold repetition). For example, using Σ = {0,1}, the regular expression (0+1+ε)⋅(0+1) denotes the set of all binary numbers with one or two digits (leading zero allowed), while 1⋅(0+1)⋅0 denotes the (infinite) set of all even binary numbers (no leading zeroes). Given a set of strings (also called "positive examples"), the task of regular language induction is to come up with a regular expression that denotes a set containing all of them. As an example, given {1, 10, 100}, a "natural" description could be the regular expression 1⋅0, corresponding to the informal characterization "a 1 followed by arbitrarily many (maybe even none) 0's". However, (0+1) and 1+(1⋅0)+(1⋅0⋅0) is another regular expression, denoting the largest (assuming Σ = {0,1}) and the smallest set containing the given strings, and called the trivial overgeneralization and undergeneralization, respectively. Some approaches work in an extended setting where also a set of "negative example" strings is given; then, a regular expression is to be found that generates all of the positive, but none of the negative examples. == Lattice of automata == Dupont et al. have shown that the set of all structurally complete finite automata generating a given input set of example strings forms a lattice, with the trivial undergeneralized and the trivial overgeneralized automaton as bottom and top element, respectively. Each member of this lattice can be obtained by factoring the undergeneralized automaton by an appropriate equivalence relation. For the above example string set {1, 10, 100}, the picture shows at its bottom the undergeneralized automaton Aa,b,c,d in grey, consisting of states a, b, c, and d. On the state set {a,b,c,d}, a total of 15 equivalence relations exist, forming a lattice. Mapping each equivalence E to the corresponding quotient automaton language L(Aa,b,c,d / E) obtains the partially ordered set shown in the picture. Each node's language is denoted by a regular expression. The language may be recognized by quotient automata w.r.t. different equivalence relations, all of which are shown below the node. An arrow between two nodes indicates that the lower node's language is a proper subset of the higher node's. If both positive and negative example strings are given, Dupont et al. build the lattice from the positive examples, and then investigate the separation border between automata that generate some negative example and such that do not. Most interesting are those automata immediately below the border. In the picture, separation borders are shown for the negative example strings 11 (green), 1001 (blue), 101 (cyan), and 0 (red). Coste and Nicolas present an own search method within the lattice, which they relate to Mitchell's version space paradigm. To find the separation border, they use a graph coloring algorithm on the state inequality relation induced by the negative examples. Later, they investigate several ordering relations on the set of all possible state fusions. Kudo and Shimbo use the representation by automaton factorizations to give a unique framework for the following approaches (sketched below): k-reversible languages and the "tail clustering" follow-up approach, Successor automata and the predecessor-successor method, and pumping-based approaches (framework-integration challenged by Luzeaux, however). Each of these approaches is shown to correspond to a particular kind of equivalence relations used for factorization. == Approaches == === k-reversible languages === Angluin considers so-called "k-reversible" regular automata, that is, deterministic automata in which each state can be reached from at most one state by following a transition chain of length k. Formally, if Σ, Q, and δ denote the input alphabet, the state set, and the transition function of an automaton A, respectively, then A is called k-reversible if: ∀a0, ..., ak ∈ Σ ∀s1, s2 ∈ Q: δ(s1, a0...ak) = δ(s2, a0...ak) ⇒ s1 = s2, where δ means the homomorphic extension of δ to arbitrary words. Angluin gives a cubic algorithm for learning of the smallest k-reversible language from a given set of input words; for k = 0, the algorithm has even almost linear complexity. The required state uniqueness after k + 1 given symbols forces unifying automaton states, thus leading to a proper generalization different from the trivial undergeneralized automaton. This algorithm has been used to learn simple parts of English syntax; later, an incremental version has been provided. Another approach based on k-reversible automata is the tail clustering method. === Successor automata === From a given set of input strings, Vernadat and Richetin build a so-called successor automaton, consisting of one state for each distinct character and a transition between each two adjacent characters' states. For example, the singleton input set {aabbaabb} leads to an automaton corresponding to the regular expression (a+⋅b+). An extension of this approach is the predecessor-successor method which generalizes each character repetition immediately to a Kleene + and then includes for each character the set of its possible predecessors in its state. Successor automata can learn exactly the class of local languages. Since each regular language is the homomorphic image of a local language, grammars from the former class can be learned by lifting, if an appropriate (depending on the intended application) homomorphism is provided. In particular, there is such a homomorphism for the class of languages learnable by the predecessor-successor method. The learnability of local languages can be reduced to that of k-reversible languages. === Early approaches === Chomsky and Miller (1957) used the pumping lemma: they guess a part v of an input string uvw and try to build a corresponding cycle into the automaton to be learned; using membership queries they ask, for appropriate k, which of the strings uw, uvvw, uvvvw, ..., uvkw also belongs to the language to be learned, thereby refining the structure of their automaton. In 1959, Solomonoff generalized this approach to context-free languages, which also obey a pumping lemma. === Cover automata === Câmpeanu et al. learn a finite automaton as a compact representation of a large finite language. Given such a language F, they search a so-called cover automaton A such that its language L(A) covers F in the following sense: L(A) ∩ Σ≤ l = F, where l is the length of the longest string in F, and Σ≤ l denotes the set of all strings not longer than l. If such a cover automaton exists, F is uniquely determined by A and l. For example, F = {ad, read, reread } has l = 6 and a cover automaton corresponding to the regular expression (r⋅e)⋅a⋅d. For two strings x and y, Câmpeanu et al. define x ~ y if xz ∈ F ⇔ yz ∈ F for all strings z of a length such that both xz and yz are not longer than l. Based on this relation, whose lack of transitivity causes considerable technical problems, they give an O(n4) algorithm to construct from F a cover automaton A of minimal state count. Moreover, for union, intersection, and difference of two finite languages they provide corresponding operations on their cover automata. Păun et al. improve the time complexity to O(n2). === Residual automata === For a set S of strings and a string u, the Brzozowski derivative u−1S is defined as the set of all rest-strings obtainable from a string in S by cutting off its prefix u (if possible), formally: u−1S = {v ∈ Σ: uv ∈ S}, cf. picture. Denis et al. define a

    Read more →
  • U-Net

    U-Net

    U-Net is a convolutional neural network that was developed for image segmentation. The network is based on a fully convolutional neural network whose architecture was modified and extended to work with fewer training images and to yield more precise segmentation. Segmentation of a 512 × 512 image takes less than a second on a modern (2015) GPU using the U-Net architecture. The U-Net architecture has also been employed in diffusion models for iterative image denoising. This technology underlies many modern image generation models, such as DALL-E, Midjourney, and Stable Diffusion. U-Net is also being explored for language models. Tokenization is not a separate step, allowing the model to more easily understand spelling and concurrently vectorizing / tokenizing higher level concepts. == Description == The U-Net architecture stems from the so-called "fully convolutional network". The main idea is to supplement a usual contracting network by successive layers, where pooling operations are replaced by upsampling operators. Hence these layers increase the resolution of the output. A successive convolutional layer can then learn to assemble a precise output based on this information. One important modification in U-Net is that there are a large number of feature channels in the upsampling part, which allow the network to propagate context information to higher resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting part, and yields a u-shaped architecture. The network only uses the valid part of each convolution without any fully connected layers. To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. This tiling strategy is important to apply the network to large images, since otherwise the resolution would be limited by the GPU memory. Recently, there had also been an interest in receptive field based U-Net models for medical image segmentation. == Network architecture == The network consists of a contracting path and an expansive path, which gives it the u-shaped architecture. The contracting path is a typical convolutional network that consists of repeated application of convolutions, each followed by a rectified linear unit (ReLU) and a max pooling operation. During the contraction, the spatial information is reduced while feature information is increased. The expansive pathway combines the feature and spatial information through a sequence of up-convolutions and concatenations with high-resolution features from the contracting path. == Applications == There are many applications of U-Net in biomedical image segmentation, such as brain image segmentation (''BRATS'') and liver image segmentation ("siliver07") as well as protein binding site prediction. U-Net implementations have also found use in the physical sciences, for example in the analysis of micrographs of materials. Variations of the U-Net have also been applied for medical image reconstruction. Here are some variants and applications of U-Net as follows: Pixel-wise regression using U-Net and its application on pansharpening; 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation; TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation. Image-to-image translation to estimate fluorescent stains In binding site prediction of protein structure. == History == U-Net was created by Olaf Ronneberger, Philipp Fischer, Thomas Brox in 2015 and reported in the paper "U-Net: Convolutional Networks for Biomedical Image Segmentation". It is an improvement and development of FCN: Evan Shelhamer, Jonathan Long, Trevor Darrell (2014). "Fully convolutional networks for semantic segmentation".

    Read more →
  • Characteristic samples

    Characteristic samples

    Characteristic samples is a concept in the field of grammatical inference, related to passive learning. In passive learning, an inference algorithm I {\displaystyle I} is given a set of pairs of strings and labels S {\displaystyle S} , and returns a representation R {\displaystyle R} that is consistent with S {\displaystyle S} . Characteristic samples consider the scenario when the goal is not only finding a representation consistent with S {\displaystyle S} , but finding a representation that recognizes a specific target language. A characteristic sample of language L {\displaystyle L} is a set of pairs of the form ( s , l ( s ) ) {\displaystyle (s,l(s))} where: l ( s ) = 1 {\displaystyle l(s)=1} if and only if s ∈ L {\displaystyle s\in L} l ( s ) = − 1 {\displaystyle l(s)=-1} if and only if s ∉ L {\displaystyle s\notin L} Given the characteristic sample S {\displaystyle S} , I {\displaystyle I} 's output on it is a representation R {\displaystyle R} , e.g. an automaton, that recognizes L {\displaystyle L} . == Formal Definition == === The Learning Paradigm associated with Characteristic Samples === There are three entities in the learning paradigm connected to characteristic samples, the adversary, the teacher and the inference algorithm. Given a class of languages C {\displaystyle \mathbb {C} } and a class of representations for the languages R {\displaystyle \mathbb {R} } , the paradigm goes as follows: The adversary A {\displaystyle A} selects a language L ∈ C {\displaystyle L\in \mathbb {C} } and reports it to the teacher The teacher T {\displaystyle T} then computes a set of strings and label them correctly according to L {\displaystyle L} , trying to make sure that the inference algorithm will compute L {\displaystyle L} The adversary can add correctly labeled words to the set in order to confuse the inference algorithm The inference algorithm I {\displaystyle I} gets the sample and computes a representation R ∈ R {\displaystyle R\in \mathbb {R} } consistent with the sample. The goal is that when the inference algorithm receives a characteristic sample for a language L {\displaystyle L} , or a sample that subsumes a characteristic sample for L {\displaystyle L} , it will return a representation that recognizes exactly the language L {\displaystyle L} . === Sample === Sample S {\displaystyle S} is a set of pairs of the form ( s , l ( s ) ) {\displaystyle (s,l(s))} such that l ( s ) ∈ { − 1 , 1 } {\displaystyle l(s)\in \{-1,1\}} ==== Sample consistent with a language ==== We say that a sample S {\displaystyle S} is consistent with language L {\displaystyle L} if for every pair ( s , l ( s ) ) {\displaystyle (s,l(s))} in S {\displaystyle S} : l ( s ) = 1 if and only if s ∈ L {\displaystyle l(s)=1{\text{ if and only if }}s\in L} l ( s ) = − 1 if and only if s ∉ L {\displaystyle l(s)=-1{\text{ if and only if }}s\notin L} === Characteristic sample === Given an inference algorithm I {\displaystyle I} and a language L {\displaystyle L} , a sample S {\displaystyle S} that is consistent with L {\displaystyle L} is called a characteristic sample of L {\displaystyle L} for I {\displaystyle I} if: I {\displaystyle I} 's output on S {\displaystyle S} is a representation R {\displaystyle R} that recognizes L {\displaystyle L} . For every sample D {\displaystyle D} that is consistent with L {\displaystyle L} and also fulfils S ⊆ D {\displaystyle S\subseteq D} , I {\displaystyle I} 's output on D {\displaystyle D} is a representation R {\displaystyle R} that recognizes L {\displaystyle L} . A Class of languages C {\displaystyle \mathbb {C} } is said to have charistaristic samples if every L ∈ C {\displaystyle L\in \mathbb {C} } has a characteristic sample. == Related Theorems == === Theorem === If equivalence is undecidable for a class C {\textstyle \mathbb {C} } over Σ {\textstyle \Sigma } of cardinality bigger than 1, then C {\textstyle \mathbb {C} } doesn't have characteristic samples. ==== Proof ==== Given a class of representations C {\textstyle \mathbb {C} } such that equivalence is undecidable, for every polynomial p ( x ) {\displaystyle p(x)} and every n ∈ N {\displaystyle n\in \mathbb {N} } , there exist two representations r 1 {\displaystyle r_{1}} and r 2 {\displaystyle r_{2}} of sizes bounded by n {\displaystyle n} , that recognize different languages but are inseparable by any string of size bounded by p ( n ) {\displaystyle p(n)} . Assuming this is not the case, we can decide if r 1 {\displaystyle r_{1}} and r 2 {\displaystyle r_{2}} are equivalent by simulating their run on all strings of size smaller than p ( n ) {\displaystyle p(n)} , contradicting the assumption that equivalence is undecidable. === Theorem === If S 1 {\displaystyle S_{1}} is a characteristic sample for L 1 {\displaystyle L_{1}} and is also consistent with L 2 {\displaystyle L_{2}} , then every characteristic sample of L 2 {\displaystyle L_{2}} , is inconsistent with L 1 {\displaystyle L_{1}} . ==== Proof ==== Given a class C {\textstyle \mathbb {C} } that has characteristic samples, let R 1 {\displaystyle R_{1}} and R 2 {\displaystyle R_{2}} be representations that recognize L 1 {\displaystyle L_{1}} and L 2 {\displaystyle L_{2}} respectively. Under the assumption that there is a characteristic sample for L 1 {\displaystyle L_{1}} , S 1 {\displaystyle S_{1}} that is also consistent with L 2 {\displaystyle L_{2}} , we'll assume falsely that there exist a characteristic sample for L 2 {\displaystyle L_{2}} , S 2 {\displaystyle S_{2}} that is consistent with L 1 {\displaystyle L_{1}} . By the definition of characteristic sample, the inference algorithm I {\displaystyle I} must return a representation which recognizes the language if given a sample that subsumes the characteristic sample itself. But for the sample S 1 ∪ S 2 {\displaystyle S_{1}\cup S_{2}} , the answer of the inferring algorithm needs to recognize both L 1 {\displaystyle L_{1}} and L 2 {\displaystyle L_{2}} , in contradiction. === Theorem === If a class is polynomially learnable by example based queries, it is learnable with characteristic samples. == Polynomialy characterizable classes == === Regular languages === The proof that DFA's are learnable using characteristic samples, relies on the fact that every regular language has a finite number of equivalence classes with respect to the right congruence relation, ∼ L {\displaystyle \sim _{L}} (where x ∼ L y {\displaystyle x\sim _{L}y} for x , y ∈ Σ ∗ {\displaystyle x,y\in \Sigma ^{}} if and only if ∀ z ∈ Σ ∗ : x z ∈ L ↔ y z ∈ L {\displaystyle \forall z\in \Sigma ^{}:xz\in L\leftrightarrow yz\in L} ). Note that if x {\displaystyle x} , y {\displaystyle y} are not congruent with respect to ∼ L {\displaystyle \sim _{L}} , there exists a string z {\displaystyle z} such that x z ∈ L {\displaystyle xz\in L} but y z ∉ L {\displaystyle yz\notin L} or vice versa, this string is called a separating suffix. ==== Constructing a characteristic sample ==== The construction of a characteristic sample for a language L {\displaystyle L} by the teacher goes as follows. Firstly, by running a depth first search on a deterministic automaton A {\displaystyle A} recognizing L {\displaystyle L} , starting from its initial state, we get a suffix closed set of words, W {\displaystyle W} , ordered in shortlex order. From the fact above, we know that for every two states in the automaton, there exists a separating suffix that separates between every two strings that the run of A {\displaystyle A} on them ends in the respective states. We refer to the set of separating suffixes as S {\displaystyle S} . The labeled set (sample) of words the teacher gives the adversary is { ( w , l ( w ) ) | w ∈ W ⋅ S ∪ W ⋅ Σ ⋅ S } {\displaystyle \{(w,l(w))|w\in W\cdot S\cup W\cdot \Sigma \cdot S\}} where l ( w ) {\displaystyle l(w)} is the correct label of w {\displaystyle w} (whether it is in L {\displaystyle L} or not). We may assume that ϵ ∈ S {\displaystyle \epsilon \in S} . ==== Constructing a deterministic automata ==== Given the sample from the adversary W {\displaystyle W} , the construction of the automaton by the inference algorithm I {\displaystyle I} starts with defining P = prefix ( W ) {\displaystyle P={\text{prefix}}(W)} and S = suffix ( W ) {\displaystyle S={\text{suffix}}(W)} , which are the set of prefixes and suffixes of W {\displaystyle W} respectively. Now the algorithm constructs a matrix M {\displaystyle M} where the elements of P {\displaystyle P} function as the rows, ordered by the shortlex order, and the elements of S {\displaystyle S} function as the columns, ordered by the shortlex order. Next, the cells in the matrix are filled in the following manner for prefix p i {\displaystyle p_{i}} and suffix s j {\displaystyle s_{j}} : If p i s j ∈ W → M i j = l ( p i s j ) {\displaystyle p_{i}s_{j}\in W\rightarrow M_{ij}=l(p_{i}s_{j})} else, M i j = 0 {\displaystyle M_{ij}=0} Now, we say row i {\displaystyle i} and t {\displaystyle t} are distinguishable if there exi

    Read more →
  • Constrained clustering

    Constrained clustering

    In computer science, constrained clustering is a class of semi-supervised learning algorithms. Typically, constrained clustering incorporates either a set of must-link constraints, cannot-link constraints, or both, with a data clustering algorithm. A cluster in which the members conform to all must-link and cannot-link constraints is called a chunklet. == Types of constraints == Both a must-link and a cannot-link constraint define a relationship between two data instances. Together, the sets of these constraints act as a guide for which a constrained clustering algorithm will attempt to find chunklets (clusters in the dataset which satisfy the specified constraints). A must-link constraint is used to specify that the two instances in the must-link relation should be associated with the same cluster. A cannot-link constraint is used to specify that the two instances in the cannot-link relation should not be associated with the same cluster. Some constrained clustering algorithms will abort if no such clustering exists which satisfies the specified constraints. Others will try to minimize the amount of constraint violation should it be impossible to find a clustering which satisfies the constraints. Constraints could also be used to guide the selection of a clustering model among several possible solutions. == Examples == Examples of constrained clustering algorithms include: COP K-means PCKmeans (Pairwise Constrained K-means) CMWK-Means (Constrained Minkowski Weighted K-Means)

    Read more →
  • Soft independent modelling of class analogies

    Soft independent modelling of class analogies

    Soft independent modelling by class analogy (SIMCA) is a statistical method for supervised classification of data. The method requires a training data set consisting of samples (or objects) with a set of attributes and their class membership. The term soft refers to the fact the classifier can identify samples as belonging to multiple classes and not necessarily producing a classification of samples into non-overlapping classes. == Method == In order to build the classification models, the samples belonging to each class need to be analysed using principal component analysis (PCA); only the significant components are retained. For a given class, the resulting model then describes either a line (for one Principal Component or PC), plane (for two PCs) or hyper-plane (for more than two PCs). For each modelled class, the mean orthogonal distance of training data samples from the line, plane, or hyper-plane (calculated as the residual standard deviation) is used to determine a critical distance for classification. This critical distance is based on the F-distribution and is usually calculated using 95% or 99% confidence intervals. New observations are projected into each PC model and the residual distances calculated. An observation is assigned to the model class when its residual distance from the model is below the statistical limit for the class. The observation may be found to belong to multiple classes and a measure of goodness of the model can be found from the number of cases where the observations are classified into multiple classes. The classification efficiency is usually indicated by Receiver operating characteristics. In the original SIMCA method, the ends of the hyper-plane of each class are closed off by setting statistical control limits along the retained principal components axes (i.e., score value between plus and minus 0.5 times score standard deviation). More recent adaptations of the SIMCA method close off the hyper-plane by construction of ellipsoids (e.g. Hotelling's T2 or Mahalanobis distance). With such modified SIMCA methods, classification of an object requires both that its orthogonal distance from the model and its projection within the model (i.e. score value within the region defined by the ellipsoid) are not significant. == Application == SIMCA as a method of classification has gained widespread use especially in applied statistical fields such as chemometrics and spectroscopic data analysis.

    Read more →
  • INDECT

    INDECT

    INDECT is a research project in the area of intelligent security systems performed by several European universities since 2009 and funded by the European Union. The purpose of the project is to involve European scientists and researchers in the development of solutions to and tools for automatic threat detection through e.g. processing of CCTV camera data streams, standardization of video sequence quality for user applications, threat detection in computer networks as well as data and privacy protection. The area of research, applied methods, and techniques are described in the public deliverables which are available to the public on the project's website. Practically, all information related to the research is public. Only documents that comprise information related to financial data or information that could negatively influence the competitiveness and law enforcement capabilities of parties involved in the project are not published. This follows regulations and practices applied in EU research projects. == Application and target users == The main end-user of INDECT solutions are police forces and security services. The principle of operation of the project is detecting threats and identifying sources of threats, without monitoring and searching for particular citizens or groups of citizens. Then, the system operator (i.e. police officer) decides whether an intervention of services responsible for public security are required or not. Further investigation eventually leading to persons related to threats is performed, preserving the presumption of innocence, based on existing procedures already used by police services and prosecutors. As it can be found in the project deliverables, INDECT does not involve storage of personal data (such as names, addresses, identity document numbers, etc.). A similar, behavior-based surveillance program was SAMURAI (Suspicious and Abnormal behavior Monitoring Using a netwoRk of cAmeras & sensors for sItuation awareness enhancement). == Expected results == The main expected results of the INDECT project are: Trial of intelligent analysis of video and audio data for threat detection in urban environments Creation of tools and technology for privacy and data protection during storage and transmission of information using quantum cryptography and new methods of digital watermarking Performing computer-aided detection of threats and targeted crimes in Internet resources with privacy-protecting solutions Construction of a search engine for rapid semantic search based on watermarking of content related to child pornography and human organ trafficking Implementation of a distributed computer system that is capable of effective intelligent processing == Controversy == Some media and other sources accuse INDECT of privacy abuse, collecting personal data, and keeping information from the public. Consequently, these issues have been commented and discussed by some Members of the European Parliament. As seen in the project's documentation, INDECT does not involve mobile phone tracking or call interception. The rumors about testing INDECT during 2012 UEFA European Football Championship also turned out to be false. The mid-term review of the Seventh Framework Programme to the European Parliament strongly urges the European Commission to immediately make all documents available and to define a clear and strict mandate for the research goal, the application, and the end users of INDECT, and stresses a thorough investigation of the possible impact on fundamental rights. Nevertheless, according to Mr. Paweł Kowal, MEP, the project had the ethical review on 15 March 2011 in Brussels with the participation of ethics experts from Austria, France, Netherlands, Germany and Great Britain.

    Read more →
  • Ordination (statistics)

    Ordination (statistics)

    Ordination or gradient analysis, in multivariate analysis, is a method complementary to data clustering, and used mainly in exploratory data analysis (rather than in hypothesis testing). In contrast to cluster analysis, ordination orders quantities in a (usually lower-dimensional) latent space. In the ordination space, quantities that are near each other share attributes (i.e., are similar to some degree), and dissimilar objects are farther from each other. Such relationships between the objects, on each of several axes or latent variables, are then characterized numerically and/or graphically in a biplot. The first ordination method, principal components analysis, was suggested by Karl Pearson in 1901. == Methods == Ordination methods can broadly be categorized in eigenvector-, algorithm-, or model-based methods. Many classical ordination techniques, including principal components analysis, correspondence analysis (CA) and its derivatives (detrended correspondence analysis, canonical correspondence analysis, and redundancy analysis, belong to the first group). The second group includes some distance-based methods such as non-metric multidimensional scaling, and machine learning methods such as T-distributed stochastic neighbor embedding and nonlinear dimensionality reduction. The third group includes model-based ordination methods, which can be considered as multivariate extensions of Generalized Linear Models. Model-based ordination methods are more flexible in their application than classical ordination methods, so that it is for example possible to include random-effects. Unlike in the aforementioned two groups, there is no (implicit or explicit) distance measure in the ordination. Instead, a distribution needs to be specified for the responses as is typical for statistical models. These and other assumptions, such as the assumed mean-variance relationship, can be validated with the use of residual diagnostics, unlike in other ordination methods. == Applications == Ordination can be used on the analysis of any set of multivariate objects. It is frequently used in several environmental or ecological sciences, particularly plant community ecology. It is also used in genetics and systems biology for microarray data analysis and in psychometrics.

    Read more →
  • Operational taxonomic unit

    Operational taxonomic unit

    An operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals. The term was originally introduced in 1963 by Robert R. Sokal and Peter H. A. Sneath in the context of numerical taxonomy, where an "operational taxonomic unit" is simply the group of organisms currently being studied. In this sense, an OTU is a pragmatic definition to group individuals by similarity, equivalent to but not necessarily in line with classical Linnaean taxonomy or modern evolutionary taxonomy. Nowadays, however, the term is commonly used in a different context and refers to clusters of (uncultivated or unknown) organisms, grouped by DNA sequence similarity of a specific taxonomic marker gene (originally coined as mOTU; molecular OTU). In other words, OTUs are pragmatic proxies for "species" at different taxonomic levels, in the absence of traditional systems of biological classification as are available for macroscopic organisms. For several years, OTUs have been the most commonly used units of diversity, especially when analysing small subunit 16S (for prokaryotes) or 18S rRNA (for eukaryotes) marker gene sequence datasets. == Molecular OTU by clustering of marker gene sequences == In the approach represented by DNA barcoding, a particular locus is chosen to be used as the marker gene for classification. This locus should be universally present in the scope selected, variable enough to be different among close-related species, and be flanked by conservative sequences that allow for easy amplification and detection. There are databases containing sequences for such marker genes from many different species, allowing for comparison. (Sometimes only using one locus does not provide sufficient resolution, so multiple marker genes are used. This is the case for plants, where rbcL+matK is common.) Sequences obtained this way can be clustered according to their similarity to one another, and operational taxonomic units are defined based on the similarity threshold set by the researcher. The exact threshold depends on the taxa in question and the mutational rates of the selected locus in the taxon. 97–99% are commonly used, but "it is now recognized to be somewhat arbitrary as sequence variation within and among species varies across taxa". 100% similarity (fully identical) is also common, also known as single variants. It remains debatable how well this commonly used method recapitulates true microbial species phylogeny or ecology. Although OTUs can be calculated differently when using different algorithms or thresholds, research by Schmidt et al. (2014) demonstrated that 16S-derived microbial OTUs were generally ecologically consistent across habitats and several clustering approaches. The number of OTUs defined may be inflated due to errors in DNA sequencing. === OTU clustering approaches === There are three main approaches to clustering OTUs: De novo, for which the clustering is based on similarities between sequencing reads. Closed-reference, for which the clustering is performed against a reference database of sequences. Open-reference, where clustering is first performed against a reference database of sequences, then any remaining sequences that could not be mapped to the reference are clustered de novo. Using a reference provides taxonomic context for the OTUs found. Alternatively, taxonomic context can be found after the construction of clusters by comparing representative sequences from clusters against a reference database. There are also specialized classifiers for this purpose which are much faster than naive comparison using BLAST. === OTU clustering algorithms === Hierarchical clustering algorithms (HCA): uclust & cd-hit & ESPRIT Bayesian clustering: CROP == Molecular OTU by other methods == In addition to similarity-based grouping, marker gene sequences can be sorted into OTUs using molecular phylogeny, k-mer composition, or hybrid methods combining these methods with similarity. There are also Bayesian tree-less methods and machine learning approaches. Using phylogeny often involves manually assigning terminal clades or single nodes to an OTU, so this is usually only done for refinement. Genome skimming can be used to obtain high-copy DNA without the need to choose marker genes or to design PCR primers for the chosen genes. It can provide fairly good coverage of organelle DNA and repetitive elements such as ribosomal DNA, both of which can be used like marker genes in OTU analysis. Whole-genome sequencing is more expensive and involves the production and processing of more data. By considering the entire genome, many (sometimes over 100) marker genes can be used at the same time, producing highly resolved phylogenies that correctly identify problematic taxa. It is also possible to use entire genomes for OTU assignment. For example, genomes from different bacterial species almost always have an average nucleotide identity lower than 95%, a fact that can be used to define new OTUs (and likely new species).

    Read more →