AI App Similar To Grok

AI App Similar To Grok — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Model-based clustering

    Model-based clustering

    In statistics, cluster analysis is the algorithmic grouping of objects into homogeneous groups based on numerical measurements. Model-based clustering based on a statistical model for the data, usually a mixture model. This has several advantages, including a principled statistical basis for clustering, and ways to choose the number of clusters, to choose the best clustering model, to assess the uncertainty of the clustering, and to identify outliers that do not belong to any group. == Model-based clustering == Suppose that for each of n {\displaystyle n} observations we have data on d {\displaystyle d} variables, denoted by y i = ( y i , 1 , … , y i , d ) {\displaystyle y_{i}=(y_{i,1},\ldots ,y_{i,d})} for observation i {\displaystyle i} . Then model-based clustering expresses the probability density function of y i {\displaystyle y_{i}} as a finite mixture, or weighted average of G {\displaystyle G} component probability density functions: p ( y i ) = ∑ g = 1 G τ g f g ( y i ∣ θ g ) , {\displaystyle p(y_{i})=\sum _{g=1}^{G}\tau _{g}f_{g}(y_{i}\mid \theta _{g}),} where f g {\displaystyle f_{g}} is a probability density function with parameter θ g {\displaystyle \theta _{g}} , τ g {\displaystyle \tau _{g}} is the corresponding mixture probability where ∑ g = 1 G τ g = 1 {\displaystyle \sum _{g=1}^{G}\tau _{g}=1} . Then in its simplest form, model-based clustering views each component of the mixture model as a cluster, estimates the model parameters, and assigns each observation to cluster corresponding to its most likely mixture component. === Gaussian mixture model === The most common model for continuous data is that f g {\displaystyle f_{g}} is a multivariate normal distribution with mean vector μ g {\displaystyle \mu _{g}} and covariance matrix Σ g {\displaystyle \Sigma _{g}} , so that θ g = ( μ g , Σ g ) {\displaystyle \theta _{g}=(\mu _{g},\Sigma _{g})} . This defines a Gaussian mixture model. The parameters of the model, τ g {\displaystyle \tau _{g}} and θ g {\displaystyle \theta _{g}} for g = 1 , … , G {\displaystyle g=1,\ldots ,G} , are typically estimated by maximum likelihood estimation using the expectation-maximization algorithm (EM); see also EM algorithm and GMM model. Bayesian inference is also often used for inference about finite mixture models. The Bayesian approach also allows for the case where the number of components, G {\displaystyle G} , is infinite, using a Dirichlet process prior, yielding a Dirichlet process mixture model for clustering. === Choosing the number of clusters === An advantage of model-based clustering is that it provides statistically principled ways to choose the number of clusters. Each different choice of the number of groups G {\displaystyle G} corresponds to a different mixture model. Then standard statistical model selection criteria such as the Bayesian information criterion (BIC) can be used to choose G {\displaystyle G} . The integrated completed likelihood (ICL) is a different criterion designed to choose the number of clusters rather than the number of mixture components in the model; these will often be different if highly non-Gaussian clusters are present. === Parsimonious Gaussian mixture model === For data with high dimension, d {\displaystyle d} , using a full covariance matrix for each mixture component requires estimation of many parameters, which can result in a loss of precision, generalizabity and interpretability. Thus it is common to use more parsimonious component covariance matrices exploiting their geometric interpretation. Gaussian clusters are ellipsoidal, with their volume, shape and orientation determined by the covariance matrix. Consider the eigendecomposition of a matrix Σ g = λ g D g A g D g T , {\displaystyle \Sigma _{g}=\lambda _{g}D_{g}A_{g}D_{g}^{T},} where D g {\displaystyle D_{g}} is the matrix of eigenvectors of Σ g {\displaystyle \Sigma _{g}} , A g = diag { A 1 , g , … , A d , g } {\displaystyle A_{g}={\mbox{diag}}\{A_{1,g},\ldots ,A_{d,g}\}} is a diagonal matrix whose elements are proportional to the eigenvalues of Σ g {\displaystyle \Sigma _{g}} in descending order, and λ g {\displaystyle \lambda _{g}} is the associated constant of proportionality. Then λ g {\displaystyle \lambda _{g}} controls the volume of the ellipsoid, A g {\displaystyle A_{g}} its shape, and D g {\displaystyle D_{g}} its orientation. Each of the volume, shape and orientation of the clusters can be constrained to be equal (E) or allowed to vary (V); the orientation can also be spherical, with identical eigenvalues (I). This yields 14 possible clustering models, shown in this table: It can be seen that many of these models are more parsimonious, with far fewer parameters than the unconstrained model that has 90 parameters when G = 4 {\displaystyle G=4} and d = 9 {\displaystyle d=9} . Several of these models correspond to well-known heuristic clustering methods. For example, k-means clustering is equivalent to estimation of the EII clustering model using the classification EM algorithm. The Bayesian information criterion (BIC) can be used to choose the best clustering model as well as the number of clusters. It can also be used as the basis for a method to choose the variables in the clustering model, eliminating variables that are not useful for clustering. Different Gaussian model-based clustering methods have been developed with an eye to handling high-dimensional data. These include the pgmm method, which is based on the mixture of factor analyzers model, and the HDclassif method, based on the idea of subspace clustering. The mixture-of-experts framework extends model-based clustering to include covariates. == Example == We illustrate the method with a dateset consisting of three measurements (glucose, insulin, sspg) on 145 subjects for the purpose of diagnosing diabetes and the type of diabetes present. The subjects were clinically classified into three groups: normal, chemical diabetes and overt diabetes, but we use this information only for evaluating clustering methods, not for classifying subjects. The BIC plot shows the BIC values for each combination of the number of clusters, G {\displaystyle G} , and the clustering model from the Table. Each curve corresponds to a different clustering model. The BIC favors 3 groups, which corresponds to the clinical assessment. It also favors the unconstrained covariance model, VVV. This fits the data well, because the normal patients have low values of both sspg and insulin, while the distributions of the chemical and overt diabetes groups are elongated, but in different directions. Thus the volumes, shapes and orientations of the three groups are clearly different, and so the unconstrained model is appropriate, as selected by the model-based clustering method. The classification plot shows the classification of the subjects by model-based clustering. The classification was quite accurate, with a 12% error rate as defined by the clinical classification. Other well-known clustering methods performed worse with higher error rates, such as single-linkage clustering with 46%, average link clustering with 30%, complete-linkage clustering also with 30%, and k-means clustering with 28%. == Outliers in clustering == An outlier in clustering is a data point that does not belong to any of the clusters. One way of modeling outliers in model-based clustering is to include an additional mixture component that is very dispersed, with for example a uniform distribution. Another approach is to replace the multivariate normal densities by t {\displaystyle t} -distributions, with the idea that the long tails of the t {\displaystyle t} -distribution would ensure robustness to outliers. However, this is not breakdown-robust. A third approach is the "tclust" or data trimming approach which excludes observations identified as outliers when estimating the model parameters. == Non-Gaussian clusters and merging == Sometimes one or more clusters deviate strongly from the Gaussian assumption. If a Gaussian mixture is fitted to such data, a strongly non-Gaussian cluster will often be represented by several mixture components rather than a single one. In that case, cluster merging can be used to find a better clustering. A different approach is to use mixtures of complex component densities to represent non-Gaussian clusters. == Non-continuous data == === Categorical data === Clustering multivariate categorical data is most often done using the latent class model. This assumes that the data arise from a finite mixture model, where within each cluster the variables are independent. === Mixed data === These arise when variables are of different types, such as continuous, categorical or ordinal data. A latent class model for mixed data assumes local independence between the variable. The location model relaxes the local independence assumption. The clustMD approach assumes that the observed variables are manifestations of underlying continuous Gaussian latent

    Read more →
  • How to Choose an AI Text-to-video Tool

    How to Choose an AI Text-to-video Tool

    Comparing the best AI text-to-video tool? An AI text-to-video tool is software that uses machine learning to help you get more done — it lowers the barrier so anyone can produce professional output. Privacy matters too: check whether your data trains the model and whether a no-log or enterprise tier is available. Whether you are a beginner or a pro, the right AI text-to-video tool slots into your workflow and pays for itself fast. Below we compare features, pricing, and real output so you can choose with confidence.

    Read more →
  • Talkman

    Talkman

    Talkman is an edutainment video game developed and published by Sony Computer Entertainment for the PlayStation Portable. It utilizes voice-activated translation software that operates in four languages, Japanese, English, Korean, and Mandarin Chinese. The name "Talkman" is a reference to Sony's Walkman line of portable audio products. It was released in Japan on November 17, 2005, and in America on August 5, 2008 (via the PlayStation Store), as Talkman Travel. In America, however, instead of receiving all the languages included in the Japanese version in one package, single-language packs are available for $2.99 each. Available packs are: Paris (French), Rome (Italian), and Tokyo (Japanese). The software is designed for travelers and entertainment, mostly containing slang and useful travel phrases. While originally sold in and designed for the Japanese market for Japanese users, its translation function operates between all four languages. In Japan, the software has proven popular with the middle-aged female demographic due to an interest in South Korean products, and Korean-language soap operas and movies; and as a fun English education aid for children. Outside of pure translations, Talkman also lets players play games to test their fluency of a language. The program comes with a USB microphone included. This microphone draws power through two gold-colored contacts on the top of the PSP, one on each side of the mini-USB port. This is uncommon due to the ability for most USB products to draw power through USB. These proprietary contacts are similar to the gold-colored contacts on the bottom-right of the device, which are used for charging. Note: The Chotto Shot (aka "Go!Cam") has a built-in microphone that also can be used with the Talkman program. Furthermore, the PSP-3000 model and PSP Go have built-in microphones that work with this application, without the need for any external attachments. == Talkman Euro == Following the success of the Asian version of Talkman, a version designed for translating European languages was developed and released on June 16, 2006. Talkman Euro is available in two versions. The Japanese version contains support for English, Italian, Spanish, German, French, and Japanese, while the Chinese version contains support for Traditional Chinese instead of Japanese. The differences on the packaging (the Japanese flag as opposed to a flag with the word "mie" in Chinese) are minimal and hard to notice. == Talkman UMD-only package == Talkman is also released as a UMD-only package, so users who already have the USB mic or camera can choose to purchase this standalone version. The Sony PSP Headset and the built-in microphone on later model PSPs have also been confirmed to work with Talkman.

    Read more →
  • Ronald J. Williams

    Ronald J. Williams

    Ronald James Williams (1945 – February 16, 2024) was an American mathematician and computer scientist who spent the majority of his career at Northeastern University. He is considered one of the pioneers of neural networks. In 1986, he co-authored the seminal paper in Nature on the backpropagation algorithm along with David Rumelhart and Geoffrey Hinton, which triggered a boom in neural network research. == Education and career == Williams was born in Southern California. He studied at California Institute of Technology as a undergraduate student and received a B.S. in mathematics there in 1966. He received his M.A. and Ph.D. in mathematics, both at University of California, San Diego (UCSD) in 1972 and 1975, respectively. His Ph.D. thesis was supervised by Donald Werner Anderson. He worked for a defense contractor for some time after graduation. From 1983 to 1986, Williams was a member of the Parallel Distributed Processing research group headed by David Rumelhart at the Institute for Cognitive Science at UCSD. In 1986, Williams accepted a professorship in computer science at Northeastern University in Boston, where he remained afterwards. In addition to the backpropagation paper, Williams made fundamental contributions to the fields of recurrent neural networks, where he, along with David Zipser, invented the teacher forcing algorithm and made important contributions to backpropagation through time. In reinforcement learning, Williams introduced the REINFORCE algorithm in 1992, which became the first policy gradient method. Besides his works on neural networks, Williams, together with Wenxu Tong and Mary Jo Ondrechen, developed Partial Order Optimum Likelihood (POOL), a machine learning method used in the prediction of active amino acids in protein structures. POOL is a maximum likelihood method with a monotonicity constraint and is a general predictor of properties that depend monotonically on the input features.

    Read more →
  • Augmented Analytics

    Augmented Analytics

    Augmented Analytics is an approach of data analytics that employs the use of machine learning and natural language processing to automate analysis processes normally done by a specialist or data scientist. The term was introduced in 2017 by Rita Sallam, Cindi Howson, and Carlie Idoine in a Gartner research paper. Augmented analytics is based on business intelligence and analytics. In the graph extraction step, data from different sources are investigated. == Defining Augmented Analytics == Machine Learning – a systematic computing method that uses algorithms to sift through data to identify relationships, trends, and patterns. It is a process that allows algorithms to dynamically learn from data instead of having a set base of programmed rules. Natural language generation (NLG) – a software capability that takes unstructured data and translates it into plain-English, readable, language. Automating Insights – using machine learning algorithms to automate data analysis processes. Natural Language Query – enabling users to query data using business terms that are either typed onto a search box or spoken. == Data Democratization == Data Democratization is the democratizing data access in order to relieve data congestion and get rid of any sense of data "gatekeepers". This process must be implemented alongside a method for users to make sense of the data. This process is used in hopes of speeding up company decision making and uncovering opportunities hidden in data. There are three aspects to democratising data: Data Parameterisation and Characterisation. Data Decentralisation using an OS of blockchain and DLT technologies, as well as an independently governed secure data exchange to enable trust. Consent Market-driven Data Monetisation. When it comes to connecting assets, there are two features that will accelerate the adoption and usage of data democratisation: decentralized identity management and business data object monetization of data ownership. It enables multiple individuals and organizations to identify, authenticate, and authorize participants and organizations, enabling them to access services, data or systems across multiple networks, organizations, environments, and use cases. It empowers users and enables a personalized, self-service digital onboarding system so that users can self-authenticate without relying on a central administration function to process their information. Simultaneously, decentralized identity management ensures the user is authorized to perform actions subject to the system’s policies based on their attributes (role, department, organization, etc.) and/ or physical location. == Use cases == Agriculture – Farmers collect data on water use, soil temperature, moisture content and crop growth, augmented analytics can be used to make sense of this data and possibly identify insights that the user can then use to make business decisions. Smart Cities – Many cities across the United States, known as Smart Cities collect large amounts of data on a daily basis. Augmented analytics can be used to simplify this data in order to increase effectiveness in city management (transportation, natural disasters, etc.). Analytic Dashboards – Augmented analytics has the ability to take large data sets and create highly interactive and informative analytical dashboards that assist in many organizational decisions. Augmented Data Discovery – Using an augmented analytics process can assist organizations in automatically finding, visualizing and narrating potentially important data correlations and trends. Data Preparation – Augmented analytics platforms have the ability to take large amounts of data and organize and "clean" the data in order for it to be usable for future analyses. Business – Businesses collect large amounts of data, daily. Some examples of types of data collected in business operations include; sales data, consumer behavior data, distribution data. An augmented analytics platform provides access to analysis of this data, which could be used in making business decisions.

    Read more →
  • Writer invariant

    Writer invariant

    Writer invariant, also called authorial invariant or author's invariant, is a property of a text which is invariant of its author, that is, it will be similar in all texts of a given author and different in texts of different authors. It can be used to find plagiarism or discover who is real author of anonymously published text. Writer invariant is also an author's pattern of writing a letter in handwritten text recognition. While it is generally recognised that writer invariants exist, it is not agreed what properties of a text should be used. Among the first ones used was distribution of word lengths; other proposed invariants include average sentence length, average word length, noun, verb or adjective usage frequency, vocabulary richness, and frequency of function words, or specific function words. Of these, average sentence lengths can be very similar in works of different authors or vary significantly even within a single work; average word lengths likewise turn out to be very similar in works of different authors. Analysis of function words shows promise because they are used by authors unconsciously.

    Read more →
  • How to Choose an AI Photo Editor

    How to Choose an AI Photo Editor

    In search of the best AI photo editor? An AI photo editor is software that uses machine learning to help you get more done — it turns a rough idea into a polished result in seconds. When choosing one, weigh output quality, pricing, export formats, and how well it fits the tools you already use. Whether you are a beginner or a pro, the right AI photo editor slots into your workflow and pays for itself fast. Below we compare features, pricing, and real output so you can choose with confidence.

    Read more →
  • Geoffrey J. Gordon

    Geoffrey J. Gordon

    Geoffrey J. Gordon is a professor at the Machine Learning Department at Carnegie Mellon University in Pittsburgh and director of research at the Microsoft Montréal lab. He is known for his research in statistical relational learning (a subdiscipline of artificial intelligence and machine learning) and on anytime dynamic variants of the A search algorithm. His research interests include multi-agent planning, reinforcement learning, decision-theoretic planning, statistical models of difficult data (e.g. maps, video, text), computational learning theory, and game theory. Gordon received a B.A. in computer science from Cornell University in 1991, and a PhD at Carnegie Mellon in 1999.

    Read more →
  • AI Mode

    AI Mode

    AI Mode is a search feature used within Google Search. In March 2025, Google introduced an experimental "AI Mode" within its search platform, enabling users to input complex, multi-part queries and receive comprehensive, AI-generated responses. This feature uses Google's Gemini model, which enhances the system's reasoning capabilities and supports multimodal inputs, including text, images, and voice. Users need to be signed in to be able to use the image generation features. Initially, AI Mode was available to Google One AI Premium subscribers in the United States, who could access it through the Search Labs platform. This phased rollout allowed Google to gather user feedback and refine the feature before a broader release.

    Read more →
  • Flex (lexical analyzer generator)

    Flex (lexical analyzer generator)

    Flex (fast lexical analyzer generator) is a free and open-source software alternative to lex. It is a computer program that generates lexical analyzers (also known as "scanners" or "lexers"). It is frequently used as the lex implementation together with Berkeley Yacc parser generator on BSD-derived operating systems (as both lex and yacc are part of POSIX), or together with GNU bison (a version of yacc) in BSD ports and in Linux distributions. Unlike Bison, flex is not part of the GNU Project and is not released under the GNU General Public License, although a manual for Flex was produced and published by the Free Software Foundation. == History == Flex was written in C around 1987 by Vern Paxson, with the help of many ideas and much inspiration from Van Jacobson. Original version by Jef Poskanzer. The fast table representation is a partial implementation of a design done by Van Jacobson. The implementation was done by Kevin Gong and Vern Paxson. == Example lexical analyzer == This is an example of a Flex scanner for the instructional programming language PL/0. The tokens recognized are: '+', '-', '', '/', '=', '(', ')', ',', ';', '.', ':=', '<', '<=', '<>', '>', '>='; numbers: 0-9 {0-9}; identifiers: a-zA-Z {a-zA-Z0-9} and keywords: begin, call, const, do, end, if, odd, procedure, then, var, while. == Internals == These programs perform character parsing and tokenizing via the use of a deterministic finite automaton (DFA). A DFA is a theoretical machine accepting regular languages, and is equivalent to read-only right moving Turing machines. The syntax is based on the use of regular expressions. See also nondeterministic finite automaton. == Issues == === Time complexity === A Flex lexical analyzer usually has time complexity O ( n ) {\displaystyle O(n)} in the length of the input. That is, it performs a constant number of operations for each input symbol. This constant is quite low: GCC generates 12 instructions for the DFA match loop. Note that the constant is independent of the length of the token, the length of the regular expression and the size of the DFA. However, using the REJECT macro in a scanner with the potential to match extremely long tokens can cause Flex to generate a scanner with non-linear performance. This feature is optional. In this case, the programmer has explicitly told Flex to "go back and try again" after it has already matched some input. This will cause the DFA to backtrack to find other accept states. The REJECT feature is not enabled by default, and because of its performance implications its use is discouraged in the Flex manual. === Reentrancy === By default the scanner generated by Flex is not reentrant. This can cause serious problems for programs that use the generated scanner from different threads. To overcome this issue there are options that Flex provides in order to achieve reentrancy. A detailed description of these options can be found in the Flex manual. === Usage under non-Unix environments === Normally the generated scanner contains references to the unistd.h header file, which is Unix specific. To avoid generating code that includes unistd.h, %option nounistd should be used. Another issue is the call to isatty (a Unix library function), which can be found in the generated code. The %option never-interactive forces flex to generate code that does not use isatty. === Using flex from other languages === Flex can only generate code for C and C++. To use the scanner code generated by flex from other languages a language binding tool such as SWIG can be used. === Unicode support === Flex is limited to matching 1-byte (8-bit) binary values and therefore does not support Unicode. RE/flex and other alternatives do support Unicode matching. == Flex++ == flex++ is a similar lexical scanner for C++ which is included as part of the flex package. The generated code does not depend on any runtime or external library except for a memory allocator (malloc or a user-supplied alternative) unless the input also depends on it. This can be useful in embedded and similar situations where traditional operating system or C runtime facilities may not be available. The flex++ generated C++ scanner includes the header file FlexLexer.h, which defines the interfaces of the two C++ generated classes.

    Read more →
  • Multiple sequence alignment

    Multiple sequence alignment

    Multiple sequence alignment (MSA) is the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. These alignments are used to infer evolutionary relationships via phylogenetic analysis and can highlight homologous features between sequences. Alignments highlight mutation events such as point mutations (single amino acid or nucleotide changes), insertion mutations and deletion mutations, and alignments are used to assess sequence conservation and infer the presence and activity of protein domains, tertiary structures, secondary structures, and individual amino acids or nucleotides. Multiple sequence alignments require more sophisticated methodologies than pairwise alignments, as they are more computationally complex. Most multiple sequence alignment programs use heuristic methods rather than global optimization because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive. However, heuristic methods generally cannot guarantee high-quality solutions and have been shown to fail to yield near-optimal solutions on benchmark test cases. == Problem statement == Given m {\displaystyle m} sequences S i {\displaystyle S_{i}} , i = 1 , ⋯ , m {\displaystyle i=1,\cdots ,m} similar to the form below: S := { S 1 = ( S 11 , S 12 , … , S 1 n 1 ) S 2 = ( S 21 , S 22 , ⋯ , S 2 n 2 ) ⋮ S m = ( S m 1 , S m 2 , … , S m n m ) {\displaystyle S:={\begin{cases}S_{1}=(S_{11},S_{12},\ldots ,S_{1n_{1}})\\S_{2}=(S_{21},S_{22},\cdots ,S_{2n_{2}})\\\,\,\,\,\,\,\,\,\,\,\vdots \\S_{m}=(S_{m1},S_{m2},\ldots ,S_{mn_{m}})\end{cases}}} A multiple sequence alignment is taken of this set of sequences S {\displaystyle S} by inserting any amount of gaps needed into each of the S i {\displaystyle S_{i}} sequences of S {\displaystyle S} until the modified sequences, S i ′ {\displaystyle S'_{i}} , all conform to length L ≥ max { n i ∣ i = 1 , … , m } {\displaystyle L\geq \max\{n_{i}\mid i=1,\ldots ,m\}} and no values in the sequences of S {\displaystyle S} of the same column consists of only gaps. The mathematical form of an MSA of the above sequence set is shown below: S ′ := { S 1 ′ = ( S 11 ′ , S 12 ′ , … , S 1 L ′ ) S 2 ′ = ( S 21 ′ , S 22 ′ , … , S 2 L ′ ) ⋮ S m ′ = ( S m 1 ′ , S m 2 ′ , … , S m L ′ ) {\displaystyle S':={\begin{cases}S'_{1}=(S'_{11},S'_{12},\ldots ,S'_{1L})\\S'_{2}=(S'_{21},S'_{22},\ldots ,S'_{2L})\\\,\,\,\,\,\,\,\,\,\,\vdots \\S'_{m}=(S'_{m1},S'_{m2},\ldots ,S'_{mL})\end{cases}}} To return from each particular sequence S i ′ {\displaystyle S'_{i}} to S i {\displaystyle S_{i}} , remove all gaps. == Graphing approach == A general approach when calculating multiple sequence alignments is to use graphs to identify all of the different alignments. When finding alignments via graph, a complete alignment is created in a weighted graph that contains a set of vertices and a set of edges. Each of the graph edges has a weight based on a certain heuristic that helps to score each alignment or subset of the original graph. === Tracing alignments === When determining the best suited alignments for each MSA, a trace is usually generated. A trace is a set of realized, or corresponding and aligned, vertices that has a specific weight based on the edges that are selected between corresponding vertices. When choosing traces for a set of sequences it is necessary to choose a trace with a maximum weight to get the best alignment of the sequences. == Alignment methods == There are various alignment methods used within multiple sequence to maximize scores and correctness of alignments. Each is usually based on a certain heuristic with an insight into the evolutionary process. Most try to replicate evolution to get the most realistic alignment possible to best predict relations between sequences. === Dynamic programming === A direct method for producing an MSA uses the dynamic programming technique to identify the globally optimal alignment solution. For proteins, this method usually involves two sets of parameters: a gap penalty and a substitution matrix assigning scores or probabilities to the alignment of each possible pair of amino acids based on the similarity of the amino acids' chemical properties and the evolutionary probability of the mutation. For nucleotide sequences, a similar gap penalty is used, but a much simpler substitution matrix, wherein only identical matches and mismatches are considered, is typical. The scores in the substitution matrix may be either all positive or a mix of positive and negative in the case of a global alignment, but must be both positive and negative, in the case of a local alignment. For n individual sequences, the naive method requires constructing the n-dimensional equivalent of the matrix formed in standard pairwise sequence alignment. The search space thus increases exponentially with increasing n and is also strongly dependent on sequence length. Expressed with the big O notation commonly used to measure computational complexity, a naïve MSA takes O(LengthNseqs) time to produce. To find the global optimum for n sequences this way has been shown to be an NP-complete problem. In 1989, based on Carrillo-Lipman Algorithm, Altschul introduced a practical method that uses pairwise alignments to constrain the n-dimensional search space. In this approach pairwise dynamic programming alignments are performed on each pair of sequences in the query set, and only the space near the n-dimensional intersection of these alignments is searched for the n-way alignment. The MSA program optimizes the sum of all of the pairs of characters at each position in the alignment (the so-called sum of pair score) and has been implemented in a software program for constructing multiple sequence alignments. In 2019, Hosseininasab and van Hoeve showed that by using decision diagrams, MSA may be modeled in polynomial space complexity. === Progressive alignment construction === The most widely used approach to multiple sequence alignments uses a heuristic search known as progressive technique (also known as the hierarchical or tree method) developed by Da-Fei Feng and Doolittle in 1987. Progressive alignment builds up a final MSA by combining pairwise alignments beginning with the most similar pair and progressing to the most distantly related. All progressive alignment methods require two stages: a first stage in which the relationships between the sequences are represented as a phylogenetic tree, called a guide tree, and a second step in which the MSA is built by adding the sequences sequentially to the growing MSA according to the guide tree. The initial guide tree is determined by an efficient clustering method such as neighbor-joining or unweighted pair group method with arithmetic mean (UPGMA), and may use distances based on the number of identical two-letter sub-sequences (as in FASTA rather than a dynamic programming alignment). Progressive alignments are not guaranteed to be globally optimal. The primary problem is that when errors are made at any stage in growing the MSA, these errors are then propagated through to the final result. Performance is also particularly bad when all of the sequences in the set are rather distantly related. Most modern progressive methods modify their scoring function with a secondary weighting function that assigns scaling factors to individual members of the query set in a nonlinear fashion based on their phylogenetic distance from their nearest neighbors. This corrects for non-random selection of the sequences given to the alignment program. Progressive alignment methods are efficient enough to implement on a large scale for many (100s to 1000s) sequences. A popular progressive alignment method has been the Clustal family. ClustalW is used extensively for phylogenetic tree construction, in spite of the author's explicit warnings that unedited alignments should not be used in such studies and as input for protein structure prediction by homology modeling. European Bioinformatics Institute (EMBL-EBI) announced that CLustalW2 will expire in August 2015. They recommend Clustal Omega which performs based on seeded guide trees and HMM profile-profile techniques for protein alignments. An alternative tool for progressive DNA alignments is multiple alignment using fast Fourier transform (MAFFT). Another common progressive alignment method named T-Coffee is slower than Clustal and its derivatives but generally produces more accurate alignments for distantly related sequence sets. T-Coffee calculates pairwise alignments by combining the direct alignment of the pair with indirect alignments that aligns each sequence of the pair to a third sequence. It uses the output from Clustal as well as another local alignment program LALIGN, which finds multiple regions of local alignment between two sequences. The resulting alignment and phylogenetic tree are used as a guide to produce new and more accurate w

    Read more →
  • Markov chain central limit theorem

    Markov chain central limit theorem

    In the mathematical theory of random processes, the Markov chain central limit theorem has a conclusion somewhat similar in form to that of the classic central limit theorem (CLT) of probability theory, but the quantity in the role taken by the variance in the classic CLT has a more complicated definition. See also the general form of Bienaymé's identity. == Statement == Suppose that: the sequence X 1 , X 2 , X 3 , … {\textstyle X_{1},X_{2},X_{3},\ldots } of random elements of some set is a Markov chain that has a stationary probability distribution; and the initial distribution of the process, i.e. the distribution of X 1 {\textstyle X_{1}} , is the stationary distribution, so that X 1 , X 2 , X 3 , … {\textstyle X_{1},X_{2},X_{3},\ldots } are identically distributed. In the classic central limit theorem these random variables would be assumed to be independent, but here we have only the weaker assumption that the process has the Markov property; and g {\textstyle g} is some (measurable) real-valued function for which var ⁡ ( g ( X 1 ) ) < + ∞ . {\textstyle \operatorname {var} (g(X_{1}))<+\infty .} Now let μ = E ⁡ ( g ( X 1 ) ) , μ ^ n = 1 n ∑ k = 1 n g ( X k ) σ 2 := lim n → ∞ var ⁡ ( n μ ^ n ) = lim n → ∞ n var ⁡ ( μ ^ n ) = var ⁡ ( g ( X 1 ) ) + 2 ∑ k = 1 ∞ cov ⁡ ( g ( X 1 ) , g ( X 1 + k ) ) . {\displaystyle {\begin{aligned}\mu &=\operatorname {E} (g(X_{1})),\\{\widehat {\mu }}_{n}&={\frac {1}{n}}\sum _{k=1}^{n}g(X_{k})\\\sigma ^{2}&:=\lim _{n\to \infty }\operatorname {var} ({\sqrt {n}}{\widehat {\mu }}_{n})=\lim _{n\to \infty }n\operatorname {var} ({\widehat {\mu }}_{n})=\operatorname {var} (g(X_{1}))+2\sum _{k=1}^{\infty }\operatorname {cov} (g(X_{1}),g(X_{1+k})).\end{aligned}}} Then as n → ∞ , {\textstyle n\to \infty ,} we have n ( μ ^ n − μ ) → D Normal ( 0 , σ 2 ) , {\displaystyle {\sqrt {n}}({\hat {\mu }}_{n}-\mu )\ {\xrightarrow {\mathcal {D}}}\ {\text{Normal}}(0,\sigma ^{2}),} where the decorated arrow indicates convergence in distribution. == Monte Carlo Setting == The Markov chain central limit theorem can be guaranteed for functionals of general state space Markov chains under certain conditions. In particular, this can be done with a focus on Monte Carlo settings. An example of the application in a MCMC (Markov Chain Monte Carlo) setting is the following: Consider a simple hard spheres model on a grid. Suppose X = { 1 , … , n 1 } × { 1 , … , n 2 } ⊆ Z 2 {\displaystyle X=\{1,\ldots ,n_{1}\}\times \{1,\ldots ,n_{2}\}\subseteq Z^{2}} . A proper configuration on X {\displaystyle X} consists of coloring each point either black or white in such a way that no two adjacent points are white. Let χ {\displaystyle \chi } denote the set of all proper configurations on X {\displaystyle X} , N χ ( n 1 , n 2 ) {\displaystyle N_{\chi }(n_{1},n_{2})} be the total number of proper configurations and π be the uniform distribution on χ {\displaystyle \chi } so that each proper configuration is equally likely. Suppose our goal is to calculate the typical number of white points in a proper configuration; that is, if W ( x ) {\displaystyle W(x)} is the number of white points in x ∈ χ {\displaystyle x\in \chi } then we want the value of E π W = ∑ x ∈ χ W ( x ) N χ ( n 1 , n 2 ) {\displaystyle E_{\pi }W=\sum _{x\in \chi }{\frac {W(x)}{N_{\chi }{\bigl (}n_{1},n_{2}{\bigr )}}}} If n 1 {\displaystyle n_{1}} and n 2 {\displaystyle n_{2}} are even moderately large then we will have to resort to an approximation to E π W {\displaystyle E_{\pi }W} . Consider the following Markov chain on χ {\displaystyle \chi } . Fix p ∈ ( 0 , 1 ) {\displaystyle p\in (0,1)} and set X 1 = x 1 {\displaystyle X_{1}=x_{1}} where x 1 ∈ χ {\displaystyle x_{1}\in \chi } is an arbitrary proper configuration. Randomly choose a point ( x , y ) ∈ X {\displaystyle (x,y)\in X} and independently draw U ∼ U n i f o r m ( 0 , 1 ) {\displaystyle U\sim \mathrm {Uniform} (0,1)} . If u ≤ p {\displaystyle u\leq p} and all of the adjacent points are black then color ( x , y ) {\displaystyle (x,y)} white leaving all other points alone. Otherwise, color ( x , y ) {\displaystyle (x,y)} black and leave all other points alone. Call the resulting configuration X 1 {\displaystyle X_{1}} . Continuing in this fashion yields a Harris ergodic Markov chain { X 1 , X 2 , X 3 , … } {\displaystyle \{X_{1},X_{2},X_{3},\ldots \}} having π {\displaystyle \pi } as its invariant distribution. It is now a simple matter to estimate E π W {\displaystyle E_{\pi }W} with w n ¯ = ∑ i = 1 n W ( X i ) / n {\displaystyle {\overline {w_{n}}}=\sum _{i=1}^{n}W(X_{i})/n} . Also, since χ {\displaystyle \chi } is finite (albeit potentially large) it is well known that X {\displaystyle X} will converge exponentially fast to π {\displaystyle \pi } which implies that a CLT holds for w n ¯ {\displaystyle {\overline {w_{n}}}} . == Implications == Not taking into account the additional terms in the variance which stem from correlations (e.g. serial correlations in markov chain monte carlo simulations) can result in the problem of pseudoreplication when computing e.g. the confidence intervals for the sample mean.

    Read more →
  • Index locking

    Index locking

    In databases an index is a data structure, part of the database, used by a database system to efficiently navigate access to user data. Index data are system data distinct from user data, and consist primarily of pointers. Changes in a database (by insert, delete, or modify operations), may require indexes to be updated to maintain accurate user data accesses. Index locking is a technique used to maintain index integrity. A portion of an index is locked during a database transaction when this portion is being accessed by the transaction as a result of attempt to access related user data. Additionally, special database system transactions (not user-invoked transactions) may be invoked to maintain and modify an index, as part of a system's self-maintenance activities. When a portion of an index is locked by a transaction, other transactions may be blocked from accessing this index portion (blocked from modifying, and even from reading it, depending on lock type and needed operation). Index Locking Protocol guarantees that phantom read phenomenon won't occur. Index locking protocol states: Every relation must have at least one index. A transaction can access tuples only after finding them through one or more indices on the relation A transaction Ti that performs a lookup must lock all the index leaf nodes that it accesses, in S-mode, even if the leaf node does not contain any tuple satisfying the index lookup (e.g. for a range query, no tuple in a leaf is in the range) A transaction Ti that inserts, updates or deletes a tuple ti in a relation r must update all indices to r and it must obtain exclusive locks on all index leaf nodes affected by the insert/update/delete The rules of the two-phase locking protocol must be observed. Specialized concurrency control techniques exist for accessing indexes. These techniques depend on the index type, and take advantage of its structure. They are typically much more effective than applying to indexes common concurrency control methods applied to user data. Notable and widely researched are specialized techniques for B-trees (B-Tree concurrency control) which are regularly used as database indexes. Index locks are used to coordinate threads accessing indexes concurrently, and typically shorter-lived than the common transaction locks on user data. In professional literature, they are often called latches.

    Read more →
  • Ben Goertzel

    Ben Goertzel

    Ben Goertzel is a computer scientist, artificial intelligence (AI) researcher, and businessman. He helped popularize the term artificial general intelligence (AGI). == Early life and education == Three of Goertzel's Jewish great-grandparents immigrated to New York from Lithuania and Poland (in the Russian Empire). Goertzel's father is Ted Goertzel, a former professor of sociology at Rutgers University. Goertzel left high school after the tenth grade to attend Bard College at Simon's Rock, where he graduated with a bachelor's degree in Quantitative Studies. Goertzel graduated with a PhD in mathematics from Temple University under the supervision of Avi Lin in 1990, at age 23. == Career == Goertzel is the founder and CEO of SingularityNET, a project which was founded to distribute artificial intelligence data via blockchains. He is a leading developer of the OpenCog framework for artificial general intelligence. Goertzel was an associate and grant recipient of Jeffrey Epstein. He received a $100,000 grant from the Jeffrey Epstein Foundation for artificial general intelligence research in 2001. When interviewed by The New York Times about Epstein in 2019, Goertzel said, "I have no desire to talk about Epstein right now... The stuff I'm reading about him in the papers is pretty disturbing and goes way beyond what I thought his misdoings and kinks were. Yecch." === Sophia the Robot === Goertzel was the Chief Scientist of Hanson Robotics, the company that created the Sophia robot. As of 2018, Sophia's architecture includes scripting software, a chat system, and OpenCog, an AI system designed for general reasoning. Experts in the field have treated the project mostly as a PR stunt, stating that Hanson's claims that Sophia was "basically alive" are "grossly misleading" because the project does not involve AI technology, while computer scientist Yann LeCun, then Meta's chief AI scientist, made several unflattering remarks including calling the project "complete bullshit". === Views on AI === In May 2007, Goertzel spoke at a Google tech talk about his approach to creating artificial general intelligence. He defines intelligence as the ability to detect patterns in the world and in the agent itself, measurable in terms of emergent behavior of "achieving complex goals in complex environments". A "baby-like" artificial intelligence is initialized, then trained as an agent in a simulated or virtual world such as Second Life to produce a more powerful intelligence. Knowledge is represented in a network whose nodes and links carry probabilistic truth values as well as "attention values", with the attention values resembling the weights in a neural network. Several algorithms operate on this network, the central one being a combination of a probabilistic inference engine and a custom version of evolutionary programming. The 2012 documentary The Singularity by independent filmmaker Doug Wolens discussed Goertzel's views on AGI. In 2023 Goertzel postulated that artificial intelligence could replace up to 80 percent of human jobs in the coming years "without having an AGI, by my guess. Not with ChatGPT exactly as a product. But with systems of that nature". At the Web Summit 2023 in Rio de Janeiro, Goertzel spoke out against efforts to curb AI research and that AGI is only a few years away. Goertzel's belief is that AGI will be a net positive for humanity by assisting with societal problems such as, but not limited to, climate change.

    Read more →
  • Is an AI Humanizer Worth It in 2026?

    Is an AI Humanizer Worth It in 2026?

    Shopping for the best AI humanizer? An AI humanizer is software that uses machine learning to help you get more done — it keeps getting smarter as the underlying models improve. Pricing, accuracy, and the size of the model behind the tool are the three factors that most affect daily usefulness. Whether you are a beginner or a pro, the right AI humanizer slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.

    Read more →