AI Data Bias

AI Data Bias — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Are You Dead?

    Are You Dead?

    Are You Dead? (Chinese: 死了么; pinyin: Sǐleme), also known by its English name Demumu, is a Chinese application designed for young people living alone. It requires setting up one emergency contact and sends automatic notifications if the user has not checked in via the app for consecutive days. The app was released on the App Store on 10 June 2025. In early January 2026, the application gained popularity due to its name and the issue of safety for people living alone, and ranked high on the list of paid applications in the Chinese region of the Apple App Store before being removed. The app's rise in popularity sparked discussions about taboos about death in China. == History == Are You Dead? was founded and operated independently by three people born in the 1990s, and developed in a way that involved remote collaboration in their spare time. According to the New Yellow River report, Guo, the product manager, said that the application was designed for young people and that the inspiration came from the discussion of netizens on social platforms about "an app that everyone must have and will definitely download" that he observed two or three years ago. The name was also "not their original creation". After realizing its potential demand and social significance, the team successfully registered the name and completed the product development in about a month. Regarding the development entity, the New Yellow River cited information from the Apple App Store that the application was developed by Yuejing (Zhengzhou) Technology Service Co., Ltd. According to Tianyancha information, the company was established in March 2025 with a registered capital of 100,000 yuan. === Rise in popularity === The app has been generating buzz on social media since 9 January 2026, due to its name and the topic of safety for people living alone. Around 10 January, it topped the Apple paid app chart. As of 10:00 a.m. on January 11, it ranked first in the App Store paid app chart. It also ranked highly in the utility app chart; it ranked first or second in the paid utility app charts in the United States, Singapore and Hong Kong, and first or fourth in Australia and Spain. The app was subsequently removed from the Apple App Store in China. In terms of functionality and usage, First Financial praised the product for its "simple interface and single function," but pointed out that the interface lacks a display of consecutive check-in days, and there is also the possibility that users may forget to check in, leading to the mistaken issuance of reminders. In addition, since the application mainly relies on email reminders and lacks SMS or telephone notifications, it does not conform to Chinese social habits; the untimely notifications also make the application more like a "death notification" tool, losing its early warning significance for emergency rescue. Hu Xijin, former editor-in-chief of the Global Times, commented on the application on Weibo that it is "really good and can help many lonely elderly people." The Beijing News Quick Review pointed out that the role of technical tools is limited and needs to be connected with real support such as community patrols and liaison mechanisms. Due to the price increase, there have also been questions about the motivation for the price increase. The app's rise in popularity sparked discussions about taboos about death in China. Regarding the popularity of the application, both Southern Metropolis Daily and The Beijing News commented that it reflects the public issue of the risks of living alone and reflects the general anxiety of the living alone group about dying alone. Shangguan News further pointed out that although such technology products provide a certain "low-cost sense of security", their "cold notifications" may not only cause false alarms, but also highlight the embarrassing reality that "there is no one to fill in the emergency contact". It also emphasized that algorithms or applications cannot bring true happiness and called on society to reconstruct a support network full of humanistic care while relying on technology. The name of the application has also sparked controversy. Most netizens believe that the name "Are You Dead?" is unlucky and makes it awkward to share the application. They suggest changing it to a milder name such as "Are You Alive?". Hu Xijin also said that the name change could "give the elderly who use it more psychological comfort" and "believe that the application will become more popular after the name change". Some people also believe that this straightforward name just points out the real dilemma faced by people living alone and has a special meaning. BBC News commented that the name "Are You Dead" is playing a word game with Ele.me (Chinese: 饿了么; pinyin: Èleme) and the pronunciation is also similar. Legal professionals believe that its name is highly similar to Ele.me and may cause confusion. They also raised the possibility of trademark infringement and unfair competition. However, the developers said that the application is developed for young people and death is not a sensitive topic. They will "consider launching a new application that is more suitable for middle-aged and elderly people". They have not yet received any name change requests from relevant departments. On the evening of 13 January 2026, the Are You Dead? team announced that it would change its name to the English brand name Demumu in the upcoming new version. On 11 January, the development team also issued a statement through its official Weibo account, stating that it would study the renaming suggestion and plan to enrich the SMS reminder function, consider adding the message function and explore the direction of age-friendly products; it also stated that it would launch an 8 yuan paid plan to cover the costs of SMS, servers, etc., and welcomed investors to discuss cooperation. In terms of financing and valuation, it plans to sell 10% of the company's shares for 1 million yuan and proposed a valuation of 10 million yuan. On the evening of January 15, the application was removed from the app store in mainland China. == Functions == The application does not require users to enter phone numbers or other information to register. After filling in their name and setting an emergency contact, users can click the sign-in button every day. If they fail to sign in for two consecutive days, the system will send an email reminder to the emergency contact the next day. In addition, users can also bind a smart bracelet to monitor physiological signs, pre-designate a hearse driver and funeral music, and trigger the "one-click body collection" function when no pulse is detected. The application was initially available for free download, but a one yuan paid download option was introduced at the end of 2025. In January 2026, the application team issued a statement saying that an 8 yuan paid option would be launched based on the costs of SMS, servers, etc.

    Read more →
  • Multilinear principal component analysis

    Multilinear principal component analysis

    Multilinear principal component analysis (MPCA) is a multilinear extension of principal component analysis (PCA) that is used to analyze M-way arrays, also informally referred to as "data tensors". M-way arrays may be modeled by linear tensor models, such as CANDECOMP/Parafac, or by multilinear tensor models, such as multilinear principal component analysis (MPCA) or multilinear (tensor) independent component analysis (MICA). In 2005, Vasilescu and Terzopoulos introduced the Multilinear PCA terminology as a way to better differentiate between multilinear data models that employed 2nd order statistics versus higher order statistics to compute a set of independent components for each mode, such as Multilinear ICA Multilinear PCA may be applied to compute the causal factors of data formation, or as signal processing tool on data tensors whose individual observation have either been vectorized, or whose observations are treated as a collection of column/row observations, an "observation as a matrix", and concatenated into a data tensor. The latter approach is suitable for compression and reducing redundancy in the rows, columns and fibers that are unrelated to the causal factors of data formation. Vasilescu and Terzopoulos in their paper "TensorFaces" introduced the M-mode SVD algorithm which are algorithms misidentified in the literature as the HOSVD or the Tucker which employ the power method or gradient descent, respectively. Vasilescu and Terzopoulos framed the data analysis, recognition and synthesis problems as multilinear tensor problems. Data is viewed as the compositional consequence of several causal factors, that are well suited for multi-modal tensor factor analysis. The power of the tensor framework was showcased by analyzing human motion joint angles, facial images or textures in the following papers: Human Motion Signatures (CVPR 2001, ICPR 2002), face recognition – TensorFaces, (ECCV 2002, CVPR 2003, etc.) and computer graphics – TensorTextures (Siggraph 2004). == The algorithm == The MPCA solution follows the alternating least square (ALS) approach. It is iterative in nature. As in PCA, MPCA works on centered data. Centering is a little more complicated for tensors, and it is problem dependent. == Feature selection == MPCA features: Supervised MPCA is employed in causal factor analysis that facilitates object recognition while a semi-supervised MPCA feature selection is employed in visualization tasks. == Extensions == Various extension of MPCA: Robust MPCA (RMPCA) Multi-Tensor Factorization, that also finds the number of components automatically (MTF)

    Read more →
  • Constructing skill trees

    Constructing skill trees

    Constructing skill trees (CST) is a hierarchical reinforcement learning algorithm which can build skill trees from a set of sample solution trajectories obtained from demonstration. CST uses an incremental MAP (maximum a posteriori) change point detection algorithm to segment each demonstration trajectory into skills and integrate the results into a skill tree. CST was introduced by George Konidaris, Scott Kuindersma, Andrew Barto and Roderic Grupen in 2010. == Algorithm == CST consists of mainly three parts;change point detection, alignment and merging. The main focus of CST is online change-point detection. The change-point detection algorithm is used to segment data into skills and uses the sum of discounted reward R t {\displaystyle R_{t}} as the target regression variable. Each skill is assigned an appropriate abstraction. A particle filter is used to control the computational complexity of CST. The change point detection algorithm is implemented as follows. The data for times t ∈ T {\displaystyle t\in T} and models Q with prior p ( q ∈ Q ) {\displaystyle p(q\in Q)} are given. The algorithm is assumed to be able to fit a segment from time j + 1 {\displaystyle j+1} to t using model q with the fit probability P ( j , t , q ) {\displaystyle P(j,t,q)_{}^{}} . A linear regression model with Gaussian noise is used to compute P ( j , t , q ) {\displaystyle P(j,t,q)} . The Gaussian noise prior has mean zero, and variance which follows I n v e r s e G a m m a ( v 2 , u 2 ) {\displaystyle \mathrm {InverseGamma} \left({\frac {v}{2}},{\frac {u}{2}}\right)} . The prior for each weight follows N o r m a l ( 0 , σ 2 δ ) {\displaystyle \mathrm {Normal} (0,\sigma ^{2}\delta )} . The fit probability P ( j , t , q ) {\displaystyle P(j,t,q)} is computed by the following equation. P ( j , t , q ) = π − n 2 δ m | ( A + D ) − 1 | 1 2 u v 2 ( y + u ) u + v 2 Γ ( n + v 2 ) Γ ( v 2 ) {\displaystyle P(j,t,q)={\frac {\pi ^{-{\frac {n}{2}}}}{\delta ^{m}}}\left|(A+D)^{-1}\right|^{\frac {1}{2}}{\frac {u^{\frac {v}{2}}}{(y+u)^{\frac {u+v}{2}}}}{\frac {\Gamma ({\frac {n+v}{2}})}{\Gamma ({\frac {v}{2}})}}} Then, CST compute the probability of the changepoint at time j with model q, P t ( j , q ) {\displaystyle P_{t}(j,q)} and P j MAP {\displaystyle P_{j}^{\text{MAP}}} using a Viterbi algorithm. P t ( j , q ) = ( 1 − G ( t − j − 1 ) ) P ( j , t , q ) p ( q ) P j MAP {\displaystyle P_{t}(j,q)=(1-G(t-j-1))P(j,t,q)p(q)P_{j}^{\text{MAP}}} P j MAP = max i , q P j ( i , q ) g ( j − i ) 1 − G ( j − i − 1 ) , ∀ j < t {\displaystyle P_{j}^{\text{MAP}}=\max _{i,q}{\frac {P_{j}(i,q)g(j-i)}{1-G(j-i-1)}},\forall j Read more →

  • Deterministic blockmodeling

    Deterministic blockmodeling

    Deterministic blockmodeling is an approach in blockmodeling that does not assume a probabilistic model, and instead relies on the exact or approximate algorithms, which are used to find blockmodel(s). This approach typically minimizes some inconsistency that can occur with the ideal block structure. Such analysis is focused on clustering (grouping) of the network (or adjacency matrix) that is obtained with minimizing an objective function, which measures discrepancy from the ideal block structure. However, some indirect approaches (or methods between direct and indirect approaches, such as CONCOR) do not explicitly minimize inconsistencies or optimize some criterion function. This approach was popularized in the 1970s, due to the presence of two computer packages (CONCOR and STRUCTURE) that were used to "find a permutation of the rows and columns in the adjacency matrix leading to an approximate block structure". The opposite approach to deterministic blockmodeling is a stochastic blockmodeling approach.

    Read more →
  • Trigger list

    Trigger list

    Trigger list in its most general meaning refers to a list whose items are used to initiate ("trigger") certain actions. == United States: Private financial information == In the United States, when a person applies for a mortgage loan, the lender makes a credit inquiry about the potential borrower from the national credit bureaus, Equifax, Experian and TransUnion. Unless the borrower is opted out, the credit bureaus put the applicants onto a "trigger list" of "leads" about persons who are interested in new loans. These lists are sold to numerous lenders all over the United States, and soon after the application the applicant starts receiving offers from all parts of the country. The trigger lists contain a significant amount of personal financial information. Among the buyers of trigger lists are "lead generators" which resell filtered information to borrowers, e.g., of people who live in a certain area and have a certain credit score. While the Federal Trade Commission considers the market of "trigger lists" to be a legal business, many people and organizations (such as the National Association of Mortgage Brokers) consider this a serious breach of privacy and lobby for putting this practice under regulatory controls. As of now, American consumers may opt-out from "trigger lists" by calling 1-888-5-OPTOUT (1-888-567-8688). == Nuclear non-proliferation == The Zangger Committee and the Nuclear Suppliers Group maintain lists of items that may contribute to nuclear proliferation; The nuclear non-proliferation treaty forbids its members to export such items to non-treaty members. these items are said to trigger the countries' responsibilities under the NPT, hence the name.

    Read more →
  • IDistance

    IDistance

    In pattern recognition, iDistance is an indexing and query processing technique for k-nearest neighbor queries on point data in multi-dimensional metric spaces. The kNN query is one of the hardest problems on multi-dimensional data, especially when the dimensionality of the data is high. iDistance is designed to process kNN queries in high-dimensional spaces efficiently and performs extremely well for skewed data distributions, which usually occur in real-life data sets. iDistance employs a two-phase search strategy involving an initial filtering of candidate regions and a subsequent refinement of results, an approach aligned with the Filter and Refine Principle (FRP). This means that the index first prunes the search space to eliminate unlikely candidates, then verifies the true nearest neighbors in a refinement step, following the general FRP paradigm used in database search algorithms. The iDistance index can also be augmented with machine learning models to learn data distributions for improved searching and storage of multi-dimensional data. == Indexing == Building the iDistance index has two steps: A number of reference points in the data space are chosen. There are various ways of choosing reference points. Using cluster centers as reference points is the most efficient way. The data points are partitioned into Voronoi cells based on well-chosen reference points. The distance between a data point and its closest reference point is calculated. This distance plus a scaling value is called the point's iDistance. By this means, points in a multi-dimensional space are mapped to one-dimensional values, and then a B+-tree can be adopted to index the points using the iDistance as the key. The figure on the right shows an example where three reference points (O1, O2, O3) are chosen. The data points are then mapped to a one-dimensional space and indexed in a B+-tree. Various extensions have been proposed to make the selection of reference points for effective query performance, including employing machine learning to learn the identification of reference points. == Query processing == To process a kNN query, the query is mapped to a number of one-dimensional range queries, which can be processed efficiently on a B+-tree. In the above figure, the query Q is mapped to a value in the B+-tree while the kNN search ``sphere" is mapped to a range in the B+-tree. The search sphere expands gradually until the k NNs are found. This corresponds to gradually expanding range searches in the B+-tree. The iDistance technique can be viewed as a way of accelerating the sequential scan. Instead of scanning records from the beginning to the end of the data file, the iDistance starts the scan from spots where the nearest neighbors can be obtained early with a very high probability. == Applications == The iDistance has been used in many applications including Image retrieval Video indexing Similarity search in P2P systems Mobile computing Recommender system == Historical background == The iDistance was first proposed by Cui Yu, Beng Chin Ooi, Kian-Lee Tan and H. V. Jagadish in 2001. Later, together with Rui Zhang, they improved the technique and performed a more comprehensive study on it in 2005.

    Read more →
  • Bioz

    Bioz

    Bioz is a search engine for life science experimentation. == History == Bioz was founded by Karin Lachmi and Daniel Levitt. Lachmi is a scientist who completed her postdoc in molecular and cellular biology at the Stanford University School of Medicine. During her lab work she found little available data regarding preferable lab tools, reagents and related products for experimentation. There are 50,000 vendors selling 300 million scientific products. She decided to start the company in order to provide researchers with adequate information for that purpose. Co-founder Daniel Levitt is an entrepreneur who sold his company WebAppoint to Microsoft in the year 2000. He also co-founded the company StemRad. At Bioz, Lachmi serves as the Chief Scientific Officer and Levitt serves as the chief executive officer. Bioz claims to have over a million researcher-users from 196 countries. Among the investors are Esther Dyson and the Stanford-StartX Fund. The company's advisory board includes Nobel Laureates in Chemistry Michael Levitt, Roger Kornberg, and Ada Yonath. == Technology == The company uses artificial intelligence, machine learning and natural language processing in order to extract experimentation data from scientific articles, such as the products that researchers used, the companies that supply the products, the protocol conditions that researchers selected, and the types of experiments and techniques. The algorithm ranks products based on how frequently they were used by researchers in their experiments, how recently a product was used, and the impact factor of the journal. The algorithm's output is a Bioz stars score for each product that was mentioned in an article. Bioz is a data-driven platform for product recommendations, which is contrary to platforms such as TripAdvisor and OpenTable that are based on user-generated reviews and ratings. The recommendations and scoring system that the company has developed are meant to assist researchers with the process of developing future medications and finding cures for diseases. They are guided towards products and techniques that were previously used by other researchers when planning and performing experiments. The company's revenue is based on selling SaaS subscriptions to researchers in biopharma companies. They also charge product suppliers for content syndication.

    Read more →
  • Homogeneity blockmodeling

    Homogeneity blockmodeling

    In mathematics applied to analysis of social structures, homogeneity blockmodeling is an approach in blockmodeling, which is best suited for a preliminary or main approach to valued networks, when a prior knowledge about these networks is not available. This is because homogeneity blockmodeling emphasizes the similarity of link (tie) strengths within the blocks over the pattern of links. In this approach, tie (link) values (or statistical data computed on them) are assumed to be equal (homogenous) within blocks. This approach to the generalized blockmodeling of valued networks was first proposed by Aleš Žiberna in 2007 with the basic idea, "that the inconsistency of an empirical block with its ideal block can be measured by within block variability of appropriate values". The newly–formed ideal blocks, which are appropriate for blockmodeling of valued networks, are then presented together with the definitions of their block inconsistencies. Similar approach to the homogeneity blockmodeling, dealing with direct approach for structural equivalence, was previously suggested by Stephen P. Borgatti and Martin G. Everett (1992).

    Read more →
  • Fake nude photography

    Fake nude photography

    Fake nude photography is the creation of nude photographs designed to appear as genuine nudes of an individual. The motivations for the creation of these modified photographs include curiosity, sexual gratification, the stigmatization or embarrassment of the subject, and commercial gain, such as through the sale of the photographs via pornographic websites. Fakes can be created using image editing software or through machine learning. Fake images created using the latter method are called deepfakes. == History == Magazines such as Celebrity Skin published non-fake paparazzi shots and illicitly obtained nude photos, showing there was a market for such images. Subsequently, some websites hosted fake nude or pornographic photos of celebrities, which are sometimes referred to as celebrity fakes. In the 1990s and 2000s, fake nude images of celebrities proliferated on Usenet and on websites, leading to campaigns to take legal action against the creators of the images and websites dedicated to determining the veracity of nude photos. "Deepfakes", which use artificial neural networks to superimpose one person's face into an image or video of someone else, were popularized in the late 2010s, leading to concerns about the technology's use in fake news and revenge porn. Fake nude photography is sometimes confused with Deepfake pornography, but the two are distinct. Fake nude photography typically starts with human-made non-sexual images, and merely makes it appear that the people in them are nude (but not having sex). Deepfake pornography typically starts with human-made sexual (pornographic) images or videos, and alters the actors' facial features to make the participants in the sexual act look like someone else. === DeepNude === In June 2019, a downloadable Windows and Linux application called DeepNude was released which used a Generative Adversarial Network to remove clothing from images of women. The images it produced were typically not pornographic, merely nude. Because there were more images of nude women than men available to its creator, the images it produced were all female, even when the original was male. The app had both a paid and unpaid version. A few days later, on June 27, the creators removed the application and refunded consumers, although various copies of the app, both free and for charge, continue to exist. On GitHub, the open-source version of this program called "open-deepnude" was deleted. The open-source version had the advantage of allowing it to be trained on a larger dataset of nude images to increase the resulting nude image's accuracy level. A successor free software application, Dreamtime, was later released, and some copies of it remain available, though some have been suppressed. === Deepfake Telegram Bot === In July 2019 a deepfake bot service was launched on messaging app Telegram that used AI technology to create nude images of women. The service was free and enabled users to submit photos and receive manipulated nude images within minutes. The service was connected to seven Telegram channels, including the main channel that hosts the bot, technical support, and image sharing channels. While the total number of users was unknown, the main channel had over 45,000 members. As of July 2020, it is estimated that approximately 24,000 manipulated images had been shared across the image sharing channels. === Nudify websites === By late 2024, most ways to produce nude images from photographs of clothed people were accessible at websites rather than in apps, and required payment. == Purposes == The reasons for the creation of nude photos may range from a need to discredit the target publicly, personal hatred for the target, or the promise of pecuniary gains for such work on the part of the creator of such photos. Fake nude photos often target prominent figures such as businesspeople or politicians. == Notable cases == In 2010, 97 people were arrested in Korea after spreading fake nude pictures of the group Girls' Generation on the internet. In 2011, a 53-year-old Incheon man was arrested after spreading more fake pictures of the same group. In 2012, South Korean police identified 157 Korean artists of whom fake nudes were circulating. In 2012, when Liu Yifei's fake nude photography released on the network, Liu Yifei Red Star Land Company declared a legal search to find out who created and released the photos. In the same year, Chinese actor Huang Xiaoming released nude photos that sparked public controversy, but they were ultimately proven to be real pictures. In 2014, supermodel Kate Upton threatened to sue a website for posting her fake nude photos. Previously, in 2011, this page was threatened by Taylor Swift. In November 2014, singer Rain was angry because of a fake nude photo that spread throughout the internet. Information reveals that: "Rain's nude photo was released from Kim Tae-hee's lost phone." Rain's label, Cube Entertainment, stated that the person in the nude photo is not Rain and the company has since stated that it will take strict legal action against those who post photos together with false comments. In July 2018, Seoul police launched an investigation after a fake nude photo of President Moon Jae-in was posted on the website of the Korean radical feminist group WOMAD. In early 2019, Alexandria Ocasio-Cortez, a Democratic politician, was berated by other political parties over a fake nude photo of her in the bathroom. The picture created a huge wave of media controversy in the United States. == Methods == Fake nude images can be created using image editing software or neural network applications. There are two basic methods: Combine and superimpose existing images onto source images, adding the face of the subject onto a nude model. Remove clothes from the source image to make it look like a nude photo. == Impact == Images of this type may have a negative psychological impact on the victims and may be used for extortion purposes.

    Read more →
  • Operational taxonomic unit

    Operational taxonomic unit

    An operational taxonomic unit (OTU) is an operational definition used to classify groups of closely related individuals. The term was originally introduced in 1963 by Robert R. Sokal and Peter H. A. Sneath in the context of numerical taxonomy, where an "operational taxonomic unit" is simply the group of organisms currently being studied. In this sense, an OTU is a pragmatic definition to group individuals by similarity, equivalent to but not necessarily in line with classical Linnaean taxonomy or modern evolutionary taxonomy. Nowadays, however, the term is commonly used in a different context and refers to clusters of (uncultivated or unknown) organisms, grouped by DNA sequence similarity of a specific taxonomic marker gene (originally coined as mOTU; molecular OTU). In other words, OTUs are pragmatic proxies for "species" at different taxonomic levels, in the absence of traditional systems of biological classification as are available for macroscopic organisms. For several years, OTUs have been the most commonly used units of diversity, especially when analysing small subunit 16S (for prokaryotes) or 18S rRNA (for eukaryotes) marker gene sequence datasets. == Molecular OTU by clustering of marker gene sequences == In the approach represented by DNA barcoding, a particular locus is chosen to be used as the marker gene for classification. This locus should be universally present in the scope selected, variable enough to be different among close-related species, and be flanked by conservative sequences that allow for easy amplification and detection. There are databases containing sequences for such marker genes from many different species, allowing for comparison. (Sometimes only using one locus does not provide sufficient resolution, so multiple marker genes are used. This is the case for plants, where rbcL+matK is common.) Sequences obtained this way can be clustered according to their similarity to one another, and operational taxonomic units are defined based on the similarity threshold set by the researcher. The exact threshold depends on the taxa in question and the mutational rates of the selected locus in the taxon. 97–99% are commonly used, but "it is now recognized to be somewhat arbitrary as sequence variation within and among species varies across taxa". 100% similarity (fully identical) is also common, also known as single variants. It remains debatable how well this commonly used method recapitulates true microbial species phylogeny or ecology. Although OTUs can be calculated differently when using different algorithms or thresholds, research by Schmidt et al. (2014) demonstrated that 16S-derived microbial OTUs were generally ecologically consistent across habitats and several clustering approaches. The number of OTUs defined may be inflated due to errors in DNA sequencing. === OTU clustering approaches === There are three main approaches to clustering OTUs: De novo, for which the clustering is based on similarities between sequencing reads. Closed-reference, for which the clustering is performed against a reference database of sequences. Open-reference, where clustering is first performed against a reference database of sequences, then any remaining sequences that could not be mapped to the reference are clustered de novo. Using a reference provides taxonomic context for the OTUs found. Alternatively, taxonomic context can be found after the construction of clusters by comparing representative sequences from clusters against a reference database. There are also specialized classifiers for this purpose which are much faster than naive comparison using BLAST. === OTU clustering algorithms === Hierarchical clustering algorithms (HCA): uclust & cd-hit & ESPRIT Bayesian clustering: CROP == Molecular OTU by other methods == In addition to similarity-based grouping, marker gene sequences can be sorted into OTUs using molecular phylogeny, k-mer composition, or hybrid methods combining these methods with similarity. There are also Bayesian tree-less methods and machine learning approaches. Using phylogeny often involves manually assigning terminal clades or single nodes to an OTU, so this is usually only done for refinement. Genome skimming can be used to obtain high-copy DNA without the need to choose marker genes or to design PCR primers for the chosen genes. It can provide fairly good coverage of organelle DNA and repetitive elements such as ribosomal DNA, both of which can be used like marker genes in OTU analysis. Whole-genome sequencing is more expensive and involves the production and processing of more data. By considering the entire genome, many (sometimes over 100) marker genes can be used at the same time, producing highly resolved phylogenies that correctly identify problematic taxa. It is also possible to use entire genomes for OTU assignment. For example, genomes from different bacterial species almost always have an average nucleotide identity lower than 95%, a fact that can be used to define new OTUs (and likely new species).

    Read more →
  • FERET database

    FERET database

    The Facial Recognition Technology (FERET) database is a dataset used for facial recognition system evaluation as part of the Face Recognition Technology (FERET) program. It was first established in 1993 under a collaborative effort between Harry Wechsler at George Mason University and Jonathon Phillips at the Army Research Laboratory in Adelphi, Maryland. The FERET database serves as a standard database of facial images for researchers to use to develop various algorithms and report results. The use of a common database also allowed one to compare the effectiveness of different approaches in methodology and gauge their strengths and weaknesses. The facial images for the database were collected between December 1993 and August 1996, accumulating a total of 14,126 images pertaining to 1,199 individuals along with 365 duplicate sets of images that were taken on a different day. In 2003, the Defense Advanced Research Projects Agency (DARPA) released a high-resolution, 24-bit color version of these images. The dataset tested includes 2,413 still facial images, representing 856 individuals. The FERET database has been used by more than 460 research groups and is managed by the National Institute of Standards and Technology (NIST).

    Read more →
  • Santa Fe Trail problem

    Santa Fe Trail problem

    The Santa Fe Trail problem is a genetic programming exercise in which artificial ants search for food pellets according to a programmed set of instructions. The layout of food pellets in the Santa Fe Trail problem has become a standard for comparing different genetic programming algorithms and solutions. One method for programming and testing algorithms on the Santa Fe Trail problem is by using the NetLogo application. There is at least one case of a student creating a Lego robotic ant to solve the problem.

    Read more →
  • Rademacher complexity

    Rademacher complexity

    In computational learning theory (machine learning and theory of computation), Rademacher complexity, named after Hans Rademacher, measures richness of a class of sets with respect to a probability distribution. The concept can also be extended to real valued functions. == Definitions == === Rademacher complexity of a set === Given a set A ⊆ R m {\displaystyle A\subseteq \mathbb {R} ^{m}} , the Rademacher complexity of A is defined as follows: Rad ⁡ ( A ) := 1 m E σ [ sup a ∈ A ∑ i = 1 m σ i a i ] {\displaystyle \operatorname {Rad} (A):={\frac {1}{m}}\mathbb {E} _{\sigma }\left[\sup _{a\in A}\sum _{i=1}^{m}\sigma _{i}a_{i}\right]} where σ 1 , σ 2 , … , σ m {\displaystyle \sigma _{1},\sigma _{2},\dots ,\sigma _{m}} are independent random variables drawn from the Rademacher distribution i.e. Pr ( σ i = + 1 ) = Pr ( σ i = − 1 ) = 1 / 2 {\displaystyle \Pr(\sigma _{i}=+1)=\Pr(\sigma _{i}=-1)=1/2} for i ∈ { 1 , 2 , … , m } {\displaystyle i\in \{1,2,\dots ,m\}} , and a = ( a 1 , … , a m ) ∈ A {\displaystyle a=(a_{1},\ldots ,a_{m})\in A} . Some authors take the absolute value of the sum before taking the supremum, but if A {\displaystyle A} is symmetric this makes no difference. === Rademacher complexity of a function class === Let S = { z 1 , z 2 , … , z m } ⊆ Z {\displaystyle S=\{z_{1},z_{2},\dots ,z_{m}\}\subseteq Z} be a sample of points and consider a function class F {\displaystyle {\mathcal {F}}} of real-valued functions over Z {\displaystyle Z} . Then, the empirical Rademacher complexity of F {\displaystyle {\mathcal {F}}} given S {\displaystyle S} is defined as: Rad S ⁡ ( F ) = 1 m E σ [ sup f ∈ F | ∑ i = 1 m σ i f ( z i ) | ] {\displaystyle \operatorname {Rad} _{S}({\mathcal {F}})={\frac {1}{m}}\mathbb {E} _{\sigma }\left[\sup _{f\in {\mathcal {F}}}\left|\sum _{i=1}^{m}\sigma _{i}f(z_{i})\right|\right]} This can also be written using the previous definition: Rad S ⁡ ( F ) = Rad ⁡ ( F ∘ S ) {\displaystyle \operatorname {Rad} _{S}({\mathcal {F}})=\operatorname {Rad} ({\mathcal {F}}\circ S)} where F ∘ S {\displaystyle {\mathcal {F}}\circ S} denotes function composition, i.e.: F ∘ S := { ( f ( z 1 ) , … , f ( z m ) ) ∣ f ∈ F } {\displaystyle {\mathcal {F}}\circ S:=\{(f(z_{1}),\ldots ,f(z_{m}))\mid f\in {\mathcal {F}}\}} The worst case empirical Rademacher complexity is Rad ¯ m ( F ) = sup S = { z 1 , … , z m } Rad S ⁡ ( F ) {\displaystyle {\overline {\operatorname {Rad} }}_{m}({\mathcal {F}})=\sup _{S=\{z_{1},\dots ,z_{m}\}}\operatorname {Rad} _{S}({\mathcal {F}})} Let P {\displaystyle P} be a probability distribution over Z {\displaystyle Z} . The Rademacher complexity of the function class F {\displaystyle {\mathcal {F}}} with respect to P {\displaystyle P} for sample size m {\displaystyle m} is: Rad P , m ⁡ ( F ) := E S ∼ P m [ Rad S ⁡ ( F ) ] {\displaystyle \operatorname {Rad} _{P,m}({\mathcal {F}}):=\mathbb {E} _{S\sim P^{m}}\left[\operatorname {Rad} _{S}({\mathcal {F}})\right]} where the above expectation is taken over an identically independently distributed (i.i.d.) sample S = ( z 1 , z 2 , … , z m ) {\displaystyle S=(z_{1},z_{2},\dots ,z_{m})} generated according to P {\displaystyle P} . == Intuition == The Rademacher complexity is typically applied on a function class of models that are used for classification, with the goal of measuring their ability to classify points drawn from a probability space under arbitrary labellings. When the function class is rich enough, it contains functions that can appropriately adapt for each arrangement of labels, simulated by the random draw of σ i {\displaystyle \sigma _{i}} under the expectation, so that this quantity in the sum is maximized. The Rademacher complexity of a set A {\displaystyle A} can be rewritten as Rad ⁡ ( A ) := 1 m E σ [ sup a ∈ A ∑ i = 1 m σ i a i ] = 1 m 2 m ∑ σ ∈ { − 1 / m , + 1 / m } m [ sup a ∈ A ⟨ σ , a ⟩ ] . {\displaystyle \operatorname {Rad} (A):={\frac {1}{m}}\mathbb {E} _{\sigma }\left[\sup _{a\in A}\sum _{i=1}^{m}\sigma _{i}a_{i}\right]={\frac {1}{{\sqrt {m}}2^{m}}}\sum _{\sigma \in \{-1/{\sqrt {m}},+1/{\sqrt {m}}\}^{m}}\left[\sup _{a\in A}\langle \sigma ,a\rangle \right].} Each term in the summation is the farthest distance of the set A {\displaystyle A} from the origin, along a unit-length direction σ {\displaystyle \sigma } . The directions are along the vertices of a hypercube. Thus, we can also write it as Rad ⁡ ( A ) = 1 2 m 1 2 m − 1 ∑ σ ∈ { − 1 / m , + 1 / m } m / { − 1 , + 1 } [ sup a ∈ A ⟨ σ , a ⟩ − inf a ∈ A ⟨ σ , a ⟩ ] {\displaystyle \operatorname {Rad} (A)={\frac {1}{2{\sqrt {m}}}}{\frac {1}{2^{m-1}}}\sum _{\sigma \in \{-1/{\sqrt {m}},+1/{\sqrt {m}}\}^{m}/\{-1,+1\}}\left[\sup _{a\in A}\langle \sigma ,a\rangle -\inf _{a\in A}\langle \sigma ,a\rangle \right]} Here, the set { − 1 / m , + 1 / m } m / { − 1 , + 1 } {\displaystyle \{-1/{\sqrt {m}},+1/{\sqrt {m}}\}^{m}/\{-1,+1\}} denotes half of the vertices of a hypercube, selected so that each diagonal has exactly one vertex selected. In words, this states that 2 m Rad ⁡ ( A ) {\displaystyle 2{\sqrt {m}}\operatorname {Rad} (A)} is precisely the average width of the set A {\displaystyle A} along all diagonal directions of a hypercube. == Examples == A singleton set has 0 width in any direction, so it has Rademacher complexity 0. The set A = { ( 1 , 1 ) , ( 1 , 2 ) } ⊆ R 2 {\displaystyle A=\{(1,1),(1,2)\}\subseteq \mathbb {R} ^{2}} has average width 1 / 2 {\displaystyle 1/{\sqrt {2}}} along the two diagonal directions of the square, so it has Rademacher complexity 1 / 4 {\displaystyle 1/4} . The unit cube [ 0 , 1 ] m {\displaystyle [0,1]^{m}} has constant width m {\displaystyle {\sqrt {m}}} along the diagonal directions, so it has Rademacher complexity 1 / 2 {\displaystyle 1/2} . Similarly, the unit cross-polytope { x ∈ R m : ‖ x ‖ 1 ≤ 1 } {\displaystyle \{x\in \mathbb {R} ^{m}:\|x\|_{1}\leq 1\}} has constant width 2 / m {\displaystyle 2/{\sqrt {m}}} along the diagonal directions, so it has Rademacher complexity 1 / m {\displaystyle 1/m} . == Using the Rademacher complexity == The Rademacher complexity can be used to derive data-dependent upper-bounds on the learnability of function classes. Intuitively, a function-class with smaller Rademacher complexity is easier to learn. === Bounding the representativeness === In machine learning, it is desired to have a training set that represents the true distribution of some sample data S {\displaystyle S} . This can be quantified using the notion of representativeness. Denote by P {\displaystyle P} the probability distribution from which the samples are drawn. Denote by H {\displaystyle H} the set of hypotheses (potential classifiers) and denote by F {\displaystyle {\mathcal {F}}} the corresponding set of error functions, i.e., for every hypothesis h ∈ H {\displaystyle h\in H} , there is a function f h ∈ F {\displaystyle f_{h}\in F} , that maps each training sample (features,label) to the error of the classifier h {\displaystyle h} (note in this case hypothesis and classifier are used interchangeably). For example, in the case that h {\displaystyle h} represents a binary classifier, the error function is a 0–1 loss function, i.e. the error function f h {\displaystyle f_{h}} returns 0 if h {\displaystyle h} correctly classifies a sample and 1 else. We omit the index and write f {\displaystyle f} instead of f h {\displaystyle f_{h}} when the underlying hypothesis is irrelevant. Define: L P ( f ) := E z ∼ P [ f ( z ) ] {\displaystyle L_{P}(f):=\mathbb {E} _{z\sim P}[f(z)]} – the expected error of some error function f ∈ F {\displaystyle f\in {\mathcal {F}}} on the real distribution P {\displaystyle P} ; L S ( f ) := 1 m ∑ i = 1 m f ( z i ) {\displaystyle L_{S}(f):={1 \over m}\sum _{i=1}^{m}f(z_{i})} – the estimated error of some error function f ∈ F {\displaystyle f\in {\mathcal {F}}} on the sample S {\displaystyle S} . The representativeness of the sample S {\displaystyle S} , with respect to P {\displaystyle P} and F {\displaystyle {\mathcal {F}}} , is defined as: Rep P ⁡ ( F , S ) := sup f ∈ F ( L P ( f ) − L S ( f ) ) {\displaystyle \operatorname {Rep} _{P}({\mathcal {F}},S):=\sup _{f\in F}(L_{P}(f)-L_{S}(f))} Smaller representativeness is better, since it provides a way to avoid overfitting: it means that the true error of a classifier is not much higher than its estimated error, and so selecting a classifier that has low estimated error will ensure that the true error is also low. Note however that the concept of representativeness is relative and hence can not be compared across distinct samples. The expected representativeness of a sample can be bounded above by the Rademacher complexity of the function class: If F {\displaystyle {\mathcal {F}}} is a set of functions with range within [ 0 , 1 ] {\displaystyle [0,1]} , then Rad P , m ⁡ ( F ) − ln ⁡ 2 2 m ≤ E S ∼ P m [ Rep P ⁡ ( F , S ) ] ≤ 2 Rad P , m ⁡ ( F ) {\displaystyle \operatorname {Rad} _{P,m}({\mathcal {F}})-{\sqrt {\frac {\ln 2}{2m}}}\leq \mathbb {E} _{S\sim P^{m}}[\operatorname {Rep} _{P}({\

    Read more →
  • Vladimir Batagelj

    Vladimir Batagelj

    Vladimir Batagelj (born June 14, 1948 in Idrija, Yugoslavia) is a Slovenian mathematician and an emeritus professor of mathematics at the University of Ljubljana. He is known for his work in discrete mathematics and combinatorial optimization, particularly analysis of social networks and other large networks (blockmodeling). == Education and career == Vladimir Batagelj completed his Ph.D. at the University of Ljubljana in 1986 under the direction of Tomaž Pisanski. He stayed at the University of Ljubljana as a professor until his retirement, where he was a professor of sociology and statistics, while also being a chair of the Department of Sociology of the Faculty of Social Sciences. As visiting professor, he was taught at the University of Pittsburgh (1990-91) and at the University of Konstanz (2002). He was also a member of editorial boards of two journals: Informatica and Journal of Social Structure. His work has been cited over 11000 times. His book Exploratory Social Network Analysis with Pajek on blockmodeling, coauthored with Wouter de Nooy and Andrej Mrvar, is Batagelj's most cited work and has over 3300 citations. The book was translated into Chinese and Japanese. The revised and expanded third edition has been published by Cambridge University Press. In 1975, 11 years before completing his PhD, Batagelj published a solo paper in Communications of the ACM. Batagelj authored more than 20 textbooks in Slovenian, covering topics like TeX, combinatorics and discrete mathematics. He has also written extensively in the Slovenian popular science journal Presek. Batagelj has advised 9 Ph.D. students. == Pajek == Batagelj is particularly known for his work on Pajek, a freely available software for analysis and visualization of large networks. He began work on Pajek in 1996 with Andrej Mrvar, who was then his PhD student. == Awards and honors == First prizes for contributions (with Andrej Mrvar) to Graph Drawing Contests in years: 1995, 1996, 1997, 1998, 1999, 2000 and 2005 / Graph Drawing Hall of Fame. In 2007 the book Generalized blockmodeling was awarded the Harrison White Outstanding Book Award by the Mathematical Sociology Section of American Sociological Association In 2007 he was awarded (together with Anuška Ferligoj) the Simmel Award by INSNA. In 2013, Vladimir Batagelj and Andrej Mrvar received the INSNA's William D. Richards Software award for their work on Pajek. == Selected bibliography == Vladimir Batagelj, Social Network Analysis, Large-Scale [1]. in R.A. Meyers, ed., Encyclopedia of Complexity and Systems Science, Springer 2009: 8245–8265. Vladimir Batagelj, Complex Networks, Visualization of [2]. in R.A. Meyers, ed., Encyclopedia of Complexity and Systems Science, Springer 2009: 1253–1268. Wouter de Nooy, Andrej Mrvar, Vladimir Batagelj, Mark Granovetter (Series Editor), Exploratory Social Network Analysis with Pajek (Structural Analysis in the Social Sciences), Cambridge University Press 2005 (ISBN 0-521-60262-9). ESNA in Japanese, TDU, 2010. Patrick Doreian, Vladimir Batagelj, Anuška Ferligoj, Mark Granovetter (Series Editor), Generalized Blockmodeling (Structural Analysis in the Social Sciences), Cambridge University Press 2004 (ISBN 0-521-84085-6)

    Read more →
  • Population model (evolutionary algorithm)

    Population model (evolutionary algorithm)

    The population model of an evolutionary algorithm (EA) describes the structural properties of its population to which its members are subject. A population is the set of all proposed solutions of an EA considered in one iteration, which are also called individuals according to the biological role model. The individuals of a population can generate further individuals as offspring with the help of the genetic operators of the procedure. The simplest and widely used population model in EAs is the global or panmictic model, which corresponds to an unstructured population. It allows each individual to choose any other individual of the population as a partner for the production of offspring by crossover, whereby the details of the selection are irrelevant as long as the fitness of the individuals plays a significant role. Due to global mate selection, the genetic information of even slightly better individuals can prevail in a population after a few generations (iteration of an EA), provided that no better other offspring have emerged in this phase. If the solution found in this way is not the optimum sought, that is called premature convergence. This effect can be observed more often in panmictic populations. In nature global mating pools are rarely found. What prevails is a certain and limited isolation due to spatial distance. The resulting local neighbourhoods initially evolve independently and mutants have a higher chance of persisting over several generations. As a result, genotypic diversity in the gene pool is preserved longer than in a panmictic population. It is therefore obvious to divide the previously global population by substructures. Two basic models were introduced for this purpose, the island models, which are based on a division of the population into fixed subpopulations that exchange individuals from time to time, and the neighbourhood models, which assign individuals to overlapping neighbourhoods, also known as cellular genetic or evolutionary algorithms (cGA or cEA). The associated division of the population also suggests a corresponding parallelization of the procedure. For this reason, the topic of population models is also frequently discussed in the literature in connection with the parallelization of EAs. == Island models == In the island model, also called the migration model or coarse grained model, evolution takes place in strictly divided subpopulations. These can be organised panmictically, but do not have to be. From time to time an exchange of individuals takes place, which is called migration. The time between an exchange is called an epoch and its end can be triggered by various criteria: E.g. after a given time or given number of completed generations, or after the occurrence of stagnation. Stagnation can be detected, for example, by the fact that no fitness improvement has occurred in the island for a given number of generations. Island models introduce a variety of new strategy parameters: Number of subpopulations Size of the subpopulations Neighbourhood relations between islands: they determine which islands are considered neighbouring and can thus exchange individuals, see picture of a simple unidirectional ring (black arrows) and its extension by additional bidirectional neighbourhood relations (additional green arrows) Criteria for the termination of an epoch, synchronous or asynchronous migration Migration rate: number or proportion of individuals involved in migration. Migrant selection: There are many alternatives for this. E.g. the best individuals can replace the worst or randomly selected ones. Depending on the migration rate, this can affect one or more individuals at a time. With these parameters, the selection pressure can be influenced to a considerable extent. For example, it increases with the interconnectedness of the islands and decreases with the number of subpopulations or the epoch length. == Neighbourhood models or cellular evolutionary algorithms == The neighbourhood model, also called diffusion model or fine grained model, defines a topological neighbouhood relation between the individuals of a population that is independent of their phenotypic properties. The fundamental idea of this model is to provide the EA population with a special structure defined as a connected graph, in which each vertex is an individual that communicates with its nearest neighbours. Particularly, individuals are conceptually set in a toroidal mesh, and are only allowed to recombine with close individuals. This leads to a kind of locality known as isolation by distance. The set of potential mates of an individual is called its neighbourhood or deme. The adjacent figure illustrates that by showing two slightly overlapping neighbourhoods of two individuals marked yellow, through which genetic information can spread between the two demes. It is known that in this kind of algorithm, similar individuals tend to cluster and create niches that are independent of the deme boundaries and, in particular, can be larger than a deme. There is no clear borderline between adjacent groups, and close niches could be easily colonized by competitive ones and maybe merge solution contents during this process. Simultaneously, farther niches can be affected more slowly. EAs with this type of population are also well known as cellular EAs (cEA) or cellular genetic algorithms (cGA). A commonly used structure for arranging the individuals of a population is a 2D toroidal grid, although the number of dimensions can be easily extended (to 3D) or reduced (to 1D, e.g. a ring, see the figure on the right). The neighbourhood of a particular individual in the grid is defined in terms of the Manhattan distance from it to others in the population. In the basic algorithm, all the neighbourhoods have the same size and identical shapes. The two most commonly used neighbourhoods for two-dimensional cEAs are L5 and C9, see the figure on the left. Here, L stands for Linear while C stands for Compact. Each deme represents a panmictic subpopulation within which mate selection and the acceptance of offspring takes place by replacing the parent. The rules for the acceptance of offspring are local in nature and based on the neighbourhood: for example, it can be specified that the best offspring must be better than the parent being replaced or, less strictly, only better than the worst individual in the deme. The first rule is elitist and creates a higher selective pressure than the second non-elitist rule. In elitist EAs, the best individual of a population always survives. In this respect, they deviate from the biological model. The overlap of the neighbourhoods causes a mostly slow spread of genetic information across the neighbourhood boundaries, hence the name diffusion model. A better offspring now needs more generations than in panmixy to spread in the population. This promotes the emergence of local niches and their local evolution, thus preserving genotypic diversity over a longer period of time. The result is a better and dynamic balance between breadth and depth search adapted to the search space during a run. Depth search takes place in the niches and breadth search in the niche boundaries and through the evolution of the different niches of the whole population. For the same neighbourhood size, the spread of genetic information is larger for elongated figures like L9 than for a block like C9, and again significantly larger than for a ring. This means that ring neighbourhoods are well suited for achieving high quality results, even if this requires comparatively long run times. On the other hand, if one is primarily interested in fast and good, but possibly suboptimal results, 2D topologies are more suitable. == Comparison == When applying both population models to genetic algorithms, evolutionary strategy and other EAs, the splitting of a total population into subpopulations usually reduces the risk of premature convergence and leads to better results overall more reliably and faster than would be expected with panmictic EAs. Island models have the disadvantage compared to neighbourhood models that they introduce a large number of new strategy parameters. Despite the existing studies on this topic in the literature, a certain risk of unfavourable settings remains for the user. With neighbourhood models, on the other hand, only the size of the neighbourhood has to be specified and, in the case of the two-dimensional model, the choice of the neighbourhood figure is added. == Parallelism == Since both population models imply population partitioning, they are well suited as a basis for parallelizing an EA. This applies even more to cellular EAs, since they rely only on locally available information about the members of their respective demes. Thus, in the extreme case, an independent execution thread can be assigned to each individual, so that the entire cEA can run on a parallel hardware platform. The island model also supports p

    Read more →