AI Data Defense

AI Data Defense — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Infogram

    Infogram

    Infogram is a web-based data visualization and infographics platform, created in Riga, Latvia. It allows people to make and share digital charts, infographics and maps. Infogram offers an intuitive WYSIWYG editor that converts users’ data into infographics that can be published, embedded or shared. Users do not need coding skills to use this tool; users include newsrooms, marketing teams, governments, educators and students. The company that created Infogram, also called Infogram, was founded in 2012 in Riga, Latvia and has another office in San Francisco. As of October 2017, Infogram says it has 3 million users who have created charts and infographics that have been viewed more than 1.5 billion times. Infogram was bought by Prezi, a web-based presentation software company, in May 2017. == History == Infogram was founded in February 2012 in Riga, Latvia by Uldis Leiterts, Raimonds Kaže and Alise Dīrika. In January 2013, Infogram won the international Hy Berlin pitch contest. During his pitch, Infogram CEO Uldis Leiterts announced that the company had created more templates and was working with Microsoft to integrate its platform with the contemporaneous version of Microsoft Office. The company also won the 2013 Kantar Information Is Beautiful Award, which “celebrates excellence and beauty in data visualizations, infographics, interactives & information art.” In December 2014, Infogram acquired the Brazil-based data visualization blog, Visualoop. In an effort to expand sales and marketing in the U.S., Infogram secured $1.8 million in funding in February 2014. The announcement was made at TechChill, a startup conference for the Baltics in Riga, Latvia. At the time, the funding was believed to be the largest to date for the company. Infogram won the 2017 National Design Award of Latvia. == Acquisition by Prezi == Prezi, a web-based presentation software company, acquired Infogram in May 2017. Infogram is now a wholly owned subsidiary of Prezi. Infogram was rated #1 on Forbes’ list of “The Best Infographic Tools for 2017,” which was published in September 2017. In October 2017, Infogram announced a new version of its data visualization platform, including a drag-and-drop editor, over 40 new designer templates and social media support.

    Read more →
  • LIBSVM

    LIBSVM

    LIBSVM and LIBLINEAR are two popular open source machine learning libraries, both developed at the National Taiwan University and both written in C++ though with a C API. LIBSVM implements the sequential minimal optimization (SMO) algorithm for kernelized support vector machines (SVMs), supporting classification and regression. LIBLINEAR implements linear SVMs and logistic regression models trained using a coordinate descent algorithm. The SVM learning code from both libraries is often reused in other open source machine learning toolkits, including GATE, KNIME, Orange and scikit-learn. Bindings and ports exist for programming languages such as Java, MATLAB, R, Julia, and Python. It is available in e1071 library in R and scikit-learn in Python. Both libraries are free software released under the 3-clause BSD license.

    Read more →
  • Crossover (evolutionary algorithm)

    Crossover (evolutionary algorithm)

    Crossover in evolutionary algorithms and evolutionary computation, also called recombination, is a genetic operator used to combine the genetic information of two parents to generate new offspring. It is one way to stochastically generate new solutions from an existing population, and is analogous to the crossover that happens during sexual reproduction in biology. New solutions can also be generated by cloning an existing solution, which is analogous to asexual reproduction. Newly generated solutions may be mutated before being added to the population. The aim of recombination is to transfer good characteristics from two different parents to one child. Different algorithms in evolutionary computation may use different data structures to store genetic information, and each genetic representation can be recombined with different crossover operators. Typical data structures that can be recombined with crossover are bit arrays, vectors of real numbers, or trees. The list of operators presented below is by no means complete and serves mainly as an exemplary illustration of this dyadic genetic operator type. More operators and more details can be found in the literature. == Crossover for binary arrays == Traditional genetic algorithms store genetic information in a chromosome represented by a bit array. Crossover methods for bit arrays are popular and an illustrative example of genetic recombination. === One-point crossover === A point on both parents' chromosomes is picked randomly, and designated a 'crossover point'. Bits to the right of that point are swapped between the two parent chromosomes. This results in two offspring, each carrying some genetic information from both parents. === Two-point and k-point crossover === In two-point crossover, two crossover points are picked randomly from the parent chromosomes. The bits in between the two points are swapped between the parent organisms. Two-point crossover is equivalent to performing two single-point crossovers with different crossover points. This strategy can be generalized to k-point crossover for any positive integer k, picking k crossover points. === Uniform crossover === In uniform crossover, typically, each bit is chosen from either parent with equal probability. Other mixing ratios are sometimes used, resulting in offspring which inherit more genetic information from one parent than the other. In a uniform crossover, we don’t divide the chromosome into segments, rather we treat each gene separately. In this, we essentially flip a coin for each chromosome to decide whether or not it will be included in the off-spring. == Crossover for integer or real-valued genomes == For the crossover operators presented above and for most other crossover operators for bit strings, it holds that they can also be applied accordingly to integer or real-valued genomes whose genes each consist of an integer or real-valued number. Instead of individual bits, integer or real-valued numbers are then simply copied into the child genome. The offspring lie on the remaining corners of the hyperbody spanned by the two parents P 1 = ( 1.5 , 6 , 8 ) {\displaystyle P_{1}=(1.5,6,8)} and P 2 = ( 7 , 2 , 1 ) {\displaystyle P_{2}=(7,2,1)} , as exemplified in the accompanying image for the three-dimensional case. === Discrete recombination === If the rules of the uniform crossover for bit strings are applied during the generation of the offspring, this is also called discrete recombination. === Intermediate recombination === In this recombination operator, the allele values of the child genome a i {\displaystyle a_{i}} are generated by mixing the alleles of the two parent genomes a i , P 1 {\displaystyle a_{i,P_{1}}} and a i , P 2 {\displaystyle a_{i,P_{2}}} : α i = α i , P 1 ⋅ β i + α i , P 2 ⋅ ( 1 − β i ) w i t h β i ∈ [ − d , 1 + d ] {\displaystyle \alpha _{i}=\alpha _{i,P_{1}}\cdot \beta _{i}+\alpha _{i,P_{2}}\cdot \left(1-\beta _{i}\right)\quad {\mathsf {with}}\quad \beta _{i}\in \left[-d,1+d\right]} randomly equally distributed per gene i {\displaystyle i} The choice of the interval [ − d , 1 + d ] {\displaystyle [-d,1+d]} causes that besides the interior of the hyperbody spanned by the allele values of the parent genes additionally a certain environment for the range of values of the offspring is in question. A value of 0.25 {\displaystyle 0.25} is recommended for d {\displaystyle d} to counteract the tendency to reduce the allele values that otherwise exists at d = 0 {\displaystyle d=0} . The adjacent figure shows for the two-dimensional case the range of possible new alleles of the two exemplary parents P 1 = ( 3 , 6 ) {\displaystyle P_{1}=(3,6)} and P 2 = ( 9 , 2 ) {\displaystyle P_{2}=(9,2)} in intermediate recombination. The offspring of discrete recombination C 1 {\displaystyle C_{1}} and C 2 {\displaystyle C_{2}} are also plotted. Intermediate recombination satisfies the arithmetic calculation of the allele values of the child genome required by virtual alphabet theory. Discrete and intermediate recombination are used as a standard in the evolution strategy. == Crossover for permutations == For combinatorial tasks, permutations are usually used that are specifically designed for genomes that are themselves permutations of a set. The underlying set is usually a subset of N {\displaystyle \mathbb {N} } or N 0 {\displaystyle \mathbb {N} _{0}} . If 1- or n-point or uniform crossover for integer genomes is used for such genomes, a child genome may contain some values twice and others may be missing. This can be remedied by genetic repair, e.g. by replacing the redundant genes in positional fidelity for missing ones from the other child genome. In order to avoid the generation of invalid offspring, special crossover operators for permutations have been developed which fulfill the basic requirements of such operators for permutations, namely that all elements of the initial permutation are also present in the new one and only the order is changed. It can be distinguished between combinatorial tasks, where all sequences are admissible, and those where there are constraints in the form of inadmissible partial sequences. A well-known representative of the first task type is the traveling salesman problem (TSP), where the goal is to visit a set of cities exactly once on the shortest tour. An example of the constrained task type is the scheduling of multiple workflows. Workflows involve sequence constraints on some of the individual work steps. For example, a thread cannot be cut until the corresponding hole has been drilled in a workpiece. Such problems are also called order-based permutations. In the following, two crossover operators are presented as examples, the partially mapped crossover (PMX) motivated by the TSP and the order crossover (OX1) designed for order-based permutations. A second offspring can be produced in each case by exchanging the parent chromosomes. === Partially mapped crossover (PMX) === The PMX operator was designed as a recombination operator for TSP like Problems. The explanation of the procedure is illustrated by an example: === Order crossover (OX1) === The order crossover goes back to Davis in its original form and is presented here in a slightly generalized version with more than two crossover points. It transfers information about the relative order from the second parent to the offspring. First, the number and position of the crossover points are determined randomly. The resulting gene sequences are then processed as described below: Among other things, order crossover is well suited for scheduling multiple workflows, when used in conjunction with 1- and n-point crossover. === Further crossover operators for permutations === Over time, a large number of crossover operators for permutations have been proposed, so the following list is only a small selection. For more information, the reader is referred to the literature. cycle crossover (CX) order-based crossover (OX2) position-based crossover (POS) edge recombination voting recombination (VR) alternating-positions crossover (AP) maximal preservative crossover (MPX) merge crossover (MX) sequential constructive crossover operator (SCX) The usual approach to solving TSP-like problems by genetic or, more generally, evolutionary algorithms, presented earlier, is either to repair illegal descendants or to adjust the operators appropriately so that illegal offspring do not arise in the first place. Alternatively, Riazi suggests the use of a double chromosome representation, which avoids illegal offspring.

    Read more →
  • Geographical cluster

    Geographical cluster

    A geographical cluster is a localized anomaly, usually an excess of something given the distribution or variation of something else. Often it is considered as an incidence rate that is unusual in that there is more of some variable than might be expected. Examples would include: a local excess disease rate, a crime hot spot, areas of high unemployment, accident blackspots, unusually high positive residuals from a model, high concentrations of flora or fauna, physical features or events like earthquake epicenters etc... Identifying these extreme regions may be useful in that there could be implicit geographical associations with other variables that can be identified and would be of interest. Pattern detection via the identification of such geographical clusters is a very simple and generic form of geographical analysis that has many applications in many different contexts. The emphasis is on localized clustering or patterning because this may well contain the most useful information. A geographical cluster is different from a high concentration as it is generally second order, involving the factoring in of the distribution of something else. == Geographical cluster detection == Identifying geographical clusters can be an important stage in a geographical analysis. Mapping the locations of unusual concentrations may help identify causes of these. Some techniques include the Geographical Analysis Machine and Besag and Newell's cluster detection method.

    Read more →
  • Pixlr

    Pixlr

    Pixlr is a group of SaaS creative tools including Pixlr.com, Designs.ai and Vectr.com. Pixlr.com is a cloud-based set of image editing tools and utilities, including AI image generation and enhancements. The Pixlr suite targets users who require subjectively simple, or more advanced, photo editing as well as graphic design. It features a freemium business model with subscription plans—Plus, Premium and Teams. The platform can be used on desktop and also smartphones and tablets. Pixlr is compatible with various image formats such as JPEG, PNG, WEBP, GIF, PSD (Photoshop Document) and PXZ (native Pixlr document format). Designs.ai lets users create content using AI, with a goal of being within two minutes, across different media types including videos, text, banners and audio. Vectr.com was acquired in 2017 before being spun out into Pixlr Group in 2023. == History == Pixlr was founded in 2008 and built on Macromedia Flash. On 19 July 2011, Autodesk announced that they had acquired the Pixlr suite. In 2013, Time listed Pixlr as one of the top 50 websites of the year. In 2017, Pixlr was acquired from Autodesk. It was subsequently rebuilt and relaunched in HTML5 in 2019. In September 2023, Pixlr was awarded as the Top 13 GenAi Web Product by the world's top venture firm Andreessen Horowitz. In November 2023, Pixlr, Designs.ai and Vectr were combined as a new business group named Pixlr Group focusing on generative AI and creative software solutions. In May 2024, Pixlr was featured as one of the top 18 progressive web applications highlighted on Google I/O. == Versions == Pixlr.com rebranded itself as a full creative suite in 2019 by introducing Pixlr X, Pixlr E and Pixlr M. The platform introduced more features in December 2021 with a new logo and added tools which included: Brushes, the 'Heal tool', Animation, and Batch upload. The brush feature enables the creation of hand-drawn effects. The Heal tool allows users to remove unwanted objects from their images whereas the Animation feature can be used to include movements into their edits. Users can also utilize Batch upload to edit up to 50 images simultaneously. In November 2022, Pixlr 2023 was launched, adding more tools such as "AI smart resize", colorization, text wrapping and other additional effects. In November 2023, Pixlr 2024 was launched with Pixlr Designer and new AI-powered updates which includes AI image generation, AI infill, AI inpainting and more.

    Read more →
  • Local tangent space alignment

    Local tangent space alignment

    Local tangent space alignment (LTSA) is a method for manifold learning, which can efficiently learn a nonlinear embedding into low-dimensional coordinates from high-dimensional data, and can also reconstruct high-dimensional coordinates from embedding coordinates. It is based on the intuition that when a manifold is correctly unfolded, all of the tangent hyperplanes to the manifold will become aligned. It begins by computing the k-nearest neighbors of every point. It computes the tangent space at every point by computing the d-first principal components in each local neighborhood. It then optimizes to find an embedding that aligns the tangent spaces, but it ignores the label information conveyed by data samples, and thus can not be used for classification directly.

    Read more →
  • Correspondence analysis

    Correspondence analysis

    Correspondence analysis (CA) is a multivariate statistical technique proposed by Herman Otto Hartley (Hirschfeld) and later developed by Jean-Paul Benzécri. It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data. In a manner similar to principal component analysis, it provides a means of displaying or summarising a set of data in two-dimensional graphical form. Its aim is to display in a biplot any structure hidden in the multivariate setting of the data table. As such it is a technique from the field of multivariate ordination. Since the variant of CA described here can be applied either with a focus on the rows or on the columns it should in fact be called simple (symmetric) correspondence analysis. It is traditionally applied to the contingency table of a pair of nominal variables where each cell contains either a count or a zero value. If more than two categorical variables are to be summarized, a variant called multiple correspondence analysis should be chosen instead. CA may also be applied to binary data given the presence/absence coding represents simplified count data i.e. a 1 describes a positive count and 0 stands for a count of zero. Depending on the scores used CA preserves the chi-square distance between either the rows or the columns of the table. Because CA is a descriptive technique, it can be applied to tables regardless of a significant chi-squared test. Although the χ 2 {\displaystyle \chi ^{2}} statistic used in inferential statistics and the chi-square distance are computationally related they should not be confused since the latter works as a multivariate statistical distance measure in CA while the χ 2 {\displaystyle \chi ^{2}} statistic is in fact a scalar not a metric. == Details == Like principal components analysis, correspondence analysis creates orthogonal components (or axes) and, for each item in a table i.e. for each row, a set of scores (sometimes called factor scores, see Factor analysis). Correspondence analysis is performed on the data table, conceived as matrix C of size m × n where m is the number of rows and n is the number of columns. In the following mathematical description of the method capital letters in italics refer to a matrix while letters in italics refer to vectors. Understanding the following computations requires knowledge of matrix algebra. === Preprocessing === Before proceeding to the central computational step of the algorithm, the values in matrix C have to be transformed. First compute a set of weights for the columns and the rows (sometimes called masses), where row and column weights are given by the row and column vectors, respectively: w m = 1 n C C 1 , w n = 1 n C 1 T C . {\displaystyle w_{m}={\frac {1}{n_{C}}}C\mathbf {1} ,\quad w_{n}={\frac {1}{n_{C}}}\mathbf {1} ^{T}C.} Here n C = ∑ i = 1 n ∑ j = 1 m C i j {\displaystyle n_{C}=\sum _{i=1}^{n}\sum _{j=1}^{m}C_{ij}} is the sum of all cell values in matrix C, or short the sum of C, and 1 {\displaystyle \mathbf {1} } is a column vector of ones with the appropriate dimension. Put in simple words, w m {\displaystyle w_{m}} is just a vector whose elements are the row sums of C divided by the sum of C, and w n {\displaystyle w_{n}} is a vector whose elements are the column sums of C divided by the sum of C. The weights are transformed into diagonal matrices W m = diag ⁡ ( 1 / w m ) {\displaystyle W_{m}=\operatorname {diag} (1/{\sqrt {w_{m}}})} and W n = diag ⁡ ( 1 / w n ) {\displaystyle W_{n}=\operatorname {diag} (1/{\sqrt {w_{n}}})} where the diagonal elements of W n {\displaystyle W_{n}} are 1 / w n {\displaystyle 1/{\sqrt {w_{n}}}} and those of W m {\displaystyle W_{m}} are 1 / w m {\displaystyle 1/{\sqrt {w_{m}}}} respectively i.e. the vector elements are the inverses of the square roots of the masses. The off-diagonal elements are all 0. Next, compute matrix P {\displaystyle P} by dividing C {\displaystyle C} by its sum P = 1 n C C . {\displaystyle P={\frac {1}{n_{C}}}C.} In simple words, Matrix P {\displaystyle P} is just the data matrix (contingency table or binary table) transformed into portions i.e. each cell value is just the cell portion of the sum of the whole table. Finally, compute matrix S {\displaystyle S} , sometimes called the matrix of standardized residuals, by matrix multiplication as S = W m ( P − w m w n ) W n {\displaystyle S=W_{m}(P-w_{m}w_{n})W_{n}} Note, the vectors w m {\displaystyle w_{m}} and w n {\displaystyle w_{n}} are combined in an outer product resulting in a matrix of the same dimensions as P {\displaystyle P} . In words the formula reads: matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} is subtracted from matrix P {\displaystyle P} and the resulting matrix is scaled (weighted) by the diagonal matrices W m {\displaystyle W_{m}} and W n {\displaystyle W_{n}} . Multiplying the resulting matrix by the diagonal matrices is equivalent to multiply the i-th row (or column) of it by the i-th element of the diagonal of W m {\displaystyle W_{m}} or W n {\displaystyle W_{n}} , respectively. === Interpretation of preprocessing === The vectors w m {\displaystyle w_{m}} and w n {\displaystyle w_{n}} are the row and column masses or the marginal probabilities for the rows and columns, respectively. Subtracting matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} from matrix P {\displaystyle P} is the matrix algebra version of double centering the data. Multiplying this difference by the diagonal weighting matrices results in a matrix containing weighted deviations from the origin of a vector space. This origin is defined by matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} . In fact matrix outer ⁡ ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} is identical with the matrix of expected frequencies in the chi-squared test. Therefore S {\displaystyle S} is computationally related to the independence model used in that test. But since CA is not an inferential method the term independence model is inappropriate here. === Orthogonal components === The table S {\displaystyle S} is then decomposed by a singular value decomposition as S = U Σ V ∗ {\displaystyle S=U\Sigma V^{}\,} where U {\displaystyle U} and V {\displaystyle V} are the left and right singular vectors of S {\displaystyle S} and Σ {\displaystyle \Sigma } is a square diagonal matrix with the singular values σ i {\displaystyle \sigma _{i}} of S {\displaystyle S} on the diagonal. Σ {\displaystyle \Sigma } is of dimension p ≤ ( min ( m , n ) − 1 ) {\displaystyle p\leq (\min(m,n)-1)} hence U {\displaystyle U} is of dimension m×p and V {\displaystyle V} is of n×p. As orthonormal vectors U {\displaystyle U} and V {\displaystyle V} fulfill U ∗ U = V ∗ V = I {\displaystyle U^{}U=V^{}V=I} . In other words, the multivariate information that is contained in C {\displaystyle C} as well as in S {\displaystyle S} is now distributed across two (coordinate) matrices U {\displaystyle U} and V {\displaystyle V} and a diagonal (scaling) matrix Σ {\displaystyle \Sigma } . The vector space defined by them has as number of dimensions p, that is the smaller of the two values, number of rows and number of columns, minus 1. === Inertia === While a principal component analysis may be said to decompose the (co)variance, and hence its measure of success is the amount of (co-)variance covered by the first few PCA axes - measured in eigenvalue -, a CA works with a weighted (co-)variance which is called inertia. The sum of the squared singular values is the total inertia I {\displaystyle \mathrm {I} } of the data table, computed as I = ∑ i = 1 p σ i 2 . {\displaystyle \mathrm {I} =\sum _{i=1}^{p}\sigma _{i}^{2}.} The total inertia I {\displaystyle \mathrm {I} } of the data table can also computed directly from S {\displaystyle S} as I = ∑ i = 1 n ∑ j = 1 m s i j 2 . {\displaystyle \mathrm {I} =\sum _{i=1}^{n}\sum _{j=1}^{m}s_{ij}^{2}.} The amount of inertia covered by the i-th set of singular vectors is ι i {\displaystyle \iota _{i}} , the principal inertia. The higher the portion of inertia covered by the first few singular vectors i.e. the larger the sum of the principal inertiae in comparison to the total inertia, the more successful a CA is. Therefore, all principal inertia values are expressed as portion ϵ i {\displaystyle \epsilon _{i}} of the total inertia ϵ i = σ i 2 / ∑ i = 1 p σ i 2 {\displaystyle \epsilon _{i}=\sigma _{i}^{2}/\sum _{i=1}^{p}\sigma _{i}^{2}} and are presented in the form of a scree plot. In fact a scree plot is just a bar plot of all principal inertia portions ϵ i {\displaystyle \epsilon _{i}} . === Coordinates === To transform the singular vectors to coordinates which preserve the chi-square distances between rows or columns an additional weighting step is necessary. The resulting coordinates are called principal coordinates in CA text books. If principal coordinates are used for

    Read more →
  • Types of artificial neural networks

    Types of artificial neural networks

    Types of neural networks (NN) include a family of techniques. The simplest types have static components, including number of units, number of layers, unit weights and topology. Dynamic NNs evolve via learning. Some types allow/require learning to be "supervised" by the operator, while others operate independently. Some types operate purely in hardware, while others are purely software and run on general purpose computers. The main types are: Transformers: these use attention to analyze every token in the input stream against every other token in the stream. That technique has enabled neural networks to reach the general public via chatbots, code generators and many other forms. Convolutional neural networks (CNN): a FNN that uses kernels and regularization to evade problems in prior generations of NNs. They are typically used to analyze visual and other two-dimensional data. Generative adversarial networks set networks (of varying structure) against each other, each trying to push the other(s) to produce better results such as winning a game or to deceive the opponent about the authenticity of an input. == Feedforward == In feedforward neural networks the information moves from the input to output directly in every layer. There can be hidden layers with or without cycles/loops to sequence inputs. Feedforward networks can be constructed with various types of units, such as binary McCulloch–Pitts neurons, the simplest of which is the perceptron. Continuous neurons, frequently with sigmoidal activation, are used in the context of backpropagation. == Group method of data handling == The Group Method of Data Handling (GMDH) features fully automatic structural and parametric model optimization. The node activation functions are Kolmogorov–Gabor polynomials that permit additions and multiplications. It uses a deep multilayer perceptron with eight layers. It is a supervised learning network that grows layer by layer, where each layer is trained by regression analysis. Useless items are detected using a validation set, and pruned through regularization. The size and depth of the resulting network depends on the task. == Autoencoder == An autoencoder, autoassociator or Diabolo network is similar to the multilayer perceptron (MLP) – with an input layer, an output layer and one or more hidden layers connecting them. However, the output layer has the same number of units as the input layer. Its purpose is to reconstruct its own inputs (instead of emitting a target value). Therefore, autoencoders are unsupervised learning models. An autoencoder is used for unsupervised learning of efficient codings, typically for the purpose of dimensionality reduction and for learning generative models of data. == Probabilistic == A probabilistic neural network (PNN) is a four-layer feedforward neural network. The layers are Input, hidden pattern, hidden summation, and output. In the PNN algorithm, the parent probability distribution function (PDF) of each class is approximated by a Parzen window and a non-parametric function. Then, using PDF of each class, the class probability of a new input is estimated and Bayes’ rule is employed to allocate it to the class with the highest posterior probability. It was derived from the Bayesian network and a statistical algorithm called Kernel Fisher discriminant analysis. It is used for classification and pattern recognition. == Time delay == A time delay neural network (TDNN) is a feedforward architecture for sequential data that recognizes features independent of sequence position. In order to achieve time-shift invariance, delays are added to the input so that multiple data points (points in time) are analyzed together. It usually forms part of a larger pattern recognition system. It has been implemented using a perceptron network whose connection weights were trained with back propagation (supervised learning). == Convolutional == A convolutional neural network (CNN, or ConvNet or shift invariant or space invariant) is a class of deep network, composed of one or more convolutional layers with fully connected layers (matching those in typical ANNs) on top. It uses tied weights and pooling layers. In particular, max-pooling. It is often structured via Fukushima's convolutional architecture. They are variations of multilayer perceptrons that use minimal preprocessing. This architecture allows CNNs to take advantage of the 2D structure of input data. Its unit connectivity pattern is inspired by the organization of the visual cortex. Units respond to stimuli in a restricted region of space known as the receptive field. Receptive fields partially overlap, over-covering the entire visual field. Unit response can be approximated mathematically by a convolution operation. CNNs are suitable for processing visual and other two-dimensional data. They have shown superior results in both image and speech applications. They can be trained with standard backpropagation. CNNs are easier to train than other regular, deep, feed-forward neural networks and have many fewer parameters to estimate. Capsule Neural Networks (CapsNet) add structures called capsules to a CNN and reuse output from several capsules to form more stable (with respect to various perturbations) representations. Examples of applications in computer vision include DeepDream and robot navigation. They have wide applications in image and video recognition, recommender systems and natural language processing. == Deep stacking network == A deep stacking network (DSN) (deep convex network) is based on a hierarchy of blocks of simplified neural network modules. It was introduced in 2011 by Deng and Yu. It formulates the learning as a convex optimization problem with a closed-form solution, emphasizing the mechanism's similarity to stacked generalization. Each DSN block is a simple module that is easy to train by itself in a supervised fashion without backpropagation for the entire blocks. Each block consists of a simplified multi-layer perceptron (MLP) with a single hidden layer. The hidden layer h has logistic sigmoidal units, and the output layer has linear units. Connections between these layers are represented by weight matrix U; input-to-hidden-layer connections have weight matrix W. Target vectors t form the columns of matrix T, and the input data vectors x form the columns of matrix X. The matrix of hidden units is H = σ ( W T X ) {\displaystyle {\boldsymbol {H}}=\sigma ({\boldsymbol {W}}^{T}{\boldsymbol {X}})} . Modules are trained in order, so lower-layer weights W are known at each stage. The function performs the element-wise logistic sigmoid operation. Each block estimates the same final label class y, and its estimate is concatenated with original input X to form the expanded input for the next block. Thus, the input to the first block contains the original data only, while downstream blocks' input adds the output of preceding blocks. Then learning the upper-layer weight matrix U given other weights in the network can be formulated as a convex optimization problem: min U T f = ‖ U T H − T ‖ F 2 , {\displaystyle \min _{U^{T}}f=\|{\boldsymbol {U}}^{T}{\boldsymbol {H}}-{\boldsymbol {T}}\|_{F}^{2},} which has a closed-form solution. Unlike other deep architectures, such as DBNs, the goal is not to discover the transformed feature representation. The structure of the hierarchy of this kind of architecture makes parallel learning straightforward, as a batch-mode optimization problem. In purely discriminative tasks, DSNs outperform conventional DBNs. === Tensor deep stacking networks === This architecture is a DSN extension. It offers two important improvements: it uses higher-order information from covariance statistics, and it transforms the non-convex problem of a lower-layer to a convex sub-problem of an upper-layer. TDSNs use covariance statistics in a bilinear mapping from each of two distinct sets of hidden units in the same layer to predictions, via a third-order tensor. While parallelization and scalability are not considered seriously in conventional DNNs, all learning for DSNs and TDSNs is done in batch mode, to allow parallelization. Parallelization allows scaling the design to larger (deeper) architectures and data sets. The basic architecture is suitable for diverse tasks such as classification and regression. == Physics-informed == Such a neural network is designed for the numerical solution of mathematical equations, such as differential, integral, delay, fractional and others. As input parameters, PINN accepts variables (spatial, temporal, and others), transmits them through the network block. At the output, it produces an approximate solution and substitutes it into the mathematical model, considering the initial and boundary conditions. If the solution does not satisfy the required accuracy, one uses the backpropagation and rectify the solution. Besides PINN, other architectures have been developed to produce surrogate models for scientific comput

    Read more →
  • Application enablement

    Application enablement

    Application enablement is an approach which brings telecommunications network providers and developers together to combine their network and web abilities in creating and delivering high demand advanced services and new intelligent applications. Network providers, in addition to bandwidth, provide abilities such as billing, location, presence, and security, which have allowed them to establish long-term relationships with end-users. By offering these select abilities as application programming interfaces (APIs), providers give developers access to a set of tools to create (mashup) new applications and services to run on provider networks. Unifying the strengths of providers and developers facilitates the creation of mash-up applications, and in turn, a better end user quality of experience (QoE) for improved profit margins. Apple's iOS with App Store, and Google's Android with Android Market exemplify this approach. Both have introduced mobile platforms that are supported by a comprehensive ecosystem in order to perpetuate innovation in product design, content and service offerings, and overall consumer behavior. By the end of April 2010, downloadable applications numbered over 200,000 for iPhone and over 50,000 for Android. == Background == Historically, telecommunication providers primarily based their business models on network performance, emphasizing connectivity, availability, and quality of service (QoS) as key sources of revenue and customer value. With the increasing demand for bandwidth-intensive data and video applications, maintaining service continuity has required substantial infrastructure investments. To address rising operational costs and declining average revenue per user (ARPU), providers have increasingly adopted customer-oriented strategies and diversified business models to expand their roles within the telecommunications value chain. Application enablement supports providers in making this transition by providing an environment, or ecosystem, where providers and developers can collaborate to build, test, manage, and distribute applications across networks including television, broadband, Internet, and mobile. This cooperative effort produces mutually beneficial results for all parties, opening up new revenue streams while enhancing value and rate of return (ROI). The following are some examples of key network abilities which function as application enablers in the telecommunications market: Billing systems Security for private transactions Network-based storage of digital content End-to-end bandwidth for high-quality transmissions Scoring abilities to identify end-user preferences and behaviors Subscriber data to customize the end-user experience Context information, such as location and presence, to localize services. == New business models == As network providers work toward effective collaboration with application and content developers, several new business models are emerging to help facilitate the business relationships: === Vendor-led === A type of business model driven by telecommunications vendors, who assist network providers in building relationships with application and content developers to lower the cost and complexity of managing third parties. Examples of this model include: Forum Nokia IBM Technology Partner Ecosystem Ng Connect Huawei Intouch program === Operator-led === Characterized by network providers who want to maintain a high degree of flexibility and control over applications created for their end-consumers, this model lets them create and manage their own developer program, development platform, and application store. Under this arrangement, independent developers provide their own branding, marketing communications, pricing and customer care. Network providers pursuing this model will often seek to partner with a large number of third parties using standardized on-boarding processes. Examples of this model include: o2 Litmus Orange Partner Joint Innovation Lab === Aggregator === Network providers who choose not to create/manage their own developer relationships will partner with one or multiple aggregators, to administer a portion of or their entire application strategy. Examples of this model include: Ovi Operator Partnership Blackberry Operator Partnership Cellmania Buongiorno === Mass wholesale === Select network providers also participate in wholesale models that exist primarily for applications (BT's Ribbit- an Internet Protocol (IP) based calling and messaging platform) and devices (Verizon's Open Device initiative). This business-to-business approach reduces a large portion of the potential costs of third party application enablement (marketing, acquisition and support). Examples of this model include: BT's Ribbit Verizon Wireless ODI AT&T Synaptic Hosting === The enterprise customer === Some network providers are focusing on enabling applications in the enterprise space. In this model, the network provider establishes a platform for their large enterprise customers who want to blend custom software with enhanced abilities, and will provide standardized processes around mobilizing enterprise applications, and exposing core back-office abilities to allow for dynamic customer interaction. Examples of this model include: Vodafone Applications Service Verizon Private Network Sprint Solution Launchpad === Trusted partner === In this model, the network provider builds one-on-one relationships with trusted third-party developers by exposing customized network abilities, bringing a greater variety of brands to the network provider's portfolio. Network providers using this model tend to only have a few partners (in contrast to the operator led model). Under this scenario, network providers benefit from a pre-established customer base and the developer's marketing resources. Examples of this model include: 3/Skype Partnership (UK) Virgin Media and BBC iPlayer == Network operator developer resources == Operator led model o2 Litmus Orange Partner Joint Innovations Lab Aggregator model Ovi Operator Partnership Cellmania Buongiorno Mass wholesale model BT Ribbit Verizon Wireless ODI AT&T Synaptic Hosting Enterprise customer model Vodafone Applications Service Verizon Private Network Sprint Solution Launchpad == Rerencesfe ==

    Read more →
  • Nearest centroid classifier

    Nearest centroid classifier

    In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation. When applied to text classification using word vectors containing tfidf weights to represent documents, the nearest centroid classifier is known as the Rocchio classifier because of its similarity to the Rocchio algorithm for relevance feedback. An extended version of the nearest centroid classifier has found applications in the medical domain, specifically classification of tumors. == Algorithm == === Training === Given labeled training samples { ( x → 1 , y 1 ) , … , ( x → n , y n ) } {\displaystyle \textstyle \{({\vec {x}}_{1},y_{1}),\dots ,({\vec {x}}_{n},y_{n})\}} with class labels y i ∈ Y {\displaystyle y_{i}\in \mathbf {Y} } , compute the per-class centroids μ → ℓ = 1 | C ℓ | ∑ i ∈ C ℓ x → i {\displaystyle \textstyle {\vec {\mu }}_{\ell }={\frac {1}{|C_{\ell }|}}{\underset {i\in C_{\ell }}{\sum }}{\vec {x}}_{i}} where C ℓ {\displaystyle C_{\ell }} is the set of indices of samples belonging to class ℓ ∈ Y {\displaystyle \ell \in \mathbf {Y} } . === Prediction === The class assigned to an observation x → {\displaystyle {\vec {x}}} is y ^ = arg ⁡ min ℓ ∈ Y ‖ μ → ℓ − x → ‖ {\displaystyle {\hat {y}}={\arg \min }_{\ell \in \mathbf {Y} }\|{\vec {\mu }}_{\ell }-{\vec {x}}\|} .

    Read more →
  • Polynomial kernel

    Polynomial kernel

    In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models. Intuitively, the polynomial kernel looks not only at the given features of input samples to determine their similarity, but also combinations of these. In the context of regression analysis, such combinations are known as interaction features. The (implicit) feature space of a polynomial kernel is equivalent to that of polynomial regression, but without the combinatorial blowup in the number of parameters to be learned. When the input features are binary-valued (booleans), then the features correspond to logical conjunctions of input features. == Definition == For degree-d polynomials, the polynomial kernel is defined as K ( x , y ) = ( x T y + c ) d {\displaystyle K(\mathbf {x} ,\mathbf {y} )=(\mathbf {x} ^{\mathsf {T}}\mathbf {y} +c)^{d}} where x and y are vectors of size n in the input space, i.e. vectors of features computed from training or test samples and c ≥ 0 is a free parameter trading off the influence of higher-order versus lower-order terms in the polynomial. When c = 0, the kernel is called homogeneous. (A further generalized polykernel divides xTy by a user-specified scalar parameter a.) As a kernel, K corresponds to an inner product in a feature space based on some mapping φ: K ( x , y ) = ⟨ φ ( x ) , φ ( y ) ⟩ {\displaystyle K(\mathbf {x} ,\mathbf {y} )=\langle \varphi (\mathbf {x} ),\varphi (\mathbf {y} )\rangle } The nature of φ can be seen from an example. Let d = 2, so we get the special case of the quadratic kernel. After using the multinomial theorem (twice—the outermost application is the binomial theorem) and regrouping, K ( x , y ) = ( ∑ i = 1 n x i y i + c ) 2 = ∑ i = 1 n ( x i 2 ) ( y i 2 ) + ∑ i = 2 n ∑ j = 1 i − 1 ( 2 x i x j ) ( 2 y i y j ) + ∑ i = 1 n ( 2 c x i ) ( 2 c y i ) + c 2 {\displaystyle K(\mathbf {x} ,\mathbf {y} )=\left(\sum _{i=1}^{n}x_{i}y_{i}+c\right)^{2}=\sum _{i=1}^{n}\left(x_{i}^{2}\right)\left(y_{i}^{2}\right)+\sum _{i=2}^{n}\sum _{j=1}^{i-1}\left({\sqrt {2}}x_{i}x_{j}\right)\left({\sqrt {2}}y_{i}y_{j}\right)+\sum _{i=1}^{n}\left({\sqrt {2c}}x_{i}\right)\left({\sqrt {2c}}y_{i}\right)+c^{2}} From this it follows that the feature map is given by: φ ( x ) = ( x n 2 , … , x 1 2 , 2 x n x n − 1 , … , 2 x n x 1 , 2 x n − 1 x n − 2 , … , 2 x n − 1 x 1 , … , 2 x 2 x 1 , 2 c x n , … , 2 c x 1 , c ) {\displaystyle \varphi (x)=\left(x_{n}^{2},\ldots ,x_{1}^{2},{\sqrt {2}}x_{n}x_{n-1},\ldots ,{\sqrt {2}}x_{n}x_{1},{\sqrt {2}}x_{n-1}x_{n-2},\ldots ,{\sqrt {2}}x_{n-1}x_{1},\ldots ,{\sqrt {2}}x_{2}x_{1},{\sqrt {2c}}x_{n},\ldots ,{\sqrt {2c}}x_{1},c\right)} generalizing for ( x T y + c ) d {\displaystyle \left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}} , where x ∈ R n {\displaystyle \mathbf {x} \in \mathbb {R} ^{n}} , y ∈ R n {\displaystyle \mathbf {y} \in \mathbb {R} ^{n}} and applying the multinomial theorem: ( x T y + c ) d = ∑ j 1 + j 2 + ⋯ + j n + 1 = d d ! j 1 ! ⋯ j n ! j n + 1 ! x 1 j 1 ⋯ x n j n c j n + 1 d ! j 1 ! ⋯ j n ! j n + 1 ! y 1 j 1 ⋯ y n j n c j n + 1 = φ ( x ) T φ ( y ) {\displaystyle {\begin{alignedat}{2}\left(\mathbf {x} ^{T}\mathbf {y} +c\right)^{d}&=\sum _{j_{1}+j_{2}+\dots +j_{n+1}=d}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}{\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}y_{1}^{j_{1}}\cdots y_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\\&=\varphi (\mathbf {x} )^{T}\varphi (\mathbf {y} )\end{alignedat}}} The last summation has l d = ( n + d d ) {\displaystyle l_{d}={\tbinom {n+d}{d}}} elements, so that: φ ( x ) = ( a 1 , … , a l , … , a l d ) {\displaystyle \varphi (\mathbf {x} )=\left(a_{1},\dots ,a_{l},\dots ,a_{l_{d}}\right)} where l = ( j 1 , j 2 , . . . , j n , j n + 1 ) {\displaystyle l=(j_{1},j_{2},...,j_{n},j_{n+1})} and a l = d ! j 1 ! ⋯ j n ! j n + 1 ! x 1 j 1 ⋯ x n j n c j n + 1 | j 1 + j 2 + ⋯ + j n + j n + 1 = d {\displaystyle a_{l}={\frac {\sqrt {d!}}{\sqrt {j_{1}!\cdots j_{n}!j_{n+1}!}}}x_{1}^{j_{1}}\cdots x_{n}^{j_{n}}{\sqrt {c}}^{j_{n+1}}\quad |\quad j_{1}+j_{2}+\dots +j_{n}+j_{n+1}=d} == Practical use == Although the RBF kernel is more popular in SVM classification than the polynomial kernel, the latter is quite popular in natural language processing (NLP). The most common degree is d = 2 (quadratic), since larger degrees tend to overfit on NLP problems. Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including: full expansion of the kernel prior to training/testing with a linear SVM, i.e. full computation of the mapping φ as in polynomial regression; basket mining (using a variant of the apriori algorithm) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion; inverted indexing of support vectors. One problem with the polynomial kernel is that it may suffer from numerical instability: when xTy + c < 1, K(x, y) = (xTy + c)d tends to zero with increasing d, whereas when xTy + c > 1, K(x, y) tends to infinity.

    Read more →
  • Medoid

    Medoid

    Medoids are representative objects of a data set or a cluster within a data set whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images, 3-D trajectories and gene expression (where while the data is sparse the medoid need not be). These are also of interest while wanting to find a representative using some distance other than squared euclidean distance (for instance in movie-ratings). For some data sets there may be more than one medoid, as with medians. A common application of the medoid is the k-medoids clustering algorithm, which is similar to the k-means algorithm but works when a mean or centroid is not definable. This algorithm basically works as follows. First, a set of medoids is chosen at random. Second, the distances to the other points are computed. Third, data are clustered according to the medoid they are most similar to. Fourth, the medoid set is optimized via an iterative process. Note that a medoid is not equivalent to a median, a geometric median, or centroid. A median is only defined on 1-dimensional data, and it only minimizes dissimilarity to other points for metrics induced by a norm (such as the Manhattan distance or Euclidean distance). A geometric median is defined in any dimension, but unlike a medoid, it is not necessarily a point from within the original dataset. == Definition == Let X := { x 1 , x 2 , … , x n } {\textstyle {\mathcal {X}}:=\{x_{1},x_{2},\dots ,x_{n}\}} be a set of n {\textstyle n} points in a space with a distance function d. Medoid is defined as x medoid = arg ⁡ min y ∈ X ∑ i = 1 n d ( y , x i ) . {\displaystyle x_{\text{medoid}}=\arg \min _{y\in {\mathcal {X}}}\sum _{i=1}^{n}d(y,x_{i}).} == Clustering with medoids == Medoids are a popular replacement for the cluster mean when the distance function is not (squared) Euclidean distance, or not even a metric (as the medoid does not require the triangle inequality). When partitioning the data set into clusters, the medoid of each cluster can be used as a representative of each cluster. Clustering algorithms based on the idea of medoids include: Partitioning Around Medoids (PAM), the standard k-medoids algorithm Hierarchical Clustering Around Medoids (HACAM), which uses medoids in hierarchical clustering == Algorithms to compute the medoid of a set == From the definition above, it is clear that the medoid of a set X {\displaystyle {\mathcal {X}}} can be computed after computing all pairwise distances between points in the ensemble. This would take O ( n 2 ) {\textstyle O(n^{2})} distance evaluations (with n = | X | {\displaystyle n=|{\mathcal {X}}|} ). In the worst case, one can not compute the medoid with fewer distance evaluations. However, there are many approaches that allow us to compute medoids either exactly or approximately in sub-quadratic time under different statistical models. If the points lie on the real line, computing the medoid reduces to computing the median which can be done in O ( n ) {\textstyle O(n)} by Quick-select algorithm of Hoare. However, in higher dimensional real spaces, no linear-time algorithm is known. RAND is an algorithm that estimates the average distance of each point to all the other points by sampling a random subset of other points. It takes a total of O ( n log ⁡ n ϵ 2 ) {\textstyle O\left({\frac {n\log n}{\epsilon ^{2}}}\right)} distance computations to approximate the medoid within a factor of ( 1 + ϵ Δ ) {\textstyle (1+\epsilon \Delta )} with high probability, where Δ {\textstyle \Delta } is the maximum distance between two points in the ensemble. Note that RAND is an approximation algorithm, and moreover Δ {\textstyle \Delta } may not be known apriori. RAND was leveraged by TOPRANK which uses the estimates obtained by RAND to focus on a small subset of candidate points, evaluates the average distance of these points exactly, and picks the minimum of those. TOPRANK needs O ( n 5 3 log 4 3 ⁡ n ) {\textstyle O(n^{\frac {5}{3}}\log ^{\frac {4}{3}}n)} distance computations to find the exact medoid with high probability under a distributional assumption on the average distances. trimed presents an algorithm to find the medoid with O ( n 3 2 2 Θ ( d ) ) {\textstyle O(n^{\frac {3}{2}}2^{\Theta (d)})} distance evaluations under a distributional assumption on the points. The algorithm uses the triangle inequality to cut down the search space. Meddit leverages a connection of the medoid computation with multi-armed bandits and uses an upper-Confidence-bound type of algorithm to get an algorithm which takes O ( n log ⁡ n ) {\textstyle O(n\log n)} distance evaluations under statistical assumptions on the points. Correlated Sequential Halving also leverages multi-armed bandit techniques, improving upon Meddit. By exploiting the correlation structure in the problem, the algorithm is able to provably yield drastic improvement (usually around 1-2 orders of magnitude) in both number of distance computations needed and wall clock time. == Implementations == An implementation of RAND, TOPRANK, and trimed can be found here. An implementation of Meddit can be found here and here. An implementation of Correlated Sequential Halving can be found here. == Medoids in text and natural language processing (NLP) == Medoids can be applied to various text and NLP tasks to improve the efficiency and accuracy of analyses. By clustering text data based on similarity, medoids can help identify representative examples within the dataset, leading to better understanding and interpretation of the data. === Text clustering === Text clustering is the process of grouping similar text or documents together based on their content. Medoid-based clustering algorithms can be employed to partition large amounts of text into clusters, with each cluster represented by a medoid document. This technique helps in organizing, summarizing, and retrieving information from large collections of documents, such as in search engines, social media analytics and recommendation systems. === Text summarization === Text summarization aims to produce a concise and coherent summary of a larger text by extracting the most important and relevant information. Medoid-based clustering can be used to identify the most representative sentences in a document or a group of documents, which can then be combined to create a summary. This approach is especially useful for extractive summarization tasks, where the goal is to generate a summary by selecting the most relevant sentences from the original text. === Sentiment analysis === Sentiment analysis involves determining the sentiment or emotion expressed in a piece of text, such as positive, negative, or neutral. Medoid-based clustering can be applied to group text data based on similar sentiment patterns. By analyzing the medoid of each cluster, researchers can gain insights into the predominant sentiment of the cluster, helping in tasks such as opinion mining, customer feedback analysis, and social media monitoring. === Topic modeling === Topic modeling is a technique used to discover abstract topics that occur in a collection of documents. Medoid-based clustering can be applied to group documents with similar themes or topics. By analyzing the medoids of these clusters, researchers can gain an understanding of the underlying topics in the text corpus, facilitating tasks such as document categorization, trend analysis, and content recommendation. === Techniques for measuring text similarity in medoid-based clustering === When applying medoid-based clustering to text data, it is essential to choose an appropriate similarity measure to compare documents effectively. Each technique has its advantages and limitations, and the choice of the similarity measure should be based on the specific requirements and characteristics of the text data being analyzed. The following are common techniques for measuring text similarity in medoid-based clustering: ==== Cosine similarity ==== Cosine similarity is a widely used measure to compare the similarity between two pieces of text. It calculates the cosine of the angle between two document vectors in a high-dimensional space. Cosine similarity ranges between -1 and 1, where a value closer to 1 indicates higher similarity, and a value closer to -1 indicates lower similarity. By visualizing two lines originating from the origin and extending to the respective points of interest, and then measuring the angle between these lines, one can determine the similarity between the associated points. Cosine similarity is less affected by document length, so it may be better at producing medoids that are representative of the content of a cluster instead of the lengt

    Read more →
  • Apache Parquet

    Apache Parquet

    Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem inspired by Google Dremel interactive ad-hoc query system for analysis of read-only nested data. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides data compression and encoding schemes with enhanced performance to handle complex data in bulk. == History == The open-source project to build Apache Parquet began as a joint effort between Twitter and Cloudera using the record shredding and assembly algorithm as described in Google's Dremel. Parquet was designed as an improvement on the Trevni columnar storage format created by Doug Cutting, the creator of Hadoop. The name 'parquet' (lit. 'small compartment') refers to a style of decorative flooring and was chosen to "evoke the bottom layer of a database with an interesting layout". The first version, Apache Parquet 1.0, was released in July 2013. Since April 27, 2015, Apache Parquet has been a top-level Apache Software Foundation (ASF)-sponsored project. == Features == Apache Parquet is implemented using the record-shredding and assembly algorithm, which accommodates the complex data structures that can be used to store data. The values in each column are stored in contiguous memory locations, providing the following benefits: Column-wise compression is efficient in storage space Encoding and compression techniques specific to the type of data in each column can be used Queries that fetch specific column values need not read the entire row, thus improving performance Apache Parquet is implemented using the Apache Thrift framework, which increases its flexibility; it can work with a number of programming languages like C++, Java, Python, PHP, etc. As of August 2015, Parquet supports the big-data-processing frameworks including Apache Hive, Apache Drill, Apache Impala, Apache Crunch, Apache Pig, Cascading, Presto and Apache Spark. It is one of the external data formats used by the pandas Python data manipulation and analysis library. == Compression and encoding == In Parquet, compression is performed column by column, which enables different encoding schemes to be used for text and integer data. This strategy also keeps the door open for newer and better encoding schemes to be implemented as they are invented. Parquet supports various compression formats: snappy, gzip, LZO, brotli, zstd, and LZ4. === Dictionary encoding === Parquet has an automatic dictionary encoding enabled dynamically for data with a small number of unique values (i.e. below 105) that enables significant compression and boosts processing speed. === Bit packing === Storage of integers is usually done with dedicated 32 or 64 bits per integer. For small integers, packing multiple integers into the same space makes storage more efficient. === Run-length encoding (RLE) === To optimize storage of multiple occurrences of the same value, run-length encoding is used, which is where a single value is stored once along with the number of occurrences. Parquet implements a hybrid of bit packing and RLE, in which the encoding switches based on which produces the best compression results. This strategy works well for certain types of integer data and combines well with dictionary encoding. == Cloud Storage and Data Lakes == Parquet is widely used as the underlying file format in modern cloud-based data lake architectures. Cloud storage systems such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage commonly store data in Parquet format due to its efficient columnar representation and retrieval capabilities. Data lakehouse frameworks—including Apache Iceberg, Delta Lake, and Apache Hudi —build an additional metadata layer on top of Parquet files to support features such as schema evolution, time-travel queries, and ACID-compliant transactions. In these architectures, Parquet files serve as the immutable storage layer while the table formats manage data versioning and transactional integrity. == Comparison == Apache Parquet is comparable to RCFile and Optimized Row Columnar (ORC) file formats — all three fall under the category of columnar data storage within the Hadoop ecosystem. They all have better compression and encoding with improved read performance at the cost of slower writes. In addition to these features, Apache Parquet supports limited schema evolution, i.e., the schema can be modified according to the changes in the data. It also provides the ability to add new columns and merge schemas that do not conflict. Apache Arrow is designed as an in-memory complement to on-disk columnar formats like Parquet and ORC. The Arrow and Parquet projects include libraries that allow for reading and writing between the two formats. == Implementations == Known implementations of Parquet include:

    Read more →
  • Markov blanket

    Markov blanket

    In statistics and machine learning, a Markov blanket of a random variable is a set of variables that renders the variable conditionally independent of all other variables in the system. This concept is central in probabilistic graphical models and feature selection. If a Markov blanket is minimal—meaning that no variable in it can be removed without losing this conditional independence—it is called a Markov boundary. Identifying a Markov blanket or boundary allows for efficient inference and helps isolate relevant variables for prediction or causal reasoning. The terms Markov blanket and Markov boundary were coined by Judea Pearl in 1988. A Markov blanket may be derived from the structure of a probabilistic graphical model such as a Bayesian network or Markov random field. == Definition == A Markov blanket of a random variable Y {\displaystyle Y} in a random variable set S = { X 1 , … , X n } {\displaystyle {\mathcal {S}}=\{X_{1},\ldots ,X_{n}\}} is any subset S 1 {\displaystyle {\mathcal {S}}_{1}} of S {\displaystyle {\mathcal {S}}} , conditioned on which other variables are independent with Y {\displaystyle Y} : Y ⊥ ⊥ S ∖ S 1 ∣ S 1 {\displaystyle Y\perp \!\!\!\perp {\mathcal {S}}\smallsetminus {\mathcal {S}}_{1}\mid {\mathcal {S}}_{1}} It means that S 1 {\displaystyle {\mathcal {S}}_{1}} contains at least all the information one needs to infer Y {\displaystyle Y} , where the variables in S ∖ S 1 {\displaystyle {\mathcal {S}}\smallsetminus {\mathcal {S}}_{1}} are redundant. In general, a given Markov blanket is not unique. Any set in S {\displaystyle {\mathcal {S}}} that contains a Markov blanket is also a Markov blanket itself. Specifically, S {\displaystyle {\mathcal {S}}} is a Markov blanket of Y {\displaystyle Y} in S {\displaystyle {\mathcal {S}}} . === Example === In a Bayesian network, the Markov blanket of a node consists of its parents, its children, and its children's other parents (i.e., co-parents). Knowing the values of these nodes makes the target node conditionally independent of the rest of the network. In a Markov random field, the Markov blanket of a node is simply its immediate neighbors. == Markov condition == The concept of a Markov blanket is rooted in the Markov condition, which states that in a probabilistic graphical model, each variable is conditionally independent of its non-descendants given its parents. This condition implies the existence of a minimal separating set — the Markov blanket — that shields a variable from the rest of the network. For instance, when a person holds an object stationary against gravity, the object’s acceleration is fully determined by its direct causes—namely, the upward force from the hand and the downward gravitational pull. Other variables such as air pressure or temperature are causally irrelevant. == Markov boundary == A Markov boundary of Y {\displaystyle Y} in S {\displaystyle {\mathcal {S}}} is a subset S 2 {\displaystyle {\mathcal {S}}_{2}} of S {\displaystyle {\mathcal {S}}} , such that S 2 {\displaystyle {\mathcal {S}}_{2}} itself is a Markov blanket of Y {\displaystyle Y} , but any proper subset of S 2 {\displaystyle {\mathcal {S}}_{2}} is not a Markov blanket of Y {\displaystyle Y} . In other words, a Markov boundary is a minimal Markov blanket. The Markov boundary of a node A {\displaystyle A} in a Bayesian network is the set of nodes composed of A {\displaystyle A} 's parents, A {\displaystyle A} 's children, and A {\displaystyle A} 's children's other parents. In a Markov random field, the Markov boundary for a node is the set of its neighboring nodes. In a dependency network, the Markov boundary for a node is the set of its parents. === Uniqueness of Markov boundary === The Markov boundary always exists. Under some mild conditions, the Markov boundary is unique. However, for most practical and theoretical scenarios multiple Markov boundaries may provide alternative solutions. When there are multiple Markov boundaries, quantities measuring causal effect could fail. == In cognitive science == In the study of consciousness, brain function, and complex adaptive systems, Markov blankets are proposed as a mathematical mechanism which delimits the extent of cognitive entities, whether it be physical or causal.

    Read more →
  • Concept class

    Concept class

    In computational learning theory in mathematics, a concept over a domain X is a total Boolean function over X. A concept class is a class of concepts. Concept classes are a subject of computational learning theory. Concept class terminology frequently appears in model theory associated with probably approximately correct (PAC) learning. In this setting, if one takes a set Y as a set of (classifier output) labels, and X is a set of examples, the map c : X → Y {\displaystyle c:X\to Y} , i.e. from examples to classifier labels (where Y = { 0 , 1 } {\displaystyle Y=\{0,1\}} and where c is a subset of X), c is then said to be a concept. A concept class C {\displaystyle C} is then a collection of such concepts. Given a class of concepts C, a subclass D is reachable if there exists a sample s such that D contains exactly those concepts in C that are extensions to s. Not every subclass is reachable. == Background == A sample s {\displaystyle s} is a partial function from X {\displaystyle X} to { 0 , 1 } {\displaystyle \{0,1\}} . Identifying a concept with its characteristic function mapping X {\displaystyle X} to { 0 , 1 } {\displaystyle \{0,1\}} , it is a special case of a sample. Two samples are consistent if they agree on the intersection of their domains. A sample s ′ {\displaystyle s'} extends another sample s {\displaystyle s} if the two are consistent and the domain of s {\displaystyle s} is contained in the domain of s ′ {\displaystyle s'} . == Examples == Suppose that C = S + ( X ) {\displaystyle C=S^{+}(X)} . Then: the subclass { { x } } {\displaystyle \{\{x\}\}} is reachable with the sample s = { ( x , 1 ) } {\displaystyle s=\{(x,1)\}} ; the subclass S + ( Y ) {\displaystyle S^{+}(Y)} for Y ⊆ X {\displaystyle Y\subseteq X} are reachable with a sample that maps the elements of X − Y {\displaystyle X-Y} to zero; the subclass S ( X ) {\displaystyle S(X)} , which consists of the singleton sets, is not reachable. == Applications == Let C {\displaystyle C} be some concept class. For any concept c ∈ C {\displaystyle c\in C} , we call this concept 1 / d {\displaystyle 1/d} -good for a positive integer d {\displaystyle d} if, for all x ∈ X {\displaystyle x\in X} , at least 1 / d {\displaystyle 1/d} of the concepts in C {\displaystyle C} agree with c {\displaystyle c} on the classification of x {\displaystyle x} . The fingerprint dimension F D ( C ) {\displaystyle FD(C)} of the entire concept class C {\displaystyle C} is the least positive integer d {\displaystyle d} such that every reachable subclass C ′ ⊆ C {\displaystyle C'\subseteq C} contains a concept that is 1 / d {\displaystyle 1/d} -good for it. This quantity can be used to bound the minimum number of equivalence queries needed to learn a class of concepts according to the following inequality: F D ( C ) − 1 ≤ # E Q ( C ) ≤ ⌈ F D ( C ) ln ⁡ ( | C | ) ⌉ {\textstyle FD(C)-1\leq \#EQ(C)\leq \lceil FD(C)\ln(|C|)\rceil } .

    Read more →