Sammon mapping or Sammon projection is an algorithm that maps a high-dimensional space to a space of lower dimensionality (see multidimensional scaling) by trying to preserve the structure of inter-point distances in high-dimensional space in the lower-dimension projection. It is particularly suited for use in exploratory data analysis. The method was proposed by John W. Sammon in 1969. It is considered a non-linear approach as the mapping cannot be represented as a linear combination of the original variables as possible in techniques such as principal component analysis, which also makes it more difficult to use for classification applications. Denote the distance between ith and jth objects in the original space by d i j ∗ {\displaystyle \scriptstyle d_{ij}^{}} , and the distance between their projections by d i j {\displaystyle \scriptstyle d_{ij}^{}} . Sammon's mapping aims to minimize the following error function, which is often referred to as Sammon's stress or Sammon's error: E = 1 ∑ i < j d i j ∗ ∑ i < j ( d i j ∗ − d i j ) 2 d i j ∗ . {\displaystyle E={\frac {1}{\sum \limits _{i Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents. An information retrieval technique using latent semantic structure was patented in 1988 by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter. In the context of its application to information retrieval, it is sometimes called latent semantic indexing (LSI). == Overview == === Occurrence matrix === LSA can use a document-term matrix which describes the occurrences of terms in documents; it is a sparse matrix whose rows correspond to terms and whose columns correspond to documents. A typical example of the weighting of the elements of the matrix is tf-idf (term frequency–inverse document frequency): the weight of an element of the matrix is proportional to the number of times the terms appear in each document, where rare terms are upweighted to reflect their relative importance. This matrix is also common to standard semantic models, though it is not necessarily explicitly expressed as a matrix, since the mathematical properties of matrices are not always used. === Rank lowering === After the construction of the occurrence matrix, LSA finds a low-rank approximation to the term-document matrix. There could be various reasons for these approximations: The original term-document matrix is presumed too large for the computing resources; in this case, the approximated low rank matrix is interpreted as an approximation (a "least and necessary evil"). The original term-document matrix is presumed noisy: for example, anecdotal instances of terms are to be eliminated. From this point of view, the approximated matrix is interpreted as a de-noisified matrix (a better matrix than the original). The original term-document matrix is presumed overly sparse relative to the "true" term-document matrix. That is, the original matrix lists only the words actually in each document, whereas we might be interested in all words related to each document—generally a much larger set due to synonymy. The consequence of the rank lowering is that some dimensions are combined and depend on more than one term: {(car), (truck), (flower)} → {(1.3452 car + 0.2828 truck), (flower)} This mitigates the problem of identifying synonymy, as the rank lowering is expected to merge the dimensions associated with terms that have similar meanings. It also partially mitigates the problem with polysemy, since components of polysemous words that point in the "right" direction are added to the components of words that share a similar meaning. Conversely, components that point in other directions tend to either simply cancel out, or, at worst, to be smaller than components in the directions corresponding to the intended sense. === Derivation === Let X {\displaystyle X} be a matrix where element ( i , j ) {\displaystyle (i,j)} describes the occurrence of term i {\displaystyle i} in document j {\displaystyle j} (this can be, for example, the frequency). X {\displaystyle X} will look like this: d j ↓ t i T → [ x 1 , 1 … x 1 , j … x 1 , n ⋮ ⋱ ⋮ ⋱ ⋮ x i , 1 … x i , j … x i , n ⋮ ⋱ ⋮ ⋱ ⋮ x m , 1 … x m , j … x m , n ] {\displaystyle {\begin{matrix}&{\textbf {d}}_{j}\\&\downarrow \\{\textbf {t}}_{i}^{T}\rightarrow &{\begin{bmatrix}x_{1,1}&\dots &x_{1,j}&\dots &x_{1,n}\\\vdots &\ddots &\vdots &\ddots &\vdots \\x_{i,1}&\dots &x_{i,j}&\dots &x_{i,n}\\\vdots &\ddots &\vdots &\ddots &\vdots \\x_{m,1}&\dots &x_{m,j}&\dots &x_{m,n}\\\end{bmatrix}}\end{matrix}}} Now a row in this matrix will be a vector corresponding to a term, giving its relation to each document: t i T = [ x i , 1 … x i , j … x i , n ] {\displaystyle {\textbf {t}}_{i}^{T}={\begin{bmatrix}x_{i,1}&\dots &x_{i,j}&\dots &x_{i,n}\end{bmatrix}}} Likewise, a column in this matrix will be a vector corresponding to a document, giving its relation to each term: d j = [ x 1 , j ⋮ x i , j ⋮ x m , j ] {\displaystyle {\textbf {d}}_{j}={\begin{bmatrix}x_{1,j}\\\vdots \\x_{i,j}\\\vdots \\x_{m,j}\\\end{bmatrix}}} Now the dot product t i T t p {\displaystyle {\textbf {t}}_{i}^{T}{\textbf {t}}_{p}} between two term vectors gives the correlation between the terms over the set of documents. The matrix product X X T {\displaystyle XX^{T}} contains all these dot products. Element ( i , p ) {\displaystyle (i,p)} (which is equal to element ( p , i ) {\displaystyle (p,i)} ) contains the dot product t i T t p {\displaystyle {\textbf {t}}_{i}^{T}{\textbf {t}}_{p}} ( = t p T t i {\displaystyle ={\textbf {t}}_{p}^{T}{\textbf {t}}_{i}} ). Likewise, the matrix X T X {\displaystyle X^{T}X} contains the dot products between all the document vectors, giving their correlation over the terms: d j T d q = d q T d j {\displaystyle {\textbf {d}}_{j}^{T}{\textbf {d}}_{q}={\textbf {d}}_{q}^{T}{\textbf {d}}_{j}} . Now, from the theory of linear algebra, there exists a decomposition of X {\displaystyle X} such that U {\displaystyle U} and V {\displaystyle V} are orthogonal matrices and Σ {\displaystyle \Sigma } is a diagonal matrix. This is called a singular value decomposition (SVD): X = U Σ V T {\displaystyle {\begin{matrix}X=U\Sigma V^{T}\end{matrix}}} The matrix products giving us the term and document correlations then become X X T = ( U Σ V T ) ( U Σ V T ) T = ( U Σ V T ) ( V T T Σ T U T ) = U Σ V T V Σ T U T = U Σ Σ T U T X T X = ( U Σ V T ) T ( U Σ V T ) = ( V T T Σ T U T ) ( U Σ V T ) = V Σ T U T U Σ V T = V Σ T Σ V T {\displaystyle {\begin{matrix}XX^{T}&=&(U\Sigma V^{T})(U\Sigma V^{T})^{T}=(U\Sigma V^{T})(V^{T^{T}}\Sigma ^{T}U^{T})=U\Sigma V^{T}V\Sigma ^{T}U^{T}=U\Sigma \Sigma ^{T}U^{T}\\X^{T}X&=&(U\Sigma V^{T})^{T}(U\Sigma V^{T})=(V^{T^{T}}\Sigma ^{T}U^{T})(U\Sigma V^{T})=V\Sigma ^{T}U^{T}U\Sigma V^{T}=V\Sigma ^{T}\Sigma V^{T}\end{matrix}}} Since Σ Σ T {\displaystyle \Sigma \Sigma ^{T}} and Σ T Σ {\displaystyle \Sigma ^{T}\Sigma } are diagonal we see that U {\displaystyle U} must contain the eigenvectors of X X T {\displaystyle XX^{T}} , while V {\displaystyle V} must be the eigenvectors of X T X {\displaystyle X^{T}X} . Both products have the same non-zero eigenvalues, given by the non-zero entries of Σ Σ T {\displaystyle \Sigma \Sigma ^{T}} , or equally, by the non-zero entries of Σ T Σ {\displaystyle \Sigma ^{T}\Sigma } . Now the decomposition looks like this: X U Σ V T ( d j ) ( d ^ j ) ↓ ↓ ( t i T ) → [ x 1 , 1 … x 1 , j … x 1 , n ⋮ ⋱ ⋮ ⋱ ⋮ x i , 1 … x i , j … x i , n ⋮ ⋱ ⋮ ⋱ ⋮ x m , 1 … x m , j … x m , n ] = ( t ^ i T ) → [ [ u 1 ] … [ u l ] ] ⋅ [ σ 1 … 0 ⋮ ⋱ ⋮ 0 … σ l ] ⋅ [ [ v 1 ] ⋮ [ v l ] ] {\displaystyle {\begin{matrix}&X&&&U&&\Sigma &&V^{T}\\&({\textbf {d}}_{j})&&&&&&&({\hat {\textbf {d}}}_{j})\\&\downarrow &&&&&&&\downarrow \\({\textbf {t}}_{i}^{T})\rightarrow &{\begin{bmatrix}x_{1,1}&\dots &x_{1,j}&\dots &x_{1,n}\\\vdots &\ddots &\vdots &\ddots &\vdots \\x_{i,1}&\dots &x_{i,j}&\dots &x_{i,n}\\\vdots &\ddots &\vdots &\ddots &\vdots \\x_{m,1}&\dots &x_{m,j}&\dots &x_{m,n}\\\end{bmatrix}}&=&({\hat {\textbf {t}}}_{i}^{T})\rightarrow &{\begin{bmatrix}{\begin{bmatrix}\,\\\,\\{\textbf {u}}_{1}\\\,\\\,\end{bmatrix}}\dots {\begin{bmatrix}\,\\\,\\{\textbf {u}}_{l}\\\,\\\,\end{bmatrix}}\end{bmatrix}}&\cdot &{\begin{bmatrix}\sigma _{1}&\dots &0\\\vdots &\ddots &\vdots \\0&\dots &\sigma _{l}\\\end{bmatrix}}&\cdot &{\begin{bmatrix}{\begin{bmatrix}&&{\textbf {v}}_{1}&&\end{bmatrix}}\\\vdots \\{\begin{bmatrix}&&{\textbf {v}}_{l}&&\end{bmatrix}}\end{bmatrix}}\end{matrix}}} The values σ 1 , … , σ l {\displaystyle \sigma _{1},\dots ,\sigma _{l}} are called the singular values, and u 1 , … , u l {\displaystyle u_{1},\dots ,u_{l}} and v 1 , … , v l {\displaystyle v_{1},\dots ,v_{l}} the left and right singular vectors. Notice the only part of U {\displaystyle U} that contributes to t i {\displaystyle {\textbf {t}}_{i}} is the i 'th {\displaystyle i{\textrm {'th}}} row. Let this row vector be called t ^ i T {\displaystyle {\hat {\textrm {t}}}_{i}^{T}} . Likewise, the only part of V T {\displaystyle V^{T}} that contributes to d j {\displaystyle {\textbf {d}}_{j}} is the j 'th {\displaystyle j{\textrm {'th}}} column, d ^ j {\displaystyle {\hat {\textrm {d}}}_{j}} . These are not the eigenvectors, but depend on all the eigenvectors. I Linear discriminant analysis (LDA), normal discriminant analysis (NDA), canonical variates analysis (CVA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features that characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification. LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. However, ANOVA uses categorical independent variables and a continuous dependent variable, whereas discriminant analysis has continuous independent variables and a categorical dependent variable (i.e. the class label). Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables. These other methods are preferable in applications where it is not reasonable to assume that the independent variables have a normal distribution, which is a fundamental assumption of the LDA method. LDA is also closely related to principal component analysis (PCA) and factor analysis in that they both look for linear combinations of variables which best explain the data. LDA explicitly attempts to model the difference between the classes of data. PCA, in contrast, does not take into account any difference in class, and factor analysis builds the feature combinations based on similarities rather than differences. Discriminant analysis is also different from factor analysis in that it is not an interdependence technique: a distinction between independent variables and dependent variables (also called criterion variables) must be made. LDA works when the measurements made on independent variables for each observation are continuous quantities. When dealing with categorical independent variables, the equivalent technique is discriminant correspondence analysis. Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each case must have a score on one or more quantitative predictor measures, and a score on a group measure. In simple terms, discriminant function analysis is classification - the act of distributing things into groups, classes or categories of the same type. == History == The original dichotomous discriminant analysis was developed by Sir Ronald Fisher in 1936. It is different from an ANOVA or MANOVA, which is used to predict one (ANOVA) or multiple (MANOVA) continuous dependent variables by one or more independent categorical variables. Discriminant function analysis is useful in determining whether a set of variables is effective in predicting category membership. == LDA for two classes == Consider a set of observations x → {\displaystyle {\vec {x}}} (also called features, attributes, variables or measurements) for each sample of an object or event with known class y {\displaystyle y} . This set of samples is called the training set in a supervised learning context. The classification problem is then to find a good predictor for the class y {\displaystyle y} of any sample of the same distribution (not necessarily from the training set) given only an observation x → {\displaystyle {\vec {x}}} . LDA approaches the problem by assuming that the conditional probability density functions p ( x → | y = 0 ) {\displaystyle p({\vec {x}}|y=0)} and p ( x → | y = 1 ) {\displaystyle p({\vec {x}}|y=1)} are both the normal distribution with mean and covariance parameters ( μ → 0 , Σ 0 ) {\displaystyle \left({\vec {\mu }}_{0},\Sigma _{0}\right)} and ( μ → 1 , Σ 1 ) {\displaystyle \left({\vec {\mu }}_{1},\Sigma _{1}\right)} , respectively. Under this assumption, the Bayes-optimal solution is to predict points as being from the second class if the log of the likelihood ratios is bigger than some threshold T, so that: 1 2 ( x → − μ → 0 ) T Σ 0 − 1 ( x → − μ → 0 ) + 1 2 ln | Σ 0 | − 1 2 ( x → − μ → 1 ) T Σ 1 − 1 ( x → − μ → 1 ) − 1 2 ln | Σ 1 | > T {\displaystyle {\frac {1}{2}}({\vec {x}}-{\vec {\mu }}_{0})^{\mathrm {T} }\Sigma _{0}^{-1}({\vec {x}}-{\vec {\mu }}_{0})+{\frac {1}{2}}\ln |\Sigma _{0}|-{\frac {1}{2}}({\vec {x}}-{\vec {\mu }}_{1})^{\mathrm {T} }\Sigma _{1}^{-1}({\vec {x}}-{\vec {\mu }}_{1})-{\frac {1}{2}}\ln |\Sigma _{1}|\ >\ T} Without any further assumptions, the resulting classifier is referred to as quadratic discriminant analysis (QDA). LDA instead makes the additional simplifying homoscedasticity assumption (i.e. that the class covariances are identical, so Σ 0 = Σ 1 = Σ {\displaystyle \Sigma _{0}=\Sigma _{1}=\Sigma } ) and that the covariances have full rank. In this case, several terms cancel: x → T Σ 0 − 1 x → = x → T Σ 1 − 1 x → {\displaystyle {\vec {x}}^{\mathrm {T} }\Sigma _{0}^{-1}{\vec {x}}={\vec {x}}^{\mathrm {T} }\Sigma _{1}^{-1}{\vec {x}}} x → T Σ i − 1 μ → i = μ → i T Σ i − 1 x → {\displaystyle {\vec {x}}^{\mathrm {T} }{\Sigma _{i}}^{-1}{\vec {\mu }}_{i}={{\vec {\mu }}_{i}}^{\mathrm {T} }{\Sigma _{i}}^{-1}{\vec {x}}} because both sides are scalar and transpose to each other ( Σ i {\displaystyle \Sigma _{i}} is Hermitian) and the above decision criterion becomes a threshold on the dot product w → T x → > c {\displaystyle {\vec {w}}^{\mathrm {T} }{\vec {x}}>c} for some threshold constant c, where w → = Σ − 1 ( μ → 1 − μ → 0 ) {\displaystyle {\vec {w}}=\Sigma ^{-1}({\vec {\mu }}_{1}-{\vec {\mu }}_{0})} c = 1 2 w → T ( μ → 1 + μ → 0 ) {\displaystyle c={\frac {1}{2}}\,{\vec {w}}^{\mathrm {T} }({\vec {\mu }}_{1}+{\vec {\mu }}_{0})} This means that the criterion of an input x → {\displaystyle {\vec {x}}} being in a class y {\displaystyle y} is purely a function of this linear combination of the known observations. It is often useful to see this conclusion in geometrical terms: the criterion of an input x → {\displaystyle {\vec {x}}} being in a class y {\displaystyle y} is purely a function of projection of multidimensional-space point x → {\displaystyle {\vec {x}}} onto vector w → {\displaystyle {\vec {w}}} (thus, we only consider its direction). In other words, the observation belongs to y {\displaystyle y} if corresponding x → {\displaystyle {\vec {x}}} is located on a certain side of a hyperplane perpendicular to w → {\displaystyle {\vec {w}}} . The location of the plane is defined by the threshold c {\displaystyle c} . == Assumptions == The assumptions of discriminant analysis are the same as those for MANOVA. The analysis is quite sensitive to outliers and the size of the smallest group must be larger than the number of predictor variables. Multivariate normality: Independent variables are normal for each level of the grouping variable. Homogeneity of variance/covariance (homoscedasticity): Variances among group variables are the same across levels of predictors. Can be tested with Box's M statistic. It has been suggested, however, that linear discriminant analysis be used when covariances are equal, and that quadratic discriminant analysis may be used when covariances are not equal. Independence: Participants are assumed to be randomly sampled, and a participant's score on one variable is assumed to be independent of scores on that variable for all other participants. It has been suggested that discriminant analysis is relatively robust to slight violations of these assumptions, and it has also been shown that discriminant analysis may still be reliable when using dichotomous variables (where multivariate normality is often violated). == Discriminant functions == Discriminant analysis works by creating one or more linear combinations of predictors, creating a new latent variable for each function. These functions are called discriminant functions. The number of functions possible is either N g − 1 {\displaystyle N_{g}-1} where N g {\displaystyle N_{g}} = number of groups, or p {\displaystyle p} (the number of predictors), whichever is smaller. The first function created maximizes the differences between groups on that function. The second function maximizes differences on that function, but also must not be correlated with the previous function. This continues with subsequent functions with the requirement that the new function not be correlated with any of the previous functions. Given group j {\displaystyle j} , with R j {\displaystyle \mathbb {R} _{j}} sets of sample space, there is a discriminant rule such that if x ∈ R j {\displaystyle x\in \mathbb {R} _{j}} , then x ∈ j {\displaystyle x\in j} . Discriminant analysis then, finds “good” regions of R j {\displaystyle \mathbb {R} _{j}} to minimize classification error, therefore leading to a high percent correct classified in the classification table. Each function is given a discriminant score to determine how well it predicts group placement. Structure Corr Optical character recognition (OCR) or optical character reader is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example: from a television broadcast). Widely used as a form of data entry from printed paper data records – whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printed data, or any suitable documentation – it is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of accuracy for most fonts are now common, and with support for a variety of image file format inputs. Some systems are capable of reproducing formatted output that closely approximates the original page including images, columns, and other non-textual components. == History == Early optical character recognition may be traced to technologies involving telegraphy and creating reading devices for the blind. In 1914, Emanuel Goldberg developed a machine that read characters and converted them into standard telegraph code. Concurrently, Edmund Fournier d'Albe developed the Optophone, a handheld scanner that when moved across a printed page, produced tones that corresponded to specific letters or characters. In the late 1920s and into the 1930s, Emanuel Goldberg developed what he called a "Statistical Machine" for searching microfilm archives using an optical code recognition system. In 1931, he was granted US Patent number 1,838,389 for the invention. The patent was acquired by IBM. === Visually impaired users === In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and continued development of omni-font OCR, which could recognize text printed in virtually any font. (Kurzweil is often credited with inventing omni-font OCR, but it was in use by companies, including CompuScan, in the late 1960s and 1970s.) Kurzweil used the technology to create a reading machine for blind people to have a computer read text to them out loud. The device included a CCD-type flatbed scanner and a text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind. In 1978, Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers, and bought the program to upload legal paper and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which eventually spun it off as Scansoft, which merged with Nuance Communications. In the 2000s, OCR was made available online as a service (WebOCR), in a cloud computing environment, and in mobile applications like real-time translation of foreign-language signs on a smartphone. With the advent of smartphones and smartglasses, OCR can be used in internet connected mobile device applications that extract text captured using the device's camera. These devices that do not have built-in OCR functionality will typically use an OCR API to extract the text from the image file captured by the device. The OCR API returns the extracted text, along with information about the location of the detected text in the original image back to the device app for further processing (such as text-to-speech) or display. Various commercial and open source OCR systems are available for most common writing systems, including Latin, Cyrillic, Arabic, Hebrew, Indic, Bengali (Bangla), Devanagari, Tamil, Chinese, Japanese, and Korean characters. == Applications == OCR engines have been developed into software applications specializing in various subjects such as receipts, invoices, checks, and legal billing documents. The software can be used for: Entering data for business documents, e.g. checks, passports, invoices, bank statements and receipts Automatic number-plate recognition Passport recognition and information extraction in airports Automatically extracting key information from insurance documents Traffic-sign recognition Extracting business card information into a contact list Creating textual versions of printed documents, e.g. book scanning for Project Gutenberg Making electronic images of printed documents searchable, e.g. Google Books Converting handwriting in real-time to control a computer (pen computing) Defeating or testing the robustness of CAPTCHA anti-bot systems, though these are specifically designed to prevent OCR. Assistive technology for blind and visually impaired users Writing instructions for vehicles by identifying CAD images in a database that are appropriate to the vehicle design as it changes in real time Making scanned documents searchable by converting them to PDFs == Types == Optical character recognition (OCR) – targets typewritten text, one glyph or character at a time. Optical word recognition – targets typewritten text, one word at a time (for languages that use a space as a word divider). Usually just called "OCR". Intelligent character recognition (ICR) – also targets handwritten printscript or cursive text one glyph or character at a time, usually involving machine learning. Intelligent word recognition (IWR) – also targets handwritten printscript or cursive text, one word at a time. This is especially useful for languages where glyphs are not separated in cursive script. OCR is generally an offline process, which analyses a static document. There are cloud based services which provide an online OCR API service. Handwriting movement analysis can be used as input to handwriting recognition. Instead of merely using the shapes of glyphs and words, this technique is able to capture motion, such as the order in which segments are drawn, the direction, and the pattern of putting the pen down and lifting it. This additional information can make the process more accurate. This technology is also known as "online character recognition", "dynamic character recognition", "real-time character recognition", and "intelligent character recognition". == Techniques == === Pre-processing === OCR software often pre-processes images to improve the chances of successful recognition. Techniques include: De-skewing – if the document was not aligned properly when scanned, it may need to be tilted a few degrees clockwise or counterclockwise in order to make lines of text perfectly horizontal or vertical. Despeckling – removal of positive and negative spots, smoothing edges Binarization – conversion of an image from color or greyscale to black-and-white (called a binary image because there are two colors). The task is performed as a simple way of separating the text (or any other desired image component) from the background. The task of binarization is necessary since most commercial recognition algorithms work only on binary images, as it is simpler to do so. In addition, the effectiveness of binarization influences to a significant extent the quality of character recognition, and careful decisions are made in the choice of the binarization employed for a given input image type; since the quality of the method used to obtain the binary result depends on the type of image (scanned document, scene text image, degraded historical document, etc.). Line removal – Cleaning up non-glyph boxes and lines Layout analysis or zoning – Identification of columns, paragraphs, captions, etc. as distinct blocks. Especially important in multi-column layouts and tables. Line and word detection – Establishment of a baseline for word and character shapes, separating words as necessary. Script recognition – In multilingual documents, the script may change at the level of the words and hence, identification of the script is necessary, before the right OCR can be invoked to handle the specific script. Character isolation or segmentation – For per-character OCR, multiple characters that are connected due to image artifacts must be separated; single characters that are broken into multiple pieces due to artifacts must be connected. Normalization of aspect ratio and scale Segmentation of fixed-pitch fonts is accomplished relatively simply by aligning the image to a uniform grid based on where vertical grid lines will least often intersect black areas. For proportional fonts, more sophisticated techniques are needed because whitespace bet Neural gas is an artificial neural network, inspired by the self-organizing map and introduced in 1991 by Thomas Martinetz and Klaus Schulten. The neural gas is a simple algorithm for finding optimal data representations based on feature vectors. The algorithm was coined "neural gas" because of the dynamics of the feature vectors during the adaptation process, which distribute themselves like a gas within the data space. It is applied where data compression or vector quantization is an issue, for example speech recognition, image processing or pattern recognition. As a robustly converging alternative to the k-means clustering it is also used for cluster analysis. == Algorithm == Suppose we want to model a probability distribution P ( x ) {\displaystyle P(x)} of data vectors x {\displaystyle x} using a finite number of feature vectors w i {\displaystyle w_{i}} , where i = 1 , ⋯ , N {\displaystyle i=1,\cdots ,N} . For each time step t {\displaystyle t} Sample data vector x {\displaystyle x} from P ( x ) {\displaystyle P(x)} Compute the distance between x {\displaystyle x} and each feature vector. Rank the distances. Let i 0 {\displaystyle i_{0}} be the index of the closest feature vector, i 1 {\displaystyle i_{1}} the index of the second closest feature vector, and so on. Update each feature vector by: w i k t + 1 = w i k t + ε ⋅ e − k / λ ⋅ ( x − w i k t ) , k = 0 , ⋯ , N − 1 {\displaystyle w_{i_{k}}^{t+1}=w_{i_{k}}^{t}+\varepsilon \cdot e^{-k/\lambda }\cdot (x-w_{i_{k}}^{t}),k=0,\cdots ,N-1} In the algorithm, ε {\displaystyle \varepsilon } can be understood as the learning rate, and λ {\displaystyle \lambda } as the neighborhood range. ε {\displaystyle \varepsilon } and λ {\displaystyle \lambda } are reduced with increasing t {\displaystyle t} so that the algorithm converges after many adaptation steps. The adaptation step of the neural gas can be interpreted as gradient descent on a cost function. By adapting not only the closest feature vector but all of them with a step size decreasing with increasing distance order, compared to (online) k-means clustering a much more robust convergence of the algorithm can be achieved. The neural gas model does not delete a node and also does not create new nodes. === Comparison with SOM === Compared to self-organized map, the neural gas model does not assume that some vectors are neighbors. If two vectors happen to be close together, they would tend to move together, and if two vectors happen to be apart, they would tend to not move together. In contrast, in an SOM, if two vectors are neighbors in the underlying graph, then they will always tend to move together, no matter whether the two vectors happen to be neighbors in the Euclidean space. The name "neural gas" is because one can imagine it to be what an SOM would be like if there is no underlying graph, and all points are free to move without the bonds that bind them together. == Variants == A number of variants of the neural gas algorithm exists in the literature so as to mitigate some of its shortcomings. More notable is perhaps Bernd Fritzke's growing neural gas, but also one should mention further elaborations such as the Growing When Required network and also the incremental growing neural gas. A performance-oriented approach that avoids the risk of overfitting is the Plastic Neural gas model. === Growing neural gas === Fritzke describes the growing neural gas (GNG) as an incremental network model that learns topological relations by using a "Hebb-like learning rule", only, unlike the neural gas, it has no parameters that change over time and it is capable of continuous learning, i.e. learning on data streams. GNG has been widely used in several domains, demonstrating its capabilities for clustering data incrementally. The GNG is initialized with two randomly positioned nodes which are initially connected with a zero age edge and whose errors are set to 0. Since in the GNG input data is presented sequentially one by one, the following steps are followed at each iteration: It is calculating the errors (distances) between the two closest nodes to the current input data. The error of the winner node (only the closest one) is respectively accumulated. The winner node and its topological neighbors (connected by an edge) are moving towards the current input by different fractions of their respective errors. The age of all edges connected to the winner node are incremented. If the winner node and the second-winner are connected by an edge, such an edge is set to 0. Else, an edge is created between them. If there are edges with an age larger than a threshold, they are removed. Nodes without connections are eliminated. If the current iteration is an integer multiple of a predefined frequency-creation threshold, a new node is inserted between the node with the largest error (among all) and its topological neighbor presenting the highest error. The link between the former and the latter nodes is eliminated (their errors are decreased by a given factor) and the new node is connected to both of them. The error of the new node is initialized as the updated error of the node which had the largest error (among all). The accumulated error of all nodes is decreased by a given factor. If the stopping criterion is not met, the algorithm takes a following input. The criterion might be a given number of epochs, i.e., a pre-set number of times where all data is presented, or the reach of a maximum number of nodes. === Incremental growing neural gas === Another neural gas variant inspired by the GNG algorithm is the incremental growing neural gas (IGNG). The authors propose the main advantage of this algorithm to be "learning new data (plasticity) without degrading the previously trained network and forgetting the old input data (stability)." === Growing when required === Having a network with a growing set of nodes, like the one implemented by the GNG algorithm was seen as a great advantage, however some limitation on the learning was seen by the introduction of the parameter λ, in which the network would only be able to grow when iterations were a multiple of this parameter. The proposal to mitigate this problem was a new algorithm, the Growing When Required network (GWR), which would have the network grow more quickly, by adding nodes as quickly as possible whenever the network identified that the existing nodes would not describe the input well enough. === Plastic neural gas === The ability to only grow a network may quickly introduce overfitting; on the other hand, removing nodes on the basis of age only, as in the GNG model, does not ensure that the removed nodes are actually useless, because removal depends on a model parameter that should be carefully tuned to the "memory length" of the stream of input data. The "Plastic Neural Gas" model solves this problem by making decisions to add or remove nodes using an unsupervised version of cross-validation, which controls an equivalent notion of "generalization ability" for the unsupervised setting. While growing-only methods only cater for the incremental learning scenario, the ability to grow and shrink is suited to the more general streaming data problem. == Implementations == To find the ranking i 0 , i 1 , … , i N − 1 {\displaystyle i_{0},i_{1},\ldots ,i_{N-1}} of the feature vectors, the neural gas algorithm involves sorting, which is a procedure that does not lend itself easily to parallelization or implementation in analog hardware. However, implementations in both parallel software and analog hardware were actually designed. Aarogya Setu (lit. 'The bridge to health') is an Indian COVID-19 "contact tracing, syndromic mapping and self-assessment" digital service, primarily a mobile app, developed by the National Informatics Centre under the Ministry of Electronics and Information Technology (MeitY). The app reached more than 100 million installs in 40 days. On 26 May, amid growing privacy and security concerns, the source code of the app was made public. == Full view == The stated purpose of this app is to spread awareness of COVID-19 and to connect essential COVID-19-related health services to the people of India. This app augments the initiatives of the Department of Health to contain COVID-19 and shares best practices and advisories. It is a tracking app which uses the smartphone's GPS and Bluetooth features to track COVID-19 cases. The app is available for Android and iOS mobile operating systems. With Bluetooth, it tries to determine the risk if one has been near (within six feet of) a COVID-19-infected person, by scanning through a database of known cases across India. Using location information, it determines whether the location one is in belongs to one of the infected areas based on the data available. This app is an updated version of an earlier app called Corona Kavach (now discontinued) which was released earlier by the Government of India. == Features and tools == Aarogya Setu has four sections: User Status (tells the risk of getting COVID-19 for the user) Self Assess (helps the users identify COVID-19 symptoms and their risk profile) COVID-19 Updates (gives updates on local and national COVID-19 cases) E-pass integration (if applied for E-pass, it will be available) See Recent Contacts option (allows the users to assess the risk level of their Bluetooth contacts) It tells how many COVID-19 positive cases are likely in a radius of 500 m, 1 km, 2 km, 5 km and 10 km from the user. The app is built on a platform that can provide an application programming interface (API) so that other computer programs, mobile applications, and web services can make use of the features and data available in Aarogya Setu. == Response == Aarogya Setu crossed five million downloads within three days of its launch, making it one of the most popular government apps in India. It became the world's fastest-growing mobile app, beating Pokémon Go, with more than 50 million installs 13 days after launching in India on 2 April 2020. It reached 100 million installs by 13 May 2020, that is in 40 days since its launch. In an order on 29 April 2020 the central government made it mandatory for all employees to download the app and use it – "Before starting for office, they must review their status on Aarogya Setu and commute only when the app shows safe or low risk". The Union Home Ministry also said that the application is mandatory for all living in the COVID-19 containment zone. The government gave the announcement along with the nationwide lockdown extension by two weeks from the 4 May with certain relaxations. On 21 May 2020, the Airport Authority of India issued a Standard Operating Procedure (SOP) stating that all departing passengers must compulsorily be registered with the Aarogya Setu app. It added that the app would not be mandatory for children below 14 years. However, the next day, Civil Aviation Minister Hardeep Singh Puri clarified that the app would not be mandatory for any passengers. On 26 May 2020, the Aarogya Setu app code was made open to developers across the globe to help other countries manage contact tracing in their fight against COVID-19 pandemic. In March 2021, Co-WIN portal was integrated with the app. This allowed users to schedule an appointment through the app for COVID-19 vaccine by registering their phone number and providing relevant documents. == Effectiveness == NITI Aayog CEO revealed that "the app has been able to identify more than 3,000 hotspots in 3–17 days ahead of time." However, users and experts in India and around the world say the app raises huge data security concerns. The app collects name, number, gender, travel history, and uses a phone's Bluetooth and location data to let users know if they have been near a person with COVID-19 by scanning a database of known cases of infection, and also share it with the government simultaneously. This is the major area of concern as the app's constant access to a phone's Bluetooth imposes a form of security threat. But it stood to clarify itself that the informations received are not going to be made public. Amidst all these, the app hits a record of about one-hundred million downloads. == Reception == Rahul Gandhi, leader of the Congress party, termed the Aarogya Setu application a "sophisticated surveillance system" after the government announced that downloading the app would be mandatory for both government and private employees. Following this, others raised the same concerns about the Aarogya Setu app. The Ministry of Electronics and Information Technology (MeitY) responded to these concerns by asserting that Gandhi's claims were false, and that the app was being appreciated internationally. On 5 May, French ethical hacker Robert Baptiste, who goes by the name Elliot Alderson on Twitter, claimed that there were security issues with the app. The Indian government, as well as the app developers, responded to this claim by thanking the hacker for his attention, but dismissed his concerns. The developers of the app stated that the fetching of location data is a documented feature of the app, rather than a flaw, since the app is designed to track the distribution of the virus-infected population. They also asserted that no personal information of any user has been proven to be at risk. On 6 May, Robert Baptiste tweeted that security vulnerabilities in Aarogya Setu allowed hackers to "know who is infected, unwell, [or] made a self assessment in the area of his choice". He also gave details of how many people were unwell and infected at the Prime Minister's Office, the Indian Parliament and the Home Office. The Economic Times pointed out that a clause in the app's Terms and Conditions stated that the user "agrees and acknowledges that the Government of India will not be liable for ... any unauthorised access to your information or modification thereof". In response, several software developers called for the source code to be made public. On 12 May, former Supreme Court Judge Justice B.N. Srikrishna termed the government's push mandating the use of Aarogya Setu app "utterly illegal". He said so far it is not backed by any law and questioned "under what law, government is mandating it on anyone". MIT Technology Review gave 2 out of 5 stars to Aarogya Setu app after analyzing the COVID contact tracing apps launched in 25 countries. The app got stars only for the policy which suggests that data collected is deleted after a period of time and that the data collection, as far as user inputs go, is minimal. It also highlighted that India is the only democracy making its app mandatory for millions of people. The rating was further downgraded from 2 to 1 for collecting more information than the app needs to function. Following this, the MeitY made the source code of the Android app public on GitHub on 26 May, which will be followed by iOS and API documentation. Further, the Government has also launched a "bug bounty program". This was done to "promote transparency and ensure security and integrity of the app". However, experts stated that the server-side code had not yet been publicly released, which meant that public opinion on security and privacy was yet to be completely assuaged. Following this, ZDNet noted that the source code seemed to confirm the government's claim that user location data, if collected, would be anonymised and would be deleted after 45 days, or 60 days for high-risk individuals. Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data are linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified. The principal components of a collection of points in a real coordinate space are a sequence of p {\displaystyle p} unit vectors, where the i {\displaystyle i} -th vector is the direction of a line that best fits the data while being orthogonal to the first i − 1 {\displaystyle i-1} vectors. Here, a best-fitting line is defined as one that minimizes the average squared perpendicular distance from the points to the line. These directions (i.e., principal components) constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Many studies use the first two principal components in order to plot the data in two dimensions and to visually identify clusters of closely related data points. Principal component analysis has applications in many fields such as population genetics, microbiome studies, and atmospheric science. == Overview == When performing PCA, the first principal component of a set of p {\displaystyle p} variables is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through p {\displaystyle p} iterations until all the variance is explained. PCA is most commonly used when many of the variables are highly correlated with each other and it is desirable to reduce their number to an independent set. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i {\displaystyle i} -th principal component can be taken as a direction orthogonal to the first i − 1 {\displaystyle i-1} principal components that maximizes the variance of the projected data. For either objective, it can be shown that the principal components are eigenvectors of the data's covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset. Robust and L1-norm-based variants of standard PCA have also been proposed. == History == PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, proper orthogonal decomposition (POD) in mechanical engineering, singular value decomposition (SVD) of X (invented in the last quarter of the 19th century), eigenvalue decomposition (EVD) of XTX in linear algebra, factor analysis (for a discussion of the differences between PCA and factor analysis see Ch. 7 of Jolliffe's Principal Component Analysis), Eckart–Young theorem (Harman, 1960), or empirical orthogonal functions (EOF) in meteorological science (Lorenz, 1956), empirical eigenfunction decomposition (Sirovich, 1987), quasiharmonic modes (Brooks et al., 1988), spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics. == Intuition == PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small. To find the axes of the ellipsoid, we must first center the values of each variable in the dataset on 0 by subtracting the mean of the variable's observed values from each of those values. These transformed values are used instead of the original observed values for each of the variables. Then, we compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we must normalize each of the orthogonal eigenvectors to turn them into unit vectors. Once this is done, each of the mutually-orthogonal unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This choice of basis will transform the covariance matrix into a diagonalized form, in which the diagonal elements represent the variance of each axis. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues. Biplots and scree plots (degree of explained variance) are used to interpret findings of the PCA. == Details == PCA is defined as an orthogonal linear transformation on a real inner product space that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. Consider an n × p {\displaystyle n\times p} data matrix, X, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of feature (say, the results from a particular sensor). Mathematically, the transformation is defined by a set of size l {\displaystyle l} (where l {\displaystyle l} is usually selected to be strictly less than p {\displaystyle p} to reduce dimensionality) of p {\displaystyle p} -dimensional vectors of weights or coefficients w ( k ) = ( w 1 , … , w p ) ( k ) {\displaystyle \mathbf {w} _{(k)}=(w_{1},\dots ,w_{p})_{(k)}} that map each row vector x ( i ) = ( x 1 , … , x p ) ( i ) {\displaystyle \mathbf {x} _{(i)}=(x_{1},\dots ,x_{p})_{(i)}} of X to a new vector of principal component scores t ( i ) = ( t 1 , … , t l ) ( i ) {\displaystyle \mathbf {t} _{(i)}=(t_{1},\dots ,t_{l})_{(i)}} , given by t k ( i ) = x ( i ) ⋅ w ( k ) f o r i = 1 , … , n k = 1 , … , l {\displaystyle {t_{k}}_{(i)}=\mathbf {x} _{(i)}\cdot \mathbf {w} _{(k)}\qquad \mathrm {for} \qquad i=1,\dots ,n\qquad k=1,\dots ,l} in such a way that the individual variables t 1 , … , t l {\displaystyle t_{1},\dots ,t_{l}} of t considered over the data set successively inherit the maximum possible variance from X, with each coefficient vector w constrained to be a unit vector. The above may equivalently be written in matrix form as T = X W {\displaystyle \mathbf {T} =\mathbf {X} \mathbf {W} } where T i k = t k ( i ) {\displaystyle {\mathbf {T} }_{ik}={t_{k}}_{(i)}} , X i j = x j ( i ) {\displaystyle {\mathbf {X} }_{ij}={x_{j}}_{(i)}} , and W j k = w j ( k ) {\displaystyle {\mathbf {W} }_{jk}={w_{j}}_{(k)}} . === First component === In order to maximize variance, the first weight vector w(1) thus has to satisfy w ( 1 ) = arg max ‖ w ‖ = 1 { ∑ i ( t 1 ) ( i ) 2 } = arg max ‖ w ‖ = 1 { ∑ i ( x ( i ) ⋅ w ) 2 } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}(t_{1})_{(i)}^{2}\right\}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}\left(\mathbf {x} _{(i)}\cdot \mathbf {w} \right)^{2}\right\}} Equivalently, writing this in matrix form gives w ( 1 ) = arg max ‖ w ‖ = 1 { ‖ X w ‖ 2 } = arg max ‖ w ‖ = 1 { w T X T X w } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\left\|\mathbf {Xw} \right\|^{2}\right\}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} \right\}} Since w(1) has been defined to be a unit vector, it equivalently also satisfies w ( 1 ) = arg max { w T X T X w w T w } {\displaystyle \mathbf {w} _{(1)}=\arg \max \left\{{\frac {\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} }{\mathbf {w} ^{\mathsf {T}}\mathbf {w} }}\right\}} The quantity to be maximised can be recognised as a Rayleigh quotient. A standard result for a positive semidefinite matrix such as XTX is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector. With w(1) found, the first principal component of a data vectorLatent semantic analysis
Linear discriminant analysis
Optical character recognition
Neural gas
Aarogya Setu
Principal component analysis