AI Chatbot Q

AI Chatbot Q — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Screen generator

    Screen generator

    A screen generator, also known as a screen painter, screen mapper, or forms generator is a software package (or component thereof) which enables data entry screens to be generated declaratively, by "painting" them on the screen WYSIWYG-style, or through filling-in forms, rather than requiring writing of code to display them manually. 4GLs commonly incorporate a screen generator feature. They are also commonly found bundled with database systems, especially entry-level databases. A screen generator is one aspect of an application generator, which can also include other functions such as report generation and a data dictionary. The earliest screen generators were character-based; by the 1990s, GUI support became common, and then support for generating HTML forms as well. Some screen generators work by generating code to display the screen in a high-level language (for example, COBOL); others store the screen definition in a data file or in database tables, and then have a runtime component responsible for actually displaying the form and receiving and validating user input. == Examples == Examples of screen generators include: IBM Screen Definition Facility II: generates screens for CICS BMS, IMS MFS, ISPF, GDDM and CSP/AD. Performix for Informix. Microsoft Visual Basic the forms component of Microsoft Access Oracle Developer, in particular its Oracle Forms component the QDesign component of PowerHouse SystemBuilder/SB+ the Screen Painter component of SAP's ABAP Workbench the FoxView component of FoxPro. FoxView was originally developed by Luis Castro as a dBASE screen generator named ViewGen; Fox purchased it and bundled it with FoxPro 1.0. Later, Fox replaced Castro's code with their own screen painter code. dBASE included a built-in screen generator in dBASE IV onwards; in dBASE III and earlier, third party screen generators were available, including the already mentioned ViewGen DPS 1100 for UNIVAC 1100 series mainframes.

    Read more →
  • Ground truth

    Ground truth

    Ground truth is information that is known to be real or true, provided by direct observation and measurement (i.e. empirical evidence) as opposed to information provided by inference. The term ground truth appeared in remote sensing literature as early as 1972, when NASA described it as essential "data about ... materials on the earth's surface" used to calibrate measurements. It was later adopted by the statistical modeling and machine learning communities. == Etymology == The Oxford English Dictionary (s.v. ground truth) records the use of the word Groundtruth in the sense of 'fundamental truth' from Henry Ellison's poem "The Siberian Exile's Tale", published in 1833. == Usage == The term "ground truth" can be used as a noun, adjective, and verb. Noun: "ground truth" (no hyphen). Example: "The ground truth is essential for training accurate models." Adjective: "ground-truth" (hyphenated compound adjective). Example: "We need to use ground-truth data to validate the model." Verb: "to ground-truth" or "to groundtruth" (compound verb,). Example: "We need to ground-truth the results to ensure their accuracy." == Statistics and machine learning == In statistics and machine learning, ground truth is the ideal expected result, used in statistical models to prove or disprove research hypotheses. "Ground truthing" is the process of gathering the good data for this test. Ground truth is typically included in labeled data. In machine learning, "ground truth" is not necessarily objectively correct or true. For example, in training AI models or relevance rankers, it may be a set of judgments made by people or inferred from user behavior, which may depend on context. For example, in Bayesian spam filtering, a supervised learning system is typically trained by examples labeled as spam and non-spam. Although these labels may be subjective or inaccurate, they are considered ground truth. True ground truth in machine learning is objective data. For example, suppose we are testing a stereo vision system to see how well it can estimate 3D positions. A calibrated laser rangefinder may provide accurate distances as ground truth. == Remote sensing == In remote sensing, "ground truth" refers to information collected at the imaged location. Ground truth allows image data to be related to real features and materials on the ground. The collection of ground truth data enables calibration of remote-sensing data, and aids in the interpretation and analysis of what is being sensed. Examples include cartography, meteorology, analysis of aerial photographs, satellite imagery and other techniques in which data are gathered at a distance. More specifically, ground truth may refer to a process in which "pixels" on a satellite image are compared to what is imaged (at the time of capture) in order to verify the contents of the "pixels" in the image (noting that the concept of "pixel" is imaging-system-dependent). In the case of a classified image, supervised classification can help to determine the accuracy of the classification by the remote sensing system which can minimize error in the classification. Ground truth is usually done on site, correlating what is known with surface observations and measurements of various properties of the features of the ground resolution cells under study in the remotely sensed digital image. The process also involves taking geographic coordinates of the ground resolution cell with GPS technology and comparing those with the coordinates of the "pixel" being studied provided by the remote sensing software to understand and analyze the location errors and how it may affect a particular study. Ground truth is important in the initial supervised classification of an image. When the identity and location of land cover types are known through a combination of field work, maps, and personal experience these areas are known as training sites. The spectral characteristics of these areas are used to train the remote sensing software using decision rules for classifying the rest of the image. These decision rules such as Maximum Likelihood Classification, Parallelopiped Classification, and Minimum Distance Classification offer different techniques to classify an image. Additional ground truth sites allow the remote sensor to establish an error matrix that validates the accuracy of the classification method used. Different classification methods may have different percentages of error for a given classification project. It is important that the remote sensor chooses a classification method that works best with the number of classifications used while providing the least amount of error. Ground truth also helps with atmospheric correction. Since images from satellites have to pass through the atmosphere, they can get distorted because of absorption in the atmosphere. So ground truth can help fully identify objects in satellite photos. === Errors of commission === An example of an error of commission is when a pixel reports the presence of a feature (such a tree) that, in reality, is absent (no tree is actually present). Ground truthing ensures that the error matrices have a higher accuracy percentage than would be the case if no pixels were ground-truthed. This value is the complement of the user's accuracy, i.e. Commission Error = 1 - user's accuracy. === Errors of omission === An example of an error of omission is when pixels of a certain type, for example, maple trees, are not classified as maple trees. The process of ground-truthing helps to ensure that the pixel is classified correctly and the error matrices are more accurate. This value is the complement of the producer's accuracy, i.e. Omission Error = 1 - producer's accuracy == Geographical information systems == In GIS the spatial data is modeled as field (like in remote sensing raster images) or as object (like in vectorial map representation). They are modeled from the real world (also named geographical reality), typically by a cartographic process (illustrated). Geographic information systems such as GIS, GPS, and GNSS, have become so widespread that the term "ground truth" has taken on special meaning in that context. If the location coordinates returned by a location method such as GPS are an estimate of a location, then the "ground truth" is the actual location on Earth. A smart phone might return a set of estimated location coordinates such as 43.87870, −103.45901. The ground truth being estimated by those coordinates is the tip of George Washington's nose on Mount Rushmore. The accuracy of the estimate is the maximum distance between the location coordinates and the ground truth. We could say in this case that the estimate accuracy is 10 meters, meaning that the point on Earth represented by the location coordinates is thought to be within 10 meters of George's nose—the ground truth. In slang, the coordinates indicate where we think George Washington's nose is located, and the ground truth is where it really is. In practice a smart phone or hand-held GPS unit is routinely able to estimate the ground truth within 6–10 meters. Specialized instruments can reduce GPS measurement error to under a centimeter. == Military usage == US military slang uses "ground truth" to refer to the facts comprising a tactical situation—as opposed to intelligence reports, mission plans, and other descriptions reflecting the conative or policy-based projections of the industrial·military complex. The term appears in the title of the Iraq War documentary film The Ground Truth (2006), and also in military publications, for example Stars and Stripes saying: "Stripes decided to figure out what the ground truth was in Iraq."

    Read more →
  • Multiple correspondence analysis

    Multiple correspondence analysis

    In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA can be viewed as an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables. == As an extension of correspondence analysis == MCA is performed by applying the CA algorithm to either an indicator matrix (also called complete disjunctive table – CDT) or a Burt table formed from these variables. An indicator matrix is an individuals × variables matrix, where the rows represent individuals and the columns are dummy variables representing categories of the variables. Analyzing the indicator matrix allows the direct representation of individuals as points in geometric space. The Burt table is the symmetric matrix of all two-way cross-tabulations between the categorical variables, and has an analogy to the covariance matrix of continuous variables. Analyzing the Burt table is a more natural generalization of simple correspondence analysis, and individuals or the means of groups of individuals can be added as supplementary points to the graphical display. In the indicator matrix approach, associations between variables are uncovered by calculating the chi-square distance between different categories of the variables and between the individuals (or respondents). These associations are then represented graphically as "maps", which eases the interpretation of the structures in the data. Oppositions between rows and columns are then maximized, in order to uncover the underlying dimensions best able to describe the central oppositions in the data. As in factor analysis or principal component analysis, the first axis is the most important dimension, the second axis the second most important, and so on, in terms of the amount of variance accounted for. The number of axes to be retained for analysis is determined by calculating modified eigenvalues. == Details == Since MCA is adapted to draw statistical conclusions from categorical variables (such as multiple choice questions), the first thing one needs to do is to transform quantitative data (such as age, size, weight, day time, etc) into categories (using for instance statistical quantiles). When the dataset is completely represented as categorical variables, one is able to build the corresponding so-called complete disjunctive table. We denote this table X {\displaystyle X} . If I {\displaystyle I} persons answered a survey with J {\displaystyle J} multiple choices questions with 4 answers each, X {\displaystyle X} will have I {\displaystyle I} rows and 4 J {\displaystyle 4J} columns. More theoretically, assume X {\displaystyle X} is the completely disjunctive table of I {\displaystyle I} observations of K {\displaystyle K} categorical variables. Assume also that the k {\displaystyle k} -th variable have J k {\displaystyle J_{k}} different levels (categories) and set J = ∑ k = 1 K J k {\displaystyle J=\sum _{k=1}^{K}J_{k}} . The table X {\displaystyle X} is then a I × J {\displaystyle I\times J} matrix with all coefficient being 0 {\displaystyle 0} or 1 {\displaystyle 1} . Set the sum of all entries of X {\displaystyle X} to be N {\displaystyle N} and introduce Z = X / N {\displaystyle Z=X/N} . In an MCA, there are also two special vectors: first r {\displaystyle r} , that contains the sums along the rows of Z {\displaystyle Z} , and c {\displaystyle c} , that contains the sums along the columns of Z {\displaystyle Z} . Note D r = diag ( r ) {\displaystyle D_{r}={\text{diag}}(r)} and D c = diag ( c ) {\displaystyle D_{c}={\text{diag}}(c)} , the diagonal matrices containing r {\displaystyle r} and c {\displaystyle c} respectively as diagonal. With these notations, computing an MCA consists essentially in the singular value decomposition of the matrix: M = D r − 1 / 2 ( Z − r c T ) D c − 1 / 2 {\displaystyle M=D_{r}^{-1/2}(Z-rc^{T})D_{c}^{-1/2}} The decomposition of M {\displaystyle M} gives you P {\displaystyle P} , Δ {\displaystyle \Delta } and Q {\displaystyle Q} such that M = P Δ Q T {\displaystyle M=P\Delta Q^{T}} with P, Q two unitary matrices and Δ {\displaystyle \Delta } is the generalized diagonal matrix of the singular values (with the same shape as Z {\displaystyle Z} ). The positive coefficients of Δ 2 {\displaystyle \Delta ^{2}} are the eigenvalues of Z {\displaystyle Z} . The interest of MCA comes from the way observations (rows) and variables (columns) in Z {\displaystyle Z} can be decomposed. This decomposition is called a factor decomposition. The coordinates of the observations in the factor space are given by F = D r − 1 / 2 P Δ {\displaystyle F=D_{r}^{-1/2}P\Delta } The i {\displaystyle i} -th rows of F {\displaystyle F} represent the i {\displaystyle i} -th observation in the factor space. And similarly, the coordinates of the variables (in the same factor space as observations!) are given by G = D c − 1 / 2 Q Δ {\displaystyle G=D_{c}^{-1/2}Q\Delta } == Recent works and extensions == In recent years, several students of Jean-Paul Benzécri have refined MCA and incorporated it into a more general framework of data analysis known as geometric data analysis. This involves the development of direct connections between simple correspondence analysis, principal component analysis and MCA with a form of cluster analysis known as Euclidean classification. Two extensions have great practical use. It is possible to include, as active elements in the MCA, several quantitative variables. This extension is called factor analysis of mixed data (see below). Very often, in questionnaires, the questions are structured in several issues. In the statistical analysis it is necessary to take into account this structure. This is the aim of multiple factor analysis which balances the different issues (i.e. the different groups of variables) within a global analysis and provides, beyond the classical results of factorial analysis (mainly graphics of individuals and of categories), several results (indicators and graphics) specific of the group structure. == Application fields == In the social sciences, MCA is arguably best known for its application by Pierre Bourdieu, notably in his books La Distinction, Homo Academicus and The State Nobility. Bourdieu argued that there was an internal link between his vision of the social as spatial and relational --– captured by the notion of field, and the geometric properties of MCA. Sociologists following Bourdieu's work most often opt for the analysis of the indicator matrix, rather than the Burt table, largely because of the central importance accorded to the analysis of the 'cloud of individuals'. == Multiple correspondence analysis and principal component analysis == MCA can also be viewed as a PCA applied to the complete disjunctive table. To do this, the CDT must be transformed as follows. Let y i k {\displaystyle y_{ik}} denote the general term of the CDT. y i k {\displaystyle y_{ik}} is equal to 1 if individual i {\displaystyle i} possesses the category k {\displaystyle k} and 0 if not. Let denote p k {\displaystyle p_{k}} , the proportion of individuals possessing the category k {\displaystyle k} . The transformed CDT (TCDT) has as general term: x i k = y i k / p k − 1 {\displaystyle x_{ik}=y_{ik}/p_{k}-1} The unstandardized PCA applied to TCDT, the column k {\displaystyle k} having the weight p k {\displaystyle p_{k}} , leads to the results of MCA. This equivalence is fully explained in a book by Jérôme Pagès. It plays an important theoretical role because it opens the way to the simultaneous treatment of quantitative and qualitative variables. Two methods simultaneously analyze these two types of variables: factor analysis of mixed data and, when the active variables are partitioned in several groups: multiple factor analysis. This equivalence does not mean that MCA is a particular case of PCA as it is not a particular case of CA. It only means that these methods are closely linked to one another, as they belong to the same family: the factorial methods. == Software == There are numerous software of data analysis that include MCA, such as STATA and SPSS. The R package FactoMineR also features MCA. This software is related to a book describing the basic methods for performing MCA . There is also a Python package for [1] which works with numpy array matrices; the package has not been implemented yet for Spark dataframes.

    Read more →
  • Ilastik

    Ilastik

    ilastik is free open source software for image classification and segmentation. No previous experience in image processing is required to run the software. Since 2018 ilastik is further developed and maintained by Anna Kreshuk's group at European Molecular Biology Laboratory. == Features == ilastik allows user to annotate an arbitrary number of classes in images with a mouse interface. Using these user annotations and the generic (nonlinear) image features, the user can train a random forest classifier. Trained ilastik classifiers can be applied new data not included in the training set in ilastik via its batch processing functionality, or without using the graphical user interface, in headless mode. ilastik can be integrated into various related tools: Pre-trained workflows can be executed directly from ImageJ/Fiji using the ilastik-ImageJ plugin. Pre-trained ilastik Pixel Classification workflows can be run directly in Python with the ilastik Python package, which is available via conda. ilastik has a CellProfiler module to use ilastik classifiers to process images within a CellProfiler framework. == History == ilastik was first released in 2011 by scientists at the Heidelberg Collaboratory for Image Processing (HCI), University of Heidelberg. == Application == The Interactive Learning and Segmentation Toolkit Carving Cell classification and neuron classification Synapse detection Cell tracking Neural Network Classification == Resources == ilastik project is hosted on GitHub. It is a collaborative project, any contributions such as comments, bug reports, bug fixes or code contributions are welcome. The ilastik team can be contacted for user support on the image.sc forum.

    Read more →
  • Videotex

    Videotex

    Videotex (or interactive videotex) was one of the earliest implementations of an end-user information system. From the late 1970s to early 2010s, it was used to deliver information (usually pages of text) to a user in computer-like format, typically to be displayed on a television or a dumb terminal. In a strict definition, videotex is any system that provides interactive content and displays it on a video monitor such as a television, typically using modems to send data in both directions. A close relative is teletext, which sends data in one direction only, typically encoded in a television signal. All such systems are occasionally referred to as viewdata. Unlike the modern Internet, traditional videotex services were highly centralized. Videotex in its broader definition can be used to refer to any such service, including teletext, the Internet, bulletin board systems, online service providers, and even the arrival/departure displays at an airport. This usage is no longer common. With the exception of Minitel in France, videotex elsewhere never managed to attract any more than a very small percentage of the universal mass market once envisaged. By the end of the 1980s its use was essentially limited to a few niche applications. == Initial development and technologies == === United Kingdom === The first attempts at a general-purpose videotex service were created in the United Kingdom in the late 1960s. In about 1970 the BBC had a brainstorming session in which it was decided to start researching ways to send closed captioning information to the audience. As the Teledata research continued the BBC became interested in using the system for delivering any sort of information, not just closed captioning. In 1972, the concept was first made public under the new name Ceefax. Meanwhile, the General Post Office (soon to become British Telecom) had been researching a similar concept since the late 1960s, known as Viewdata. Unlike Ceefax which was a one-way service carried in the existing TV signal, Viewdata was a two-way system using telephones. Since the Post Office owned the telephones, this was considered to be an excellent way to drive more customers to use the phones. Not to be outdone by the BBC, they also announced their service, under the name Prestel. ITV soon joined the fray with a Ceefax-clone known as ORACLE. In 1974, all the services agreed on a standard for displaying the information. The display would be a simple 40×24 grid of text, with some "graphics characters" for constructing simple graphics, revised and finalized in 1976. The standard did not define the delivery system, so both Viewdata-like and Teledata-like services could at least share the TV-side hardware, which was expensive at the time. The standard also introduced a new term that covered all such services, teletext. Ceefax first started operation in 1974 with a limited 30 pages, followed quickly by ORACLE and then Prestel in 1979. By 1981, Prestel International was available in nine countries, and a number of countries, including Sweden, The Netherlands, Finland and West Germany were developing their own national systems closely based on Prestel. General Telephone and Electronics (GTE) acquired an exclusive agency for the system for North America. In the early 1980s, videotex became the base technology for the London Stock Exchange's pricing service called TOPIC. Later versions of TOPIC, notably TOPIC2 and TOPIC3, were developed by Thanos Vassilakis and introduced trading and historic price feeds. === France === Development of a French teletext-like system began in 1973. A very simple 2-way videotex system called Tictac was also demonstrated in the mid-1970s. As in the UK, this led on to work to develop a common display standard for videotex and teletext, called Antiope, which was finalised in 1977. Antiope had similar capabilities to the UK system for displaying alphanumeric text and chunky "mosaic" character-based block graphics. A difference however was that while in the UK standard control codes automatically also occupied one character position on screen, Antiope allowed for "non spacing" control codes. This gave Antiope slightly more flexibility in the use of colours in mosaic block graphics, and in presenting the accents and diacritics of the French language. Meanwhile, spurred on by the 1978 Nora/Minc report, the French government was determined to catch up on a perceived falling behind in its computer and communications facilities. In 1980 it began field trials issuing Antiope-based terminals for free to over 250,000 telephone subscribers in Ille-et-Vilaine region, where the French CCETT research centre was based, for use as telephone directories. The trial was a success, and in 1982 Minitel was rolled out nationwide. === Canada === Since 1970, researchers at the Communications Research Centre (CRC) in Ottawa had been working on a set of "picture description instructions", which encoded graphics commands as a text stream. Graphics were encoded as a series of instructions (graphics primitives) each represented by a single ASCII character. Graphic coordinates were encoded in multiple 6 bit strings of XY coordinate data, flagged to place them in the printable ASCII range so that they could be transmitted with conventional text transmission techniques. ASCII SI/SO characters were used to differentiate the text from graphic portions of a transmitted "page". In 1975, the CRC gave a contract to Norpak to develop an interactive graphics terminal that could decode the instructions and display them on a colour display, which was successfully up and running by 1977. Against the background of the developments in Europe, CRC was able to persuade the Canadian government to develop the system into a fully-fledged service. In August 1978, the Canadian Department of Communications publicly launched it as Telidon, a "second generation" videotex/teletext service, and committed to a four-year development plan to encourage rollout. Compared to the European systems, Telidon offered real graphics, as opposed to block-mosaic character graphics. The downside was that it required much more advanced decoders, typically featuring Zilog Z80 or Motorola 6809 processors. === Japan === Research in Japan was shaped by the demands of the large number of Kanji characters used in Japanese script. With 1970s technology, the ability to generate so many characters on demand in the end-user's terminal was seen as prohibitive. Instead, development focussed on methods to send pages to user terminals pre-rendered, using coding strategies similar to facsimile machines. This led to a videotex system called Captain ("Character and Pattern Telephone Access Information Network"), created by NTT in 1978, which went into full trials from 1979 to 1981. The system also lent itself naturally to photographic images, albeit at only moderate resolution. However, the pages typically took two or three times longer to load, compared to the European systems. NHK developed an experimental teletext system along similar lines, called CIBS ("Character Information Broadcasting Station"). Based on a 388×200 pixel resolution, it was first announced in 1976, and began trials in late 1978. (NHK's ultimate production teletext system launched in 1983). == Standards == Work to establish an international standard for videotex began in 1978 in CCITT. But the national delegations showed little interest in compromise, each hoping that their system would come to define what was perceived to be going to be an enormous new mass-market. In 1980 CCITT therefore issued recommendation S.100 (later T.100), noting the points of similarity but the essential incompatibility of the systems, and declaring all four to be recognised options. Trying to kick-start the market, AT&T Corporation entered the fray, and in May 1981 announced its own Presentation Layer Protocol (PLP). This was closely based on the Canadian Telidon system, but added to it some further graphics primitives and a syntax for defining macros, algorithms to define cleaner pixel spacing for the (arbitrarily sizeable) text, and also dynamically redefinable characters and a mosaic block graphic character set, so that it could reproduce content from the French Antiope. After some further revisions this was adopted in 1983 as ANSI standard X3.110, more commonly called NAPLPS, the North American Presentation Layer Protocol Syntax. It was also adopted in 1988 as the presentation-layer syntax for NABTS, the North American Broadcast Teletext Specification. Meanwhile, the European national Postal Telephone and Telegraph (PTT) agencies were also increasingly interested in videotex, and had convened discussions in European Conference of Postal and Telecommunications Administrations (CEPT) to co-ordinate developments, which had been diverging along national lines. As well as the British and French standards, the Swedes had proposed extending the British Prestel standard with a new se

    Read more →
  • Locality-sensitive hashing

    Locality-sensitive hashing

    In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. The number of buckets is much smaller than the universe of possible input items. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items. Hashing-based approximate nearest-neighbor search algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as locality-preserving hashing (LPH). Locality-preserving hashing was initially devised as a way to facilitate data pipelining in implementations of massively parallel algorithms that use randomized routing and universal hashing to reduce memory contention and network congestion. == Definitions == A finite family F {\displaystyle {\mathcal {F}}} of functions h : M → S {\displaystyle h\colon M\to S} is defined to be an LSH family for a metric space M = ( M , d ) {\displaystyle {\mathcal {M}}=(M,d)} , a threshold r > 0 {\displaystyle r>0} , an approximation factor c > 1 {\displaystyle c>1} , and probabilities p 1 > p 2 {\displaystyle p_{1}>p_{2}} if it satisfies the following condition. For any two points a , b ∈ M {\displaystyle a,b\in M} and a hash function h {\displaystyle h} chosen uniformly at random from F {\displaystyle {\mathcal {F}}} : If d ( a , b ) ≤ r {\displaystyle d(a,b)\leq r} , then h ( a ) = h ( b ) {\displaystyle h(a)=h(b)} (i.e., a and b collide) with probability at least p 1 {\displaystyle p_{1}} , If d ( a , b ) ≥ c r {\displaystyle d(a,b)\geq cr} , then h ( a ) = h ( b ) {\displaystyle h(a)=h(b)} with probability at most p 2 {\displaystyle p_{2}} . Such a family F {\displaystyle {\mathcal {F}}} is called ( r , c r , p 1 , p 2 ) {\displaystyle (r,cr,p_{1},p_{2})} -sensitive. === LSH with respect to a similarity measure === Alternatively it is possible to define an LSH family on a universe of items U endowed with a similarity function ϕ : U × U → [ 0 , 1 ] {\displaystyle \phi \colon U\times U\to [0,1]} . In this setting, a LSH scheme is a family of hash functions H coupled with a probability distribution D over H such that a function h ∈ H {\displaystyle h\in H} chosen according to D satisfies P r [ h ( a ) = h ( b ) ] = ϕ ( a , b ) {\displaystyle Pr[h(a)=h(b)]=\phi (a,b)} for each a , b ∈ U {\displaystyle a,b\in U} . === Amplification === Given a ( d 1 , d 2 , p 1 , p 2 ) {\displaystyle (d_{1},d_{2},p_{1},p_{2})} -sensitive family F {\displaystyle {\mathcal {F}}} , we can construct new families G {\displaystyle {\mathcal {G}}} by either the AND-construction or OR-construction of F {\displaystyle {\mathcal {F}}} . To create an AND-construction, we define a new family G {\displaystyle {\mathcal {G}}} of hash functions g, where each function g is constructed from k random functions h 1 , … , h k {\displaystyle h_{1},\ldots ,h_{k}} from F {\displaystyle {\mathcal {F}}} . We then say that for a hash function g ∈ G {\displaystyle g\in {\mathcal {G}}} , g ( x ) = g ( y ) {\displaystyle g(x)=g(y)} if and only if all h i ( x ) = h i ( y ) {\displaystyle h_{i}(x)=h_{i}(y)} for i = 1 , 2 , … , k {\displaystyle i=1,2,\ldots ,k} . Since the members of F {\displaystyle {\mathcal {F}}} are independently chosen for any g ∈ G {\displaystyle g\in {\mathcal {G}}} , G {\displaystyle {\mathcal {G}}} is a ( d 1 , d 2 , p 1 k , p 2 k ) {\displaystyle (d_{1},d_{2},p_{1}^{k},p_{2}^{k})} -sensitive family. To create an OR-construction, we define a new family G {\displaystyle {\mathcal {G}}} of hash functions g, where each function g is constructed from k random functions h 1 , … , h k {\displaystyle h_{1},\ldots ,h_{k}} from F {\displaystyle {\mathcal {F}}} . We then say that for a hash function g ∈ G {\displaystyle g\in {\mathcal {G}}} , g ( x ) = g ( y ) {\displaystyle g(x)=g(y)} if and only if h i ( x ) = h i ( y ) {\displaystyle h_{i}(x)=h_{i}(y)} for one or more values of i. Since the members of F {\displaystyle {\mathcal {F}}} are independently chosen for any g ∈ G {\displaystyle g\in {\mathcal {G}}} , G {\displaystyle {\mathcal {G}}} is a ( d 1 , d 2 , 1 − ( 1 − p 1 ) k , 1 − ( 1 − p 2 ) k ) {\displaystyle (d_{1},d_{2},1-(1-p_{1})^{k},1-(1-p_{2})^{k})} -sensitive family. == Applications == LSH has been applied to several problem domains, including: Near-duplicate detection Hierarchical clustering Genome-wide association study Image similarity identification VisualRank Gene expression similarity identification Audio similarity identification Nearest neighbor search Audio fingerprint Digital video fingerprinting Shared memory organization in parallel computing Physical data organization in database management systems Training fully connected neural networks Computer security Machine learning == Methods == === Bit sampling for Hamming distance === One of the easiest ways to construct an LSH family is by bit sampling. This approach works for the Hamming distance over d-dimensional vectors { 0 , 1 } d {\displaystyle \{0,1\}^{d}} . Here, the family F {\displaystyle {\mathcal {F}}} of hash functions is simply the family of all the projections of points on one of the d {\displaystyle d} coordinates, i.e., F = { h : { 0 , 1 } d → { 0 , 1 } ∣ h ( x ) = x i for some i ∈ { 1 , … , d } } {\displaystyle {\mathcal {F}}=\{h\colon \{0,1\}^{d}\to \{0,1\}\mid h(x)=x_{i}{\text{ for some }}i\in \{1,\ldots ,d\}\}} , where x i {\displaystyle x_{i}} is the i {\displaystyle i} th coordinate of x {\displaystyle x} . A random function h {\displaystyle h} from F {\displaystyle {\mathcal {F}}} simply selects a random bit from the input point. This family has the following parameters: P 1 = 1 − R / d {\displaystyle P_{1}=1-R/d} , P 2 = 1 − c R / d {\displaystyle P_{2}=1-cR/d} . That is, any two vectors x , y {\displaystyle x,y} with Hamming distance at most R {\displaystyle R} collide under a random h {\displaystyle h} with probability at least P 1 {\displaystyle P_{1}} . Any x , y {\displaystyle x,y} with Hamming distance at least c R {\displaystyle cR} collide with probability at most P 2 {\displaystyle P_{2}} . === Min-wise independent permutations === Suppose U is composed of subsets of some ground set of enumerable items S and the similarity function of interest is the Jaccard index J. If π is a permutation on the indices of S, for A ⊆ S {\displaystyle A\subseteq S} let h ( A ) = min a ∈ A { π ( a ) } {\displaystyle h(A)=\min _{a\in A}\{\pi (a)\}} . Each possible choice of π defines a single hash function h mapping input sets to elements of S. Define the function family H to be the set of all such functions and let D be the uniform distribution. Given two sets A , B ⊆ S {\displaystyle A,B\subseteq S} the event that h ( A ) = h ( B ) {\displaystyle h(A)=h(B)} corresponds exactly to the event that the minimizer of π over A ∪ B {\displaystyle A\cup B} lies inside A ∩ B {\displaystyle A\cap B} . As h was chosen uniformly at random, P r [ h ( A ) = h ( B ) ] = J ( A , B ) {\displaystyle Pr[h(A)=h(B)]=J(A,B)\,} and ( H , D ) {\displaystyle (H,D)\,} define an LSH scheme for the Jaccard index. Because the symmetric group on n elements has size n!, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized n. Because of this fact, there has been significant work on finding a family of permutations that is "min-wise independent" — a permutation family for which each element of the domain has equal probability of being the minimum under a randomly chosen π. It has been established that a min-wise independent family of permutations is at least of size lcm ⁡ { 1 , 2 , … , n } ≥ e n − o ( n ) {\displaystyle \operatorname {lcm} \{\,1,2,\ldots ,n\,\}\geq e^{n-o(n)}} , and that this bound is tight. Because min-wise independent families are too big for practical applications, two variant notions of min-wise independence are introduced: restricted min-wise independent permutations families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k. Approximate min-wise independence differs from the property by at most a fixed ε. === Open source methods === ==== Nilsimsa Hash ==== Nilsimsa is a locality-sensitive hashing algorithm used in anti-spam efforts. The goal of Nilsimsa is to generate a hash digest of an email message such that the digests of two similar messages are similar to each other. The paper suggests that the Nilsimsa satisfies three requirements: The digest identifying each message should not

    Read more →
  • Growth function

    Growth function

    The growth function, also called the shatter coefficient or the shattering number, measures the richness of a set family or class of functions. It is especially used in the context of statistical learning theory, where it is used to study properties of statistical learning methods. The term 'growth function' was coined by Vapnik and Chervonenkis in their 1968 paper, where they also proved many of its properties. It is a basic concept in machine learning. == Definitions == === Set-family definition === Let H {\displaystyle H} be a set family (a set of sets) and C {\displaystyle C} a set. Their intersection is defined as the following set-family: H ∩ C := { h ∩ C ∣ h ∈ H } {\displaystyle H\cap C:=\{h\cap C\mid h\in H\}} The intersection-size (also called the index) of H {\displaystyle H} with respect to C {\displaystyle C} is | H ∩ C | {\displaystyle |H\cap C|} . If a set C m {\displaystyle C_{m}} has m {\displaystyle m} elements then the index is at most 2 m {\displaystyle 2^{m}} . If the index is exactly 2m then the set C {\displaystyle C} is said to be shattered by H {\displaystyle H} , because H ∩ C {\displaystyle H\cap C} contains all the subsets of C {\displaystyle C} , i.e.: | H ∩ C | = 2 | C | , {\displaystyle |H\cap C|=2^{|C|},} The growth function measures the size of H ∩ C {\displaystyle H\cap C} as a function of | C | {\displaystyle |C|} . Formally: Growth ⁡ ( H , m ) := max C : | C | = m | H ∩ C | {\displaystyle \operatorname {Growth} (H,m):=\max _{C:|C|=m}|H\cap C|} === Hypothesis-class definition === Equivalently, let H {\displaystyle H} be a hypothesis-class (a set of binary functions) and C {\displaystyle C} a set with m {\displaystyle m} elements. The restriction of H {\displaystyle H} to C {\displaystyle C} is the set of binary functions on C {\displaystyle C} that can be derived from H {\displaystyle H} : H C := { ( h ( x 1 ) , … , h ( x m ) ) ∣ h ∈ H , x i ∈ C } {\displaystyle H_{C}:=\{(h(x_{1}),\ldots ,h(x_{m}))\mid h\in H,x_{i}\in C\}} The growth function measures the size of H C {\displaystyle H_{C}} as a function of | C | {\displaystyle |C|} : Growth ⁡ ( H , m ) := max C : | C | = m | H C | {\displaystyle \operatorname {Growth} (H,m):=\max _{C:|C|=m}|H_{C}|} == Examples == 1. The domain is the real line R {\displaystyle \mathbb {R} } . The set-family H {\displaystyle H} contains all the half-lines (rays) from a given number to positive infinity, i.e., all sets of the form { x > x 0 ∣ x ∈ R } {\displaystyle \{x>x_{0}\mid x\in \mathbb {R} \}} for some x 0 ∈ R {\displaystyle x_{0}\in \mathbb {R} } . For any set C {\displaystyle C} of m {\displaystyle m} real numbers, the intersection H ∩ C {\displaystyle H\cap C} contains m + 1 {\displaystyle m+1} sets: the empty set, the set containing the largest element of C {\displaystyle C} , the set containing the two largest elements of C {\displaystyle C} , and so on. Therefore: Growth ⁡ ( H , m ) = m + 1 {\displaystyle \operatorname {Growth} (H,m)=m+1} . The same is true whether H {\displaystyle H} contains open half-lines, closed half-lines, or both. 2. The domain is the segment [ 0 , 1 ] {\displaystyle [0,1]} . The set-family H {\displaystyle H} contains all the open sets. For any finite set C {\displaystyle C} of m {\displaystyle m} real numbers, the intersection H ∩ C {\displaystyle H\cap C} contains all possible subsets of C {\displaystyle C} . There are 2 m {\displaystyle 2^{m}} such subsets, so Growth ⁡ ( H , m ) = 2 m {\displaystyle \operatorname {Growth} (H,m)=2^{m}} . 3. The domain is the Euclidean space R n {\displaystyle \mathbb {R} ^{n}} . The set-family H {\displaystyle H} contains all the half-spaces of the form: x ⋅ ϕ ≥ 1 {\displaystyle x\cdot \phi \geq 1} , where ϕ {\displaystyle \phi } is a fixed vector. Then Growth ⁡ ( H , m ) = Comp ⁡ ( n , m ) {\displaystyle \operatorname {Growth} (H,m)=\operatorname {Comp} (n,m)} , where Comp is the number of components in a partitioning of an n-dimensional space by m hyperplanes. 4. The domain is the real line R {\displaystyle \mathbb {R} } . The set-family H {\displaystyle H} contains all the real intervals, i.e., all sets of the form { x ∈ [ x 0 , x 1 ] | x ∈ R } {\displaystyle \{x\in [x_{0},x_{1}]|x\in \mathbb {R} \}} for some x 0 , x 1 ∈ R {\displaystyle x_{0},x_{1}\in \mathbb {R} } . For any set C {\displaystyle C} of m {\displaystyle m} real numbers, the intersection H ∩ C {\displaystyle H\cap C} contains all runs of between 0 and m {\displaystyle m} consecutive elements of C {\displaystyle C} . The number of such runs is ( m + 1 2 ) + 1 {\displaystyle {m+1 \choose 2}+1} , so Growth ⁡ ( H , m ) = ( m + 1 2 ) + 1 {\displaystyle \operatorname {Growth} (H,m)={m+1 \choose 2}+1} . == Polynomial or exponential == The main property that makes the growth function interesting is that it can be either polynomial or exponential - nothing in-between. The following is a property of the intersection-size: If, for some set C m {\displaystyle C_{m}} of size m {\displaystyle m} , and for some number n ≤ m {\displaystyle n\leq m} , | H ∩ C m | ≥ Comp ⁡ ( n , m ) {\displaystyle |H\cap C_{m}|\geq \operatorname {Comp} (n,m)} - then, there exists a subset C n ⊆ C m {\displaystyle C_{n}\subseteq C_{m}} of size n {\displaystyle n} such that | H ∩ C n | = 2 n {\displaystyle |H\cap C_{n}|=2^{n}} . This implies the following property of the Growth function. For every family H {\displaystyle H} there are two cases: The exponential case: Growth ⁡ ( H , m ) = 2 m {\displaystyle \operatorname {Growth} (H,m)=2^{m}} identically. The polynomial case: Growth ⁡ ( H , m ) {\displaystyle \operatorname {Growth} (H,m)} is majorized by Comp ⁡ ( n , m ) ≤ m n + 1 {\displaystyle \operatorname {Comp} (n,m)\leq m^{n}+1} , where n {\displaystyle n} is the smallest integer for which Growth ⁡ ( H , n ) < 2 n {\displaystyle \operatorname {Growth} (H,n)<2^{n}} . == Other properties == === Trivial upper bound === For any finite H {\displaystyle H} : Growth ⁡ ( H , m ) ≤ | H | {\displaystyle \operatorname {Growth} (H,m)\leq |H|} since for every C {\displaystyle C} , the number of elements in H ∩ C {\displaystyle H\cap C} is at most | H | {\displaystyle |H|} . Therefore, the growth function is mainly interesting when H {\displaystyle H} is infinite. === Exponential upper bound === For any nonempty H {\displaystyle H} : Growth ⁡ ( H , m ) ≤ 2 m {\displaystyle \operatorname {Growth} (H,m)\leq 2^{m}} I.e, the growth function has an exponential upper-bound. We say that a set-family H {\displaystyle H} shatters a set C {\displaystyle C} if their intersection contains all possible subsets of C {\displaystyle C} , i.e. H ∩ C = 2 C {\displaystyle H\cap C=2^{C}} . If H {\displaystyle H} shatters C {\displaystyle C} of size m {\displaystyle m} , then Growth ⁡ ( H , C ) = 2 m {\displaystyle \operatorname {Growth} (H,C)=2^{m}} , which is the upper bound. === Cartesian intersection === Define the Cartesian intersection of two set-families as: H 1 ⨂ H 2 := { h 1 ∩ h 2 ∣ h 1 ∈ H 1 , h 2 ∈ H 2 } {\displaystyle H_{1}\bigotimes H_{2}:=\{h_{1}\cap h_{2}\mid h_{1}\in H_{1},h_{2}\in H_{2}\}} . Then: Growth ⁡ ( H 1 ⨂ H 2 , m ) ≤ Growth ⁡ ( H 1 , m ) ⋅ Growth ⁡ ( H 2 , m ) {\displaystyle \operatorname {Growth} (H_{1}\bigotimes H_{2},m)\leq \operatorname {Growth} (H_{1},m)\cdot \operatorname {Growth} (H_{2},m)} === Union === For every two set-families: Growth ⁡ ( H 1 ∪ H 2 , m ) ≤ Growth ⁡ ( H 1 , m ) + Growth ⁡ ( H 2 , m ) {\displaystyle \operatorname {Growth} (H_{1}\cup H_{2},m)\leq \operatorname {Growth} (H_{1},m)+\operatorname {Growth} (H_{2},m)} === VC dimension === The VC dimension of H {\displaystyle H} is defined according to these two cases: In the polynomial case, VCDim ⁡ ( H ) = n − 1 {\displaystyle \operatorname {VCDim} (H)=n-1} = the largest integer d {\displaystyle d} for which Growth ⁡ ( H , d ) = 2 d {\displaystyle \operatorname {Growth} (H,d)=2^{d}} . In the exponential case VCDim ⁡ ( H ) = ∞ {\displaystyle \operatorname {VCDim} (H)=\infty } . So VCDim ⁡ ( H ) ≥ d {\displaystyle \operatorname {VCDim} (H)\geq d} if-and-only-if Growth ⁡ ( H , d ) = 2 d {\displaystyle \operatorname {Growth} (H,d)=2^{d}} . The growth function can be regarded as a refinement of the concept of VC dimension. The VC dimension only tells us whether Growth ⁡ ( H , d ) {\displaystyle \operatorname {Growth} (H,d)} is equal to or smaller than 2 d {\displaystyle 2^{d}} , while the growth function tells us exactly how Growth ⁡ ( H , m ) {\displaystyle \operatorname {Growth} (H,m)} changes as a function of m {\displaystyle m} . Another connection between the growth function and the VC dimension is given by the Sauer–Shelah lemma: If VCDim ⁡ ( H ) = d {\displaystyle \operatorname {VCDim} (H)=d} , then: for all m {\displaystyle m} : Growth ⁡ ( H , m ) ≤ ∑ i = 0 d ( m i ) {\displaystyle \operatorname {Growth} (H,m)\leq \sum _{i=0}^{d}{m \choose i}} In particular, for all m > d + 1 {\displaystyle m>d+1} : Growth ⁡ ( H , m ) ≤ ( e m / d ) d = O ( m d ) {\displaystyle \operatorname {Growth} (H,m)\leq (

    Read more →
  • Density-based clustering validation

    Density-based clustering validation

    Density-Based Clustering Validation (DBCV) is a metric designed to assess the quality of clustering solutions, particularly for density-based clustering algorithms like DBSCAN, Mean shift, and OPTICS. This metric is particularly suited for identifying concave and nested clusters, where traditional metrics such as the Silhouette coefficient, Davies–Bouldin index, or Calinski–Harabasz index often struggle to provide meaningful evaluations. Unlike traditional validation measures, which often rely on compact and well-separated clusters, DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence. This metric was introduced in 2014 by David Moulavi and colleagues in their work. It utilizes density connectivity principles to quantify clustering structures, making it especially effective at detecting arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable. The DBCV index has been employed for clustering analysis in bioinformatics, ecology, techno-economy, and health informatics , as well as in numerous other fields. == Definition == DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset X = x 1 , x 2 , . . . , x n {\displaystyle X={x_{1},x_{2},...,x_{n}}} , a density-based algorithm partitions it into K clusters C 1 , C 2 , . . . , C K {\displaystyle {C_{1},C_{2},...,C_{K}}} . Each point x i {\displaystyle x_{i}} belongs to a specific cluster, denoted as C c l u s t e r ( x i ) {\displaystyle C_{cluster(x_{i})}} A key concept in DBCV index is the notion of density-connected paths. Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The density-based distance between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory. DBCV index extends the Silhouette coefficient by redefining cluster cohesion and separation using density-based distances: Within-cluster density distance measures how closely a point is related to other members of its cluster: a i = 1 | C c l u s t e r ( x i ) | − 1 ∑ x j ∈ C c l u s t e r ( x i ) , y ≠ x d d e n s i t y ( x j , x i ) {\displaystyle a_{i}={\frac {1}{|C_{cluster(x_{i})}|-1}}\sum _{x_{j}\in C_{cluster(x_{i})},y\neq x}d_{density}(x_{j},x_{i})} Nearest-cluster density distance quantifies how far a point is from the closest external cluster: b i = min C ≠ C cluster ( x i ) C ∈ { C 1 , … , C K } ( 1 | C | ∑ x j ∈ C d density ( x i , x j ) ) . {\displaystyle b_{i}=\min _{C\neq C_{{\text{cluster}}(x_{i})} \atop C\in \{C_{1},\dots ,C_{K}\}}\left({\frac {1}{|C|}}\sum _{x_{j}\in C}d_{\text{density}}(x_{i},x_{j})\right).} Using these measures, the DBCV index is computed as: D B C V = 1 n ∑ i = 1 n b i − a i max ( a i , b i ) {\displaystyle DBCV={\frac {1}{n}}\sum _{i=1}^{n}{\frac {b_{i}-a_{i}}{\max(a_{i},b_{i})}}} == Explanation == DBCV index values range between −1 and +1: +1: Strongly cohesive and well-separated clusters. 0: Ambiguous clustering structure. −1: Poorly formed clusters or incorrect assignments. By leveraging density-based distances instead of traditional Euclidean measures, DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions.

    Read more →
  • Picture Prowler

    Picture Prowler

    Picture Prowler was an early piece of photo management software developed around and meant to show off Xing Technology's JPEG image decompression library during the early 1990s. Little known today, it featured thumbnail based picture management, printing, etc. The primary developer was Ray Bunnage from compression / decompression libraries developed by Howard Gordon and Chris Eddy.

    Read more →
  • BookCorpus

    BookCorpus

    BookCorpus (also sometimes referred to as the Toronto Book Corpus) is a dataset consisting of the text of around 7,000 self-published books scraped from the indie ebook distribution website Smashwords. It was the main corpus used to train the initial GPT model by OpenAI, and has been used as training data for other early large language models including Google's BERT. The dataset consists of around 985 million words, and the books that comprise it span a range of genres, including romance, science fiction, and fantasy. The corpus was introduced in a 2015 paper by researchers from the University of Toronto and MIT titled "Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books". The authors described it as consisting of "free books written by yet unpublished authors," yet this is factually incorrect. These books were published by self-published ("indie") authors who priced them at free; the books were downloaded without the consent or permission of Smashwords or Smashwords authors and in violation of the Smashwords Terms of Service. The dataset was initially hosted on a University of Toronto webpage. An official version of the original dataset is no longer publicly available, though at least one substitute, BookCorpusOpen, has been created. Though not documented in the original 2015 paper, the site from which the corpus's books were scraped is now known to be Smashwords.

    Read more →
  • Characteristic samples

    Characteristic samples

    Characteristic samples is a concept in the field of grammatical inference, related to passive learning. In passive learning, an inference algorithm I {\displaystyle I} is given a set of pairs of strings and labels S {\displaystyle S} , and returns a representation R {\displaystyle R} that is consistent with S {\displaystyle S} . Characteristic samples consider the scenario when the goal is not only finding a representation consistent with S {\displaystyle S} , but finding a representation that recognizes a specific target language. A characteristic sample of language L {\displaystyle L} is a set of pairs of the form ( s , l ( s ) ) {\displaystyle (s,l(s))} where: l ( s ) = 1 {\displaystyle l(s)=1} if and only if s ∈ L {\displaystyle s\in L} l ( s ) = − 1 {\displaystyle l(s)=-1} if and only if s ∉ L {\displaystyle s\notin L} Given the characteristic sample S {\displaystyle S} , I {\displaystyle I} 's output on it is a representation R {\displaystyle R} , e.g. an automaton, that recognizes L {\displaystyle L} . == Formal Definition == === The Learning Paradigm associated with Characteristic Samples === There are three entities in the learning paradigm connected to characteristic samples, the adversary, the teacher and the inference algorithm. Given a class of languages C {\displaystyle \mathbb {C} } and a class of representations for the languages R {\displaystyle \mathbb {R} } , the paradigm goes as follows: The adversary A {\displaystyle A} selects a language L ∈ C {\displaystyle L\in \mathbb {C} } and reports it to the teacher The teacher T {\displaystyle T} then computes a set of strings and label them correctly according to L {\displaystyle L} , trying to make sure that the inference algorithm will compute L {\displaystyle L} The adversary can add correctly labeled words to the set in order to confuse the inference algorithm The inference algorithm I {\displaystyle I} gets the sample and computes a representation R ∈ R {\displaystyle R\in \mathbb {R} } consistent with the sample. The goal is that when the inference algorithm receives a characteristic sample for a language L {\displaystyle L} , or a sample that subsumes a characteristic sample for L {\displaystyle L} , it will return a representation that recognizes exactly the language L {\displaystyle L} . === Sample === Sample S {\displaystyle S} is a set of pairs of the form ( s , l ( s ) ) {\displaystyle (s,l(s))} such that l ( s ) ∈ { − 1 , 1 } {\displaystyle l(s)\in \{-1,1\}} ==== Sample consistent with a language ==== We say that a sample S {\displaystyle S} is consistent with language L {\displaystyle L} if for every pair ( s , l ( s ) ) {\displaystyle (s,l(s))} in S {\displaystyle S} : l ( s ) = 1 if and only if s ∈ L {\displaystyle l(s)=1{\text{ if and only if }}s\in L} l ( s ) = − 1 if and only if s ∉ L {\displaystyle l(s)=-1{\text{ if and only if }}s\notin L} === Characteristic sample === Given an inference algorithm I {\displaystyle I} and a language L {\displaystyle L} , a sample S {\displaystyle S} that is consistent with L {\displaystyle L} is called a characteristic sample of L {\displaystyle L} for I {\displaystyle I} if: I {\displaystyle I} 's output on S {\displaystyle S} is a representation R {\displaystyle R} that recognizes L {\displaystyle L} . For every sample D {\displaystyle D} that is consistent with L {\displaystyle L} and also fulfils S ⊆ D {\displaystyle S\subseteq D} , I {\displaystyle I} 's output on D {\displaystyle D} is a representation R {\displaystyle R} that recognizes L {\displaystyle L} . A Class of languages C {\displaystyle \mathbb {C} } is said to have charistaristic samples if every L ∈ C {\displaystyle L\in \mathbb {C} } has a characteristic sample. == Related Theorems == === Theorem === If equivalence is undecidable for a class C {\textstyle \mathbb {C} } over Σ {\textstyle \Sigma } of cardinality bigger than 1, then C {\textstyle \mathbb {C} } doesn't have characteristic samples. ==== Proof ==== Given a class of representations C {\textstyle \mathbb {C} } such that equivalence is undecidable, for every polynomial p ( x ) {\displaystyle p(x)} and every n ∈ N {\displaystyle n\in \mathbb {N} } , there exist two representations r 1 {\displaystyle r_{1}} and r 2 {\displaystyle r_{2}} of sizes bounded by n {\displaystyle n} , that recognize different languages but are inseparable by any string of size bounded by p ( n ) {\displaystyle p(n)} . Assuming this is not the case, we can decide if r 1 {\displaystyle r_{1}} and r 2 {\displaystyle r_{2}} are equivalent by simulating their run on all strings of size smaller than p ( n ) {\displaystyle p(n)} , contradicting the assumption that equivalence is undecidable. === Theorem === If S 1 {\displaystyle S_{1}} is a characteristic sample for L 1 {\displaystyle L_{1}} and is also consistent with L 2 {\displaystyle L_{2}} , then every characteristic sample of L 2 {\displaystyle L_{2}} , is inconsistent with L 1 {\displaystyle L_{1}} . ==== Proof ==== Given a class C {\textstyle \mathbb {C} } that has characteristic samples, let R 1 {\displaystyle R_{1}} and R 2 {\displaystyle R_{2}} be representations that recognize L 1 {\displaystyle L_{1}} and L 2 {\displaystyle L_{2}} respectively. Under the assumption that there is a characteristic sample for L 1 {\displaystyle L_{1}} , S 1 {\displaystyle S_{1}} that is also consistent with L 2 {\displaystyle L_{2}} , we'll assume falsely that there exist a characteristic sample for L 2 {\displaystyle L_{2}} , S 2 {\displaystyle S_{2}} that is consistent with L 1 {\displaystyle L_{1}} . By the definition of characteristic sample, the inference algorithm I {\displaystyle I} must return a representation which recognizes the language if given a sample that subsumes the characteristic sample itself. But for the sample S 1 ∪ S 2 {\displaystyle S_{1}\cup S_{2}} , the answer of the inferring algorithm needs to recognize both L 1 {\displaystyle L_{1}} and L 2 {\displaystyle L_{2}} , in contradiction. === Theorem === If a class is polynomially learnable by example based queries, it is learnable with characteristic samples. == Polynomialy characterizable classes == === Regular languages === The proof that DFA's are learnable using characteristic samples, relies on the fact that every regular language has a finite number of equivalence classes with respect to the right congruence relation, ∼ L {\displaystyle \sim _{L}} (where x ∼ L y {\displaystyle x\sim _{L}y} for x , y ∈ Σ ∗ {\displaystyle x,y\in \Sigma ^{}} if and only if ∀ z ∈ Σ ∗ : x z ∈ L ↔ y z ∈ L {\displaystyle \forall z\in \Sigma ^{}:xz\in L\leftrightarrow yz\in L} ). Note that if x {\displaystyle x} , y {\displaystyle y} are not congruent with respect to ∼ L {\displaystyle \sim _{L}} , there exists a string z {\displaystyle z} such that x z ∈ L {\displaystyle xz\in L} but y z ∉ L {\displaystyle yz\notin L} or vice versa, this string is called a separating suffix. ==== Constructing a characteristic sample ==== The construction of a characteristic sample for a language L {\displaystyle L} by the teacher goes as follows. Firstly, by running a depth first search on a deterministic automaton A {\displaystyle A} recognizing L {\displaystyle L} , starting from its initial state, we get a suffix closed set of words, W {\displaystyle W} , ordered in shortlex order. From the fact above, we know that for every two states in the automaton, there exists a separating suffix that separates between every two strings that the run of A {\displaystyle A} on them ends in the respective states. We refer to the set of separating suffixes as S {\displaystyle S} . The labeled set (sample) of words the teacher gives the adversary is { ( w , l ( w ) ) | w ∈ W ⋅ S ∪ W ⋅ Σ ⋅ S } {\displaystyle \{(w,l(w))|w\in W\cdot S\cup W\cdot \Sigma \cdot S\}} where l ( w ) {\displaystyle l(w)} is the correct label of w {\displaystyle w} (whether it is in L {\displaystyle L} or not). We may assume that ϵ ∈ S {\displaystyle \epsilon \in S} . ==== Constructing a deterministic automata ==== Given the sample from the adversary W {\displaystyle W} , the construction of the automaton by the inference algorithm I {\displaystyle I} starts with defining P = prefix ( W ) {\displaystyle P={\text{prefix}}(W)} and S = suffix ( W ) {\displaystyle S={\text{suffix}}(W)} , which are the set of prefixes and suffixes of W {\displaystyle W} respectively. Now the algorithm constructs a matrix M {\displaystyle M} where the elements of P {\displaystyle P} function as the rows, ordered by the shortlex order, and the elements of S {\displaystyle S} function as the columns, ordered by the shortlex order. Next, the cells in the matrix are filled in the following manner for prefix p i {\displaystyle p_{i}} and suffix s j {\displaystyle s_{j}} : If p i s j ∈ W → M i j = l ( p i s j ) {\displaystyle p_{i}s_{j}\in W\rightarrow M_{ij}=l(p_{i}s_{j})} else, M i j = 0 {\displaystyle M_{ij}=0} Now, we say row i {\displaystyle i} and t {\displaystyle t} are distinguishable if there exi

    Read more →
  • Vowpal Wabbit

    Vowpal Wabbit

    Vowpal Wabbit (VW) is an open-source fast online interactive machine learning system library and program developed originally at Yahoo! Research, and currently at Microsoft Research. It was started and is led by John Langford. Vowpal Wabbit's interactive learning support is particularly notable including Contextual Bandits, Active Learning, and forms of guided Reinforcement Learning. Vowpal Wabbit provides an efficient scalable out-of-core implementation with support for a number of machine learning reductions, importance weighting, and a selection of different loss functions and optimization algorithms. == Notable features == The VW program supports: Multiple supervised (and semi-supervised) learning problems: Classification (both binary and multi-class) Regression Active learning (partially labeled data) for both regression and classification Multiple learning algorithms (model-types / representations) OLS regression Matrix factorization (sparse matrix SVD) Single layer neural net (with user specified hidden layer node count) Searn (Search and Learn) Latent Dirichlet Allocation (LDA) Stagewise polynomial approximation Recommend top-K out of N One-against-all (OAA) and cost-sensitive OAA reduction for multi-class Weighted all pairs Contextual-bandit (with multiple exploration/exploitation strategies) Multiple loss functions: squared error quantile hinge logistic poisson Multiple optimization algorithms Stochastic gradient descent (SGD) BFGS Conjugate gradient Regularization (L1 norm, L2 norm, & elastic net regularization) Flexible input - input features may be: Binary Numerical Categorical (via flexible feature-naming and the hash trick) Can deal with missing values/sparse-features Other features On the fly generation of feature interactions (quadratic and cubic) On the fly generation of N-grams with optional skips (useful for word/language data-sets) Automatic test-set holdout and early termination on multiple passes bootstrapping User settable online learning progress report + auditing of the model Hyperparameter optimization == Scalability == Vowpal wabbit has been used to learn a tera-feature (1012) data-set on 1000 nodes in one hour. Its scalability is aided by several factors: Out-of-core online learning: no need to load all data into memory The hashing trick: feature identities are converted to a weight index via a hash (uses 32-bit MurmurHash3) Exploiting multi-core CPUs: parsing of input and learning are done in separate threads. Compiled C++ code

    Read more →
  • Incremental heuristic search

    Incremental heuristic search

    Incremental heuristic search algorithms combine both incremental and heuristic search to speed up searches of sequences of similar search problems, which is important in domains that are only incompletely known or change dynamically. Incremental search has been studied at least since the late 1960s. Incremental search algorithms reuse information from previous searches to speed up the current search and solve search problems potentially much faster than solving them repeatedly from scratch. Similarly, heuristic search has also been studied at least since the late 1960s. Heuristic search algorithms, often based on A, use heuristic knowledge in the form of approximations of the goal distances to focus the search and solve search problems potentially much faster than uninformed search algorithms. The resulting search problems, sometimes called dynamic path planning problems, are graph search problems where paths have to be found repeatedly because the topology of the graph, its edge costs, the start vertex or the goal vertices change over time. So far, three main classes of incremental heuristic search algorithms have been developed: The first class restarts A at the point where its current search deviates from the previous one (example: Fringe Saving A). The second class updates the h-values (heuristic, i.e. approximate distance to goal) from the previous search during the current search to make them more informed (example: Generalized Adaptive A). The third class updates the g-values (distance from start) from the previous search during the current search to correct them when necessary, which can be interpreted as transforming the A search tree from the previous search into the A search tree for the current search (examples: Lifelong Planning A, D, D Lite). All three classes of incremental heuristic search algorithms are different from other replanning algorithms, such as planning by analogy, in that their plan quality does not deteriorate with the number of replanning episodes. == Applications == Incremental heuristic search has been extensively used in robotics, where a larger number of path planning systems are based on either D (typically earlier systems) or D Lite (current systems), two different incremental heuristic search algorithms.

    Read more →
  • One-shot learning (computer vision)

    One-shot learning (computer vision)

    One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning-based object categorization algorithms require training on hundreds or thousands of examples, one-shot learning aims to classify objects from one, or only a few, examples. The term few-shot learning is also used for these problems, especially when more than one example is needed. == Motivation == The ability to learn object categories from few examples, and at a rapid pace, has been demonstrated in humans. It is estimated that a child learns almost all of the 10 ~ 30 thousand object categories in the world by age six. This is due not only to the human mind's computational power, but also to its ability to synthesize and learn new object categories from existing information about different, previously learned categories. Given two examples from two object categories: one, an unknown object composed of familiar shapes, the second, an unknown, amorphous shape; it is much easier for humans to recognize the former than the latter, suggesting that humans make use of previously learned categories when learning new ones. The key motivation for solving one-shot learning is that systems, like humans, can use knowledge about object categories to classify new objects. == Background == As with most classification schemes, one-shot learning involves three main challenges: Representation: How should objects and categories be described? Learning: How can such descriptions be created? Recognition: How can a known object be filtered from enveloping clutter, irrespective of occlusion, viewpoint, and lighting? One-shot learning differs from single object recognition and standard category recognition algorithms in its emphasis on knowledge transfer, which makes use of previously learned categories. Model parameters: Reuses model parameters, based on the similarity between old and new categories. Categories are first learned on numerous training examples, then new categories are learned using transformations of model parameters from those initial categories or selecting relevant parameters for a classifier. Feature sharing: Shares parts or features of objects across categories. One algorithm extracts "diagnostic information" in patches from already learned categories by maximizing the patches' mutual information, and then applies these features to the learning of a new category. A dog category, for example, may be learned in one shot from previous knowledge of horse and cow categories, because dog objects may contain similar distinguishing patches. Contextual information: Appeals to global knowledge of the scene in which the object appears. Such global information can be used as frequency distributions in a conditional random field framework to recognize objects. Alternatively context can consider camera height and scene geometry. Algorithms of this type have two advantages. First, they learn object categories that are relatively dissimilar; and second, they perform well in ad hoc situations where an image has not been hand-cropped and aligned. == Theory == The Bayesian one-shot learning algorithm represents the foreground and background of images as parametrized by a mixture of constellation models. During the learning phase, the parameters of these models are learned using a conjugate density parameter posterior and variational Bayesian expectation–maximization (VBEM). In this stage the previously learned object categories inform the choice of model parameters via transfer by contextual information. For object recognition on new images, the posterior obtained during the learning phase is used in a Bayesian decision framework to estimate the ratio of p(object | test, train) to p(background clutter | test, train) where p is the probability of the outcome. === Bayesian framework === Given the task of finding a particular object in a query image, the overall objective of the Bayesian one-shot learning algorithm is to compare the probability that object is present vs the probability that only background clutter is present. If the former probability is higher, the algorithm reports the object's presence, otherwise the algorithm reports its absence. To compute these probabilities, the object class must be modeled from a set of (1 ~ 5) training images containing examples. To formalize these ideas, let I {\displaystyle I} be the query image, which contains either an example of the foreground category O f g {\displaystyle O_{fg}} or only background clutter of a generic background category O b g {\displaystyle O_{bg}} . Also let I t {\displaystyle I_{t}} be the set of training images used as the foreground category. The decision of whether I {\displaystyle I} contains an object from the foreground category, or only clutter from the background category is: R = p ( O f g | I , I t ) p ( O b g | I , I t ) = p ( I | I t , O f g ) p ( O f g ) p ( I | I t , O b g ) p ( O b g ) , {\displaystyle R={\frac {p(O_{fg}|I,I_{t})}{p(O_{bg}|I,I_{t})}}={\frac {p(I|I_{t},O_{fg})p(O_{fg})}{p(I|I_{t},O_{bg})p(O_{bg})}},} where the class posteriors p ( O f g | I , I t ) {\displaystyle p(O_{fg}|I,I_{t})} and p ( O b g | I , I t ) {\displaystyle p(O_{bg}|I,I_{t})} have been expanded by Bayes' theorem, yielding a ratio of likelihoods and a ratio of object category priors. We decide that the image I {\displaystyle I} contains an object from the foreground class if R {\displaystyle R} exceeds a certain threshold T {\displaystyle T} . We next introduce parametric models for the foreground and background categories with parameters θ {\displaystyle \theta } and θ b g {\displaystyle \theta _{bg}} respectively. This foreground parametric model is learned during the learning stage from I t {\displaystyle I_{t}} , as well as prior information of learned categories. The background model we assume to be uniform across images. Omitting the constant ratio of category priors, p ( O f g ) p ( O b g ) {\displaystyle {\frac {p(O_{fg})}{p(O_{bg})}}} , and parametrizing over θ {\displaystyle \theta } and θ b g {\displaystyle \theta _{bg}} yields R ∝ ∫ p ( I | θ , O f g ) p ( θ | I t , O f g ) d θ ∫ p ( I | θ b g , O b g ) p ( θ b g | I t , O b g ) d θ b g = ∫ p ( I | θ ) p ( θ | I t , O f g ) d θ ∫ p ( I | θ b g ) p ( θ b g | I t , O b g ) d θ b g {\displaystyle R\propto {\frac {\int {p(I|\theta ,O_{fg})p(\theta |I_{t},O_{fg})}d\theta }{\int {p(I|\theta _{bg},O_{bg})p(\theta _{bg}|I_{t},O_{bg})}d\theta _{bg}}}={\frac {\int {p(I|\theta )p(\theta |I_{t},O_{fg})}d\theta }{\int {p(I|\theta _{bg})p(\theta _{bg}|I_{t},O_{bg})}d\theta _{bg}}}} , having simplified p ( I | θ , O f g ) {\displaystyle p(I|\theta ,O_{fg})} and p ( I | θ , O b g ) {\displaystyle p(I|\theta ,O_{bg})} to p ( I | θ f g ) {\displaystyle p(I|\theta _{fg})} and p ( I | θ b g ) . {\displaystyle p(I|\theta _{bg}).} The posterior distribution of model parameters given the training images, p ( θ | I t , O f g ) {\displaystyle p(\theta |I_{t},O_{fg})} is estimated in the learning phase. In this estimation, one-shot learning differs sharply from more traditional Bayesian estimation models that approximate the integral as δ ( θ M L ) {\displaystyle \delta (\theta ^{ML})} . Instead, it uses a variational approach using prior information from previously learned categories. However, the traditional maximum likelihood estimation of the model parameters is used for the background model and the categories learned in advance through training. === Object category model === For each query image I {\displaystyle I} and training images I t {\displaystyle I_{t}} , a constellation model is used for representation. To obtain this model for a given image I {\displaystyle I} , first a set of N interesting regions is detected in the image using the Kadir–Brady saliency detector. Each region selected is represented by a location in the image, X i {\displaystyle X_{i}} and a description of its appearance, A i {\displaystyle A_{i}} . Letting X = ∑ i = 1 N X i , A = ∑ i = 1 N A i {\displaystyle X=\sum _{i=1}^{N}X_{i},A=\sum _{i=1}^{N}A_{i}} and X t {\displaystyle X_{t}} and A t {\displaystyle A_{t}} the analogous representations for training images, the expression for R becomes: R ∝ ∫ p ( X , A | θ , O f g ) p ( θ | X t , A t , O f g ) d θ ∫ p ( X , A | θ b g , O b g ) p ( θ b g | X t , A t , O b g ) d θ b g = ∫ p ( X , A | θ ) p ( θ | X t , A t , O f g ) d θ ∫ p ( X , A | θ b g ) p ( θ b g | X t , A t , O b g ) d θ b g {\displaystyle R\propto {\frac {\int {p(X,A|\theta ,O_{fg})p(\theta |X_{t},A_{t},O_{fg})}d\theta }{\int {p(X,A|\theta _{bg},O_{bg})p(\theta _{bg}|X_{t},A_{t},O_{bg})}d\theta _{bg}}}={\frac {\int {p(X,A|\theta )p(\theta |X_{t},A_{t},O_{fg})}d\theta }{\int {p(X,A|\theta _{bg})p(\theta _{bg}|X_{t},A_{t},O_{bg})}\,d\theta _{bg}}}} The likelihoods p ( X , A | θ ) {\displaystyle p(X,A|\theta )} and p ( X , A | θ b g ) {\displaystyle p(X,A|\theta _{bg})} are represented as mixtures of constellation models. A typical constellation model has

    Read more →
  • Principal component analysis

    Principal component analysis

    Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data are linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified. The principal components of a collection of points in a real coordinate space are a sequence of p {\displaystyle p} unit vectors, where the i {\displaystyle i} -th vector is the direction of a line that best fits the data while being orthogonal to the first i − 1 {\displaystyle i-1} vectors. Here, a best-fitting line is defined as one that minimizes the average squared perpendicular distance from the points to the line. These directions (i.e., principal components) constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated. Many studies use the first two principal components in order to plot the data in two dimensions and to visually identify clusters of closely related data points. Principal component analysis has applications in many fields such as population genetics, microbiome studies, and atmospheric science. == Overview == When performing PCA, the first principal component of a set of p {\displaystyle p} variables is the derived variable formed as a linear combination of the original variables that explains the most variance. The second principal component explains the most variance in what is left once the effect of the first component is removed, and we may proceed through p {\displaystyle p} iterations until all the variance is explained. PCA is most commonly used when many of the variables are highly correlated with each other and it is desirable to reduce their number to an independent set. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The i {\displaystyle i} -th principal component can be taken as a direction orthogonal to the first i − 1 {\displaystyle i-1} principal components that maximizes the variance of the projected data. For either objective, it can be shown that the principal components are eigenvectors of the data's covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or singular value decomposition of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix. PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset. Robust and L1-norm-based variants of standard PCA have also been proposed. == History == PCA was invented in 1901 by Karl Pearson, as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s. Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT) in signal processing, the Hotelling transform in multivariate quality control, proper orthogonal decomposition (POD) in mechanical engineering, singular value decomposition (SVD) of X (invented in the last quarter of the 19th century), eigenvalue decomposition (EVD) of XTX in linear algebra, factor analysis (for a discussion of the differences between PCA and factor analysis see Ch. 7 of Jolliffe's Principal Component Analysis), Eckart–Young theorem (Harman, 1960), or empirical orthogonal functions (EOF) in meteorological science (Lorenz, 1956), empirical eigenfunction decomposition (Sirovich, 1987), quasiharmonic modes (Brooks et al., 1988), spectral decomposition in noise and vibration, and empirical modal analysis in structural dynamics. == Intuition == PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small. To find the axes of the ellipsoid, we must first center the values of each variable in the dataset on 0 by subtracting the mean of the variable's observed values from each of those values. These transformed values are used instead of the original observed values for each of the variables. Then, we compute the covariance matrix of the data and calculate the eigenvalues and corresponding eigenvectors of this covariance matrix. Then we must normalize each of the orthogonal eigenvectors to turn them into unit vectors. Once this is done, each of the mutually-orthogonal unit eigenvectors can be interpreted as an axis of the ellipsoid fitted to the data. This choice of basis will transform the covariance matrix into a diagonalized form, in which the diagonal elements represent the variance of each axis. The proportion of the variance that each eigenvector represents can be calculated by dividing the eigenvalue corresponding to that eigenvector by the sum of all eigenvalues. Biplots and scree plots (degree of explained variance) are used to interpret findings of the PCA. == Details == PCA is defined as an orthogonal linear transformation on a real inner product space that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. Consider an n × p {\displaystyle n\times p} data matrix, X, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of feature (say, the results from a particular sensor). Mathematically, the transformation is defined by a set of size l {\displaystyle l} (where l {\displaystyle l} is usually selected to be strictly less than p {\displaystyle p} to reduce dimensionality) of p {\displaystyle p} -dimensional vectors of weights or coefficients w ( k ) = ( w 1 , … , w p ) ( k ) {\displaystyle \mathbf {w} _{(k)}=(w_{1},\dots ,w_{p})_{(k)}} that map each row vector x ( i ) = ( x 1 , … , x p ) ( i ) {\displaystyle \mathbf {x} _{(i)}=(x_{1},\dots ,x_{p})_{(i)}} of X to a new vector of principal component scores t ( i ) = ( t 1 , … , t l ) ( i ) {\displaystyle \mathbf {t} _{(i)}=(t_{1},\dots ,t_{l})_{(i)}} , given by t k ( i ) = x ( i ) ⋅ w ( k ) f o r i = 1 , … , n k = 1 , … , l {\displaystyle {t_{k}}_{(i)}=\mathbf {x} _{(i)}\cdot \mathbf {w} _{(k)}\qquad \mathrm {for} \qquad i=1,\dots ,n\qquad k=1,\dots ,l} in such a way that the individual variables t 1 , … , t l {\displaystyle t_{1},\dots ,t_{l}} of t considered over the data set successively inherit the maximum possible variance from X, with each coefficient vector w constrained to be a unit vector. The above may equivalently be written in matrix form as T = X W {\displaystyle \mathbf {T} =\mathbf {X} \mathbf {W} } where T i k = t k ( i ) {\displaystyle {\mathbf {T} }_{ik}={t_{k}}_{(i)}} , X i j = x j ( i ) {\displaystyle {\mathbf {X} }_{ij}={x_{j}}_{(i)}} , and W j k = w j ( k ) {\displaystyle {\mathbf {W} }_{jk}={w_{j}}_{(k)}} . === First component === In order to maximize variance, the first weight vector w(1) thus has to satisfy w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( t 1 ) ( i ) 2 } = arg ⁡ max ‖ w ‖ = 1 { ∑ i ( x ( i ) ⋅ w ) 2 } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}(t_{1})_{(i)}^{2}\right\}=\arg \max _{\Vert \mathbf {w} \Vert =1}\,\left\{\sum _{i}\left(\mathbf {x} _{(i)}\cdot \mathbf {w} \right)^{2}\right\}} Equivalently, writing this in matrix form gives w ( 1 ) = arg ⁡ max ‖ w ‖ = 1 { ‖ X w ‖ 2 } = arg ⁡ max ‖ w ‖ = 1 { w T X T X w } {\displaystyle \mathbf {w} _{(1)}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\left\|\mathbf {Xw} \right\|^{2}\right\}=\arg \max _{\left\|\mathbf {w} \right\|=1}\left\{\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} \right\}} Since w(1) has been defined to be a unit vector, it equivalently also satisfies w ( 1 ) = arg ⁡ max { w T X T X w w T w } {\displaystyle \mathbf {w} _{(1)}=\arg \max \left\{{\frac {\mathbf {w} ^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {Xw} }{\mathbf {w} ^{\mathsf {T}}\mathbf {w} }}\right\}} The quantity to be maximised can be recognised as a Rayleigh quotient. A standard result for a positive semidefinite matrix such as XTX is that the quotient's maximum possible value is the largest eigenvalue of the matrix, which occurs when w is the corresponding eigenvector. With w(1) found, the first principal component of a data vector

    Read more →