Information history may refer to the history of each of the categories listed below (or to combinations of them). It should be recognized that the understanding of, for example, libraries as information systems only goes back to about 1950. The application of the term information for earlier systems or societies is a retronym. == Academic discipline == Information history is an emerging discipline related to, but broader than, library history. An important introduction and review was made by Alistair Black (2006). A prolific scholar in this field is also Toni Weller, for example, Weller (2007, 2008, 2010a and 2010b). As part of her work Toni Weller has argued that there are important links between the modern information age and its historical precedents. A description from Russia is Volodin (2000). Alistair Black (2006, p. 445) wrote: "This chapter explores issues of discipline definition and legitimacy by segmenting information history into its various components: The history of print and written culture, including relatively long-established areas such as the histories of libraries and librarianship, book history, publishing history, and the history of reading. The history of more recent information disciplines and practice, that is to say, the history of information management, information systems, and information science. The history of contiguous areas, such as the history of the information society and information infrastructure, necessarily enveloping communication history (including telecommunications history) and the history of information policy. The history of information as social history, with emphasis on the importance of informal information networks." "Bodies influential in the field include the American Library Association’s Round Table on Library History, the Library History Section of the International Federation of Library Associations and Institutions (IFLA), and, in the U.K., the Library and Information History Group of the Chartered Institute of Library and Information Professionals (CILIP). Each of these bodies has been busy in recent years, running conferences and seminars, and initiating scholarly projects. Active library history groups function in many other countries, including Germany (The Wolfenbuttel Round Table on Library History, the History of the Book and the History of Media, located at the Herzog August Bibliothek), Denmark (The Danish Society for Library History, located at the Royal School of Library and Information Science), Finland (The Library History Research Group, University of Tamepere), and Norway (The Norwegian Society for Book and Library History). Sweden has no official group dedicated to the subject, but interest is generated by the existence of a museum of librarianship in Bods, established by the Library Museum Society and directed by Magnus Torstensson. Activity in Argentina, where, as in Europe and the U.S., a "new library history" has developed, is described by Parada (2004)." (Black (2006, p. 447). === Journals === Information & Culture (previously Libraries & the Cultural Record, Libraries & Culture) Library & Information History (until 2008: Library History; until 1967: Library Association. Library History Group. Newsletter) == Information technology (IT) == The term IT is ambiguous although mostly synonym with computer technology. Haigh (2011, pp. 432-433) wrote "In fact, the great majority of references to information technology have always been concerned with computers, although the exact meaning has shifted over time (Kline, 2006). The phrase received its first prominent usage in a Harvard Business Review article (Haigh, 2001b; Leavitt & Whisler, 1958) intended to promote a technocratic vision for the future of business management. Its initial definition was at the conjunction of computers, operations research methods, and simulation techniques. Having failed initially to gain much traction (unlike related terms of a similar vintage such as information systems, information processing, and information science) it was revived in policy and economic circles in the 1970s with a new meaning. Information technology now described the expected convergence of the computing, media, and telecommunications industries (and their technologies), understood within the broader context of a wave of enthusiasm for the computer revolution, post-industrial society, information society (Webster, 1995), and other fashionable expressions of the belief that new electronic technologies were bringing a profound rupture with the past. As it spread broadly during the 1980s, IT increasingly lost its association with communications (and, alas, any vestigial connection to the idea of anybody actually being informed of anything) to become a new and more pretentious way of saying "computer". The final step in this process is the recent surge in references to "information and communication technologies" or ICTs, a coinage that makes sense only if one assumes that a technology can inform without communicating". Some people use the term information technology about technologies used before the development of the computer. This is however to use the term as a retronym. =
Avid Symphony
Avid Symphony is non-linear editing software aimed at professionals in the film and television industry. It is available for Microsoft Windows PCs and Apple Macintosh platforms. Symphony is Avid's high end SD/HD finishing platform for long form work, such as documentary and episodic TV. Its interface is based on the same look and feature set as the Media Composer and Xpress systems, but contains the highest level of features and resolution including secondary color correction, uncompressed HD, and higher real-time performance. == Release history == Symphony is the software component of a tightly integrated package that includes specific hardware audio/video interfaces, storage, and the computer, also sold by Avid. Its release history is therefore tightly related to the release of new Avid interface hardware: Symphony was introduced to the market in 1998. It was based on Avid's Meridien hardware, supporting SD only, and was available first only for the PC and later for the Macintosh platforms. Its last release was 5.0.5 which supported Windows 2000 and Mac OS X v10.2. The next major upgrade was Symphony Nitris in 2005, with a redesigned software and integration with the Nitris DNA hardware (PCI-X). It supported 8 bit and 10 bit SD and HD resolutions in both compressed and uncompressed forms, the MXF format and DNxHD codec, and ran only on Windows PC platforms. Symphony Nitris DX, released in 2008, added support for a range of HD codecs, including HDV, XDCAM-HD, DVCPRO HD, and AVC-I, and brought back Mac OS support for OS X 10.5, as well as Windows Vista. Since the introduction of Symphony 6, it can be used in software-only mode (where a Nitris or Nitris DX BOB used to be required), and at the same time, like Media Composer, Symphony was opened up with "Open I/O", allowing users to have Symphony use their third party hardware from companies like AJA, Matrox, BlueFish, Blackmagic Design and MOTU. The last remaining features that differentiate it from Media Composer are Advanced Color Correction (channels, secondary color correction,), Relational Color Correction (corrections based on common clip name, tape name, program track) and Universal HD Mastering (only with Nitris DX hardware). The latter allows cross-conversions of 23.976p or 24p projects sequences to most any other format during Digital Cut. In 2013, Avid announced it would no longer offer Symphony a standalone product. Starting version 7, Symphony will be sold as an option to Media Composer. This optional package (sold at a premium) will contain all the traditional Symphony-only features to any Media Composer install. == Use in movies == The Celibacy, Director: Horacio Bocaranda Avid Media Composer 6 and Avid Symphony 6 Nitris DX American Hardcore, Director: Paul Rachman Avid Xpress Pro and Symphony Summercamp!, Director: Spike Lee Avid Xpress Pro and Symphony When the Levees Broke Avid Media Composer and Symphony Nitris Superman Returns Edited with Mac-based Film Composer XL, but HD screenings prepped with Symphony
Triplet loss
Triplet loss is a machine learning loss function widely used in one-shot learning, a setting where models are trained to generalize effectively from limited examples. It was conceived by Google researchers for their prominent FaceNet algorithm for face detection. Triplet loss is designed to support metric learning. Namely, to assist training models to learn an embedding (mapping to a feature space) where similar data points are closer together and dissimilar ones are farther apart, enabling robust discrimination across varied conditions. In the context of face detection, data points correspond to images. == Definition == The loss function is defined using triplets of training points of the form ( A , P , N ) {\displaystyle (A,P,N)} . In each triplet, A {\displaystyle A} (called an "anchor point") denotes a reference point of a particular identity, P {\displaystyle P} (called a "positive point") denotes another point of the same identity in point A {\displaystyle A} , and N {\displaystyle N} (called a "negative point") denotes a point of an identity different from the identity in point A {\displaystyle A} and P {\displaystyle P} . Let x {\displaystyle x} be some point and let f ( x ) {\displaystyle f(x)} be the embedding of x {\displaystyle x} in the finite-dimensional Euclidean space. It shall be assumed that the L2-norm of f ( x ) {\displaystyle f(x)} is unity (the L2 norm of a vector X {\displaystyle X} in a finite dimensional Euclidean space is denoted by ‖ X ‖ {\displaystyle \Vert X\Vert } .) We assemble m {\displaystyle m} triplets of points from the training dataset. The goal of training here is to ensure that, after learning, the following condition (called the "triplet constraint") is satisfied by all triplets ( A ( i ) , P ( i ) , N ( i ) ) {\displaystyle (A^{(i)},P^{(i)},N^{(i)})} in the training data set: ‖ f ( A ( i ) ) − f ( P ( i ) ) ‖ 2 2 + α < ‖ f ( A ( i ) ) − f ( N ( i ) ) ‖ 2 2 {\displaystyle \Vert f(A^{(i)})-f(P^{(i)})\Vert _{2}^{2}+\alpha <\Vert f(A^{(i)})-f(N^{(i)})\Vert _{2}^{2}} The variable α {\displaystyle \alpha } is a hyperparameter called the margin, and its value must be set manually. In the FaceNet system, its value was set as 0.2. Thus, the full form of the function to be minimized is the following: L = ∑ i = 1 m max ( ‖ f ( A ( i ) ) − f ( P ( i ) ) ‖ 2 2 − ‖ f ( A ( i ) ) − f ( N ( i ) ) ‖ 2 2 + α , 0 ) {\displaystyle L=\sum _{i=1}^{m}\max {\Big (}\Vert f(A^{(i)})-f(P^{(i)})\Vert _{2}^{2}-\Vert f(A^{(i)})-f(N^{(i)})\Vert _{2}^{2}+\alpha ,0{\Big )}} == Intuition == A baseline for understanding the effectiveness of triplet loss is the contrastive loss, which operates on pairs of samples (rather than triplets). Training with the contrastive loss pulls embeddings of similar pairs closer together, and pushes dissimilar pairs apart. Its pairwise approach is greedy, as it considers each pair in isolation. Triplet loss innovates by considering relative distances. Its goal is that the embedding of an anchor (query) point be closer to positive points than to negative points (also accounting for the margin). It does not try to further optimize the distances once this requirement is met. This is approximated by simultaneously considering two pairs (anchor-positive and anchor-negative), rather than each pair in isolation. == Triplet "mining" == One crucial implementation detail when training with triplet loss is triplet "mining", which focuses on the smart selection of triplets for optimization. This process adds an additional layer of complexity compared to contrastive loss. A naive approach to preparing training data for the triplet loss involves randomly selecting triplets from the dataset. In general, the set of valid triplets of the form ( A ( i ) , P ( i ) , N ( i ) ) {\displaystyle (A^{(i)},P^{(i)},N^{(i)})} is very large. To speed-up training convergence, it is essential to focus on challenging triplets. In the FaceNet paper, several options were explored, eventually arriving at the following. For each anchor-positive pair, the algorithm considers only semi-hard negatives. These are negatives that violate the triplet requirement (i.e, are "hard"), but lie farther from the anchor than the positive (not too hard). Restated, for each A ( i ) {\displaystyle A^{(i)}} and P ( i ) {\displaystyle P^{(i)}} , they seek N ( i ) {\displaystyle N^{(i)}} such that: The rationale for this design choice is heuristic. It may appear puzzling that the mining process neglects "very hard" negatives (i.e., closer to the anchor than the positive). Experiments conducted by the FaceNet designers found that this often leads to a convergence to degenerate local minima. Triplet mining is performed at each training step, from within the sample points contained in the training batch (this is known as online mining), after embeddings were computed for all points in the batch. While ideally the entire dataset could be used, this is impractical in general. To support a large search space for triplets, the FaceNet authors used very large batches (1800 samples). Batches are constructed by selecting a large number of same-category sample points (40), and randomly selected negatives for them. == Extensions == Triplet loss has been extended to simultaneously maintain a series of distance orders by optimizing a continuous relevance degree with a chain (i.e., ladder) of distance inequalities. This leads to the Ladder Loss, which has been demonstrated to offer performance enhancements of visual-semantic embedding in learning to rank tasks. In Natural Language Processing, triplet loss is one of the loss functions considered for BERT fine-tuning in the SBERT architecture. Other extensions involve specifying multiple negatives (multiple negatives ranking loss).
Mean squared error
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the true value. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate. In machine learning, specifically empirical risk minimization, MSE may refer to the empirical risk (the average loss on an observed data set), as an estimate of the true MSE (the true risk: the average loss on the actual population distribution). The MSE is a measure of the quality of an estimator. As it is derived from the square of Euclidean distance, it is always a positive value that decreases as the error approaches zero. The MSE is the second moment (about the origin) of the error, and thus incorporates both the variance of the estimator (how widely spread the estimates are from one data sample to another) and its bias (how far off the average estimated value is from the true value). For an unbiased estimator, the MSE is the variance of the estimator. Like the variance, MSE has the same units of measurement as the square of the quantity being estimated. In an analogy to standard deviation, taking the square root of MSE yields the root-mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard error. == Definition and basic properties == The MSE either assesses the quality of a predictor (i.e., a function mapping arbitrary inputs to a sample of values of some random variable), or of an estimator (i.e., a mathematical function mapping a sample of data to an estimate of a parameter of the population from which the data is sampled). In the context of prediction, understanding the prediction interval can also be useful as it provides a range within which a future observation will fall, with a certain probability. The definition of an MSE differs according to whether one is describing a predictor or an estimator. === Predictor === If a vector of n {\displaystyle n} predictions is generated from a sample of n {\displaystyle n} data points on all variables, and Y {\displaystyle Y} is the vector of observed values of the variable being predicted, with Y ^ {\displaystyle {\hat {Y}}} being the predicted values (e.g. as from a least-squares fit), then the within-sample MSE of the predictor is computed as MSE = 1 n ∑ i = 1 n ( Y i − Y i ^ ) 2 {\displaystyle \operatorname {MSE} ={\frac {1}{n}}\sum _{i=1}^{n}\left(Y_{i}-{\hat {Y_{i}}}\right)^{2}} In other words, the MSE is the mean ( 1 n ∑ i = 1 n ) {\textstyle \left({\frac {1}{n}}\sum _{i=1}^{n}\right)} of the squares of the errors ( Y i − Y i ^ ) 2 {\textstyle \left(Y_{i}-{\hat {Y_{i}}}\right)^{2}} . This is an easily computable quantity for a particular sample (and hence is sample-dependent). In matrix notation, MSE = 1 n ∑ i = 1 n ( e i ) 2 = 1 n e T e {\displaystyle \operatorname {MSE} ={\frac {1}{n}}\sum _{i=1}^{n}(e_{i})^{2}={\frac {1}{n}}\mathbf {e} ^{\mathsf {T}}\mathbf {e} } where e i {\displaystyle e_{i}} is Y i − Y i ^ {\displaystyle Y_{i}-{\hat {Y_{i}}}} and e {\displaystyle \mathbf {e} } is a n × 1 {\displaystyle n\times 1} column vector. The MSE can also be computed on q data points that were not used in estimating the model, either because they were held back for this purpose, or because these data have been newly obtained. Within this process, known as cross-validation, the MSE is often called the test MSE, and is computed as MSE = 1 q ∑ i = n + 1 n + q ( Y i − Y i ^ ) 2 {\displaystyle \operatorname {MSE} ={\frac {1}{q}}\sum _{i=n+1}^{n+q}\left(Y_{i}-{\hat {Y_{i}}}\right)^{2}} === Estimator === The MSE of an estimator θ ^ {\displaystyle {\hat {\theta }}} with respect to an unknown parameter θ {\displaystyle \theta } is defined as MSE ( θ ^ ) = E θ [ ( θ ^ − θ ) 2 ] . {\displaystyle \operatorname {MSE} ({\hat {\theta }})=\operatorname {E} _{\theta }\left[({\hat {\theta }}-\theta )^{2}\right].} This definition depends on the unknown parameter, therefore the MSE is a priori property of an estimator. The MSE could be a function of unknown parameters, in which case any estimator of the MSE based on estimates of these parameters would be a function of the data (and thus a random variable). If the estimator θ ^ {\displaystyle {\hat {\theta }}} is derived as a sample statistic and is used to estimate some population parameter, then the expectation is with respect to the sampling distribution of the sample statistic. The MSE can be written as the sum of the variance of the estimator and the squared bias of the estimator, providing a useful way to calculate the MSE and implying that in the case of unbiased estimators, the MSE and variance are equivalent. MSE ( θ ^ ) = Var θ ( θ ^ ) + Bias ( θ ^ , θ ) 2 . {\displaystyle \operatorname {MSE} ({\hat {\theta }})=\operatorname {Var} _{\theta }({\hat {\theta }})+\operatorname {Bias} ({\hat {\theta }},\theta )^{2}.} ==== Proof of variance and bias relationship ==== MSE ( θ ^ ) = E θ [ ( θ ^ − θ ) 2 ] = E θ [ ( θ ^ − E θ [ θ ^ ] + E θ [ θ ^ ] − θ ) 2 ] = E θ [ ( θ ^ − E θ [ θ ^ ] ) 2 + 2 ( θ ^ − E θ [ θ ^ ] ) ( E θ [ θ ^ ] − θ ) + ( E θ [ θ ^ ] − θ ) 2 ] = E θ [ ( θ ^ − E θ [ θ ^ ] ) 2 ] + E θ [ 2 ( θ ^ − E θ [ θ ^ ] ) ( E θ [ θ ^ ] − θ ) ] + E θ [ ( E θ [ θ ^ ] − θ ) 2 ] = E θ [ ( θ ^ − E θ [ θ ^ ] ) 2 ] + 2 ( E θ [ θ ^ ] − θ ) E θ [ θ ^ − E θ [ θ ^ ] ] + ( E θ [ θ ^ ] − θ ) 2 E θ [ θ ^ ] − θ = constant = E θ [ ( θ ^ − E θ [ θ ^ ] ) 2 ] + 2 ( E θ [ θ ^ ] − θ ) ( E θ [ θ ^ ] − E θ [ θ ^ ] ) + ( E θ [ θ ^ ] − θ ) 2 E θ [ θ ^ ] = constant = E θ [ ( θ ^ − E θ [ θ ^ ] ) 2 ] + ( E θ [ θ ^ ] − θ ) 2 = Var θ ( θ ^ ) + Bias θ ( θ ^ , θ ) 2 {\displaystyle {\begin{aligned}\operatorname {MSE} ({\hat {\theta }})&=\operatorname {E} _{\theta }\left[({\hat {\theta }}-\theta )^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]+\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}+2\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+\operatorname {E} _{\theta }\left[2\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)\right]+\operatorname {E} _{\theta }\left[\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\right]\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+2\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)\operatorname {E} _{\theta }\left[{\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right]+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}&&\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta ={\text{constant}}\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+2\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}&&\operatorname {E} _{\theta }[{\hat {\theta }}]={\text{constant}}\\&=\operatorname {E} _{\theta }\left[\left({\hat {\theta }}-\operatorname {E} _{\theta }[{\hat {\theta }}]\right)^{2}\right]+\left(\operatorname {E} _{\theta }[{\hat {\theta }}]-\theta \right)^{2}\\&=\operatorname {Var} _{\theta }({\hat {\theta }})+\operatorname {Bias} _{\theta }({\hat {\theta }},\theta )^{2}\end{aligned}}} An even shorter proof can be achieved using the well-known formula that for a random variable X {\textstyle X} , E ( X 2 ) = Var ( X ) + ( E ( X ) ) 2 {\textstyle \mathbb {E} (X^{2})=\operatorname {Var} (X)+(\mathbb {E} (X))^{2}} . By substituting X {\textstyle X} with, θ ^ − θ {\textstyle {\hat {\theta }}-\theta } , we have MSE ( θ ^ ) = E [ ( θ ^ − θ ) 2 ] = Var ( θ ^ − θ ) + ( E [ θ ^ − θ ] ) 2 = Var ( θ ^ ) + Bias 2 ( θ ^ , θ ) {\displaystyle {\begin{aligned}\operatorname {MSE} ({\hat {\theta }})&=\mathbb {E} [({\hat {\theta }}-\theta )^{2}]\\&=\operator
Tensor sketch
In statistics, machine learning and algorithms, a tensor sketch is a type of dimensionality reduction that is particularly efficient when applied to vectors that have tensor structure. Such a sketch can be used to speed up explicit kernel methods, bilinear pooling in neural networks and is a cornerstone in many numerical linear algebra algorithms. == Mathematical definition == Mathematically, a dimensionality reduction or sketching matrix is a matrix M ∈ R k × d {\displaystyle M\in \mathbb {R} ^{k\times d}} , where k < d {\displaystyle k Mobile Passport Control (MPC) is a mobile app that enables eligible travelers entering the United States to submit their passport information and customs declaration form to Customs and Border Protection via smartphone or tablet and go through the inspections process using an expedited lane. It is available to "U.S. citizens, U.S. lawful permanent residents, Canadian B1/B2 citizen visitors and returning Visa Waiver Program travelers with approved ESTA". The app is available on iOS and Android devices and is operational at 34 US airports, 14 international airports offering preclearance facilities, and 4 seaports. The use of Mobile Passport Control operations have increased threefold from 2016 to 2017. == History == Mobile Passport Control operations were launched in Atlanta at the Hartsfield-Jackson International Airport in 2016 and is now available at 34 U.S. airports, 14 international airports that offer preclearance and 4 U.S. cruise ports. The Mobile Passport app is authorized by CBP and sponsored by the Airports Council International-North America, Boeing, and the Port of Everglades. Airside Mobile, Inc. secured a Series A funding of $6 million in the fall of 2017. == How it works == During the customs process at the Federal Inspection Service (FIS) area of a U.S. airport, travelers arriving from international locations typically wait in long lines before presenting passports and paperwork and verbally answering questions made by CBP officials. Eligible travelers who have downloaded the Mobile Passport app can expedite this process by submitting information regarding their passport and trip details, and a newly-taken selfie, via their mobile device to CBP officials, then access an expedited line. Mobile Passport Control users will be required to show their physical passport(s) and briefly talk to a CBP officer. == Locations == === US airports === Atlanta (ATL) Baltimore (BWI) Boston (BOS) Charlotte (CLT) Chicago (ORD) Dallas/Ft Worth (DFW) Denver (DEN) Detroit (DTW) as of 7/2024 Ft. Lauderdale (FLL) Honolulu (HNL) Houston (HOU and IAH) Kansas City (MCI) Las Vegas (LAS) Los Angeles (LAX) Miami (MIA) Minneapolis (MSP) New York (JFK) Newark (EWR) Oakland (OAK) Orlando (MCO) Palm Beach (PBI) Philadelphia (PHL) Phoenix (PHX) Pittsburgh (PIT) Portland (PDX) Sacramento (SMF) San Diego (SAN) San Francisco (SFO) San Jose (SJC) San Juan (SJU) Seattle (SEA) Tampa (TPA) Washington Dulles (IAD) === International Preclearance locations === Abu Dhabi (AUH) Aruba (AUA) Bermuda (BDA) Calgary (YYC) Dublin (DUB) Edmonton (YEG) Halifax (YHZ) Montreal (YUL) Nassau (NAS) Ottawa (YOW) Shannon (SNN) Toronto (YYZ) Vancouver (YVR) Winnipeg (YWG) Sepinggan (BPN) === Seaports === Fort Lauderdale (PEV) Miami (MSE) San Juan (PUE) West Palm Beach (WPB) Correspondence analysis (CA) is a multivariate statistical technique proposed by Herman Otto Hartley (Hirschfeld) and later developed by Jean-Paul Benzécri. It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data. In a manner similar to principal component analysis, it provides a means of displaying or summarising a set of data in two-dimensional graphical form. Its aim is to display in a biplot any structure hidden in the multivariate setting of the data table. As such it is a technique from the field of multivariate ordination. Since the variant of CA described here can be applied either with a focus on the rows or on the columns it should in fact be called simple (symmetric) correspondence analysis. It is traditionally applied to the contingency table of a pair of nominal variables where each cell contains either a count or a zero value. If more than two categorical variables are to be summarized, a variant called multiple correspondence analysis should be chosen instead. CA may also be applied to binary data given the presence/absence coding represents simplified count data i.e. a 1 describes a positive count and 0 stands for a count of zero. Depending on the scores used CA preserves the chi-square distance between either the rows or the columns of the table. Because CA is a descriptive technique, it can be applied to tables regardless of a significant chi-squared test. Although the χ 2 {\displaystyle \chi ^{2}} statistic used in inferential statistics and the chi-square distance are computationally related they should not be confused since the latter works as a multivariate statistical distance measure in CA while the χ 2 {\displaystyle \chi ^{2}} statistic is in fact a scalar not a metric. == Details == Like principal components analysis, correspondence analysis creates orthogonal components (or axes) and, for each item in a table i.e. for each row, a set of scores (sometimes called factor scores, see Factor analysis). Correspondence analysis is performed on the data table, conceived as matrix C of size m × n where m is the number of rows and n is the number of columns. In the following mathematical description of the method capital letters in italics refer to a matrix while letters in italics refer to vectors. Understanding the following computations requires knowledge of matrix algebra. === Preprocessing === Before proceeding to the central computational step of the algorithm, the values in matrix C have to be transformed. First compute a set of weights for the columns and the rows (sometimes called masses), where row and column weights are given by the row and column vectors, respectively: w m = 1 n C C 1 , w n = 1 n C 1 T C . {\displaystyle w_{m}={\frac {1}{n_{C}}}C\mathbf {1} ,\quad w_{n}={\frac {1}{n_{C}}}\mathbf {1} ^{T}C.} Here n C = ∑ i = 1 n ∑ j = 1 m C i j {\displaystyle n_{C}=\sum _{i=1}^{n}\sum _{j=1}^{m}C_{ij}} is the sum of all cell values in matrix C, or short the sum of C, and 1 {\displaystyle \mathbf {1} } is a column vector of ones with the appropriate dimension. Put in simple words, w m {\displaystyle w_{m}} is just a vector whose elements are the row sums of C divided by the sum of C, and w n {\displaystyle w_{n}} is a vector whose elements are the column sums of C divided by the sum of C. The weights are transformed into diagonal matrices W m = diag ( 1 / w m ) {\displaystyle W_{m}=\operatorname {diag} (1/{\sqrt {w_{m}}})} and W n = diag ( 1 / w n ) {\displaystyle W_{n}=\operatorname {diag} (1/{\sqrt {w_{n}}})} where the diagonal elements of W n {\displaystyle W_{n}} are 1 / w n {\displaystyle 1/{\sqrt {w_{n}}}} and those of W m {\displaystyle W_{m}} are 1 / w m {\displaystyle 1/{\sqrt {w_{m}}}} respectively i.e. the vector elements are the inverses of the square roots of the masses. The off-diagonal elements are all 0. Next, compute matrix P {\displaystyle P} by dividing C {\displaystyle C} by its sum P = 1 n C C . {\displaystyle P={\frac {1}{n_{C}}}C.} In simple words, Matrix P {\displaystyle P} is just the data matrix (contingency table or binary table) transformed into portions i.e. each cell value is just the cell portion of the sum of the whole table. Finally, compute matrix S {\displaystyle S} , sometimes called the matrix of standardized residuals, by matrix multiplication as S = W m ( P − w m w n ) W n {\displaystyle S=W_{m}(P-w_{m}w_{n})W_{n}} Note, the vectors w m {\displaystyle w_{m}} and w n {\displaystyle w_{n}} are combined in an outer product resulting in a matrix of the same dimensions as P {\displaystyle P} . In words the formula reads: matrix outer ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} is subtracted from matrix P {\displaystyle P} and the resulting matrix is scaled (weighted) by the diagonal matrices W m {\displaystyle W_{m}} and W n {\displaystyle W_{n}} . Multiplying the resulting matrix by the diagonal matrices is equivalent to multiply the i-th row (or column) of it by the i-th element of the diagonal of W m {\displaystyle W_{m}} or W n {\displaystyle W_{n}} , respectively. === Interpretation of preprocessing === The vectors w m {\displaystyle w_{m}} and w n {\displaystyle w_{n}} are the row and column masses or the marginal probabilities for the rows and columns, respectively. Subtracting matrix outer ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} from matrix P {\displaystyle P} is the matrix algebra version of double centering the data. Multiplying this difference by the diagonal weighting matrices results in a matrix containing weighted deviations from the origin of a vector space. This origin is defined by matrix outer ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} . In fact matrix outer ( w m , w n ) {\displaystyle \operatorname {outer} (w_{m},w_{n})} is identical with the matrix of expected frequencies in the chi-squared test. Therefore S {\displaystyle S} is computationally related to the independence model used in that test. But since CA is not an inferential method the term independence model is inappropriate here. === Orthogonal components === The table S {\displaystyle S} is then decomposed by a singular value decomposition as S = U Σ V ∗ {\displaystyle S=U\Sigma V^{}\,} where U {\displaystyle U} and V {\displaystyle V} are the left and right singular vectors of S {\displaystyle S} and Σ {\displaystyle \Sigma } is a square diagonal matrix with the singular values σ i {\displaystyle \sigma _{i}} of S {\displaystyle S} on the diagonal. Σ {\displaystyle \Sigma } is of dimension p ≤ ( min ( m , n ) − 1 ) {\displaystyle p\leq (\min(m,n)-1)} hence U {\displaystyle U} is of dimension m×p and V {\displaystyle V} is of n×p. As orthonormal vectors U {\displaystyle U} and V {\displaystyle V} fulfill U ∗ U = V ∗ V = I {\displaystyle U^{}U=V^{}V=I} . In other words, the multivariate information that is contained in C {\displaystyle C} as well as in S {\displaystyle S} is now distributed across two (coordinate) matrices U {\displaystyle U} and V {\displaystyle V} and a diagonal (scaling) matrix Σ {\displaystyle \Sigma } . The vector space defined by them has as number of dimensions p, that is the smaller of the two values, number of rows and number of columns, minus 1. === Inertia === While a principal component analysis may be said to decompose the (co)variance, and hence its measure of success is the amount of (co-)variance covered by the first few PCA axes - measured in eigenvalue -, a CA works with a weighted (co-)variance which is called inertia. The sum of the squared singular values is the total inertia I {\displaystyle \mathrm {I} } of the data table, computed as I = ∑ i = 1 p σ i 2 . {\displaystyle \mathrm {I} =\sum _{i=1}^{p}\sigma _{i}^{2}.} The total inertia I {\displaystyle \mathrm {I} } of the data table can also computed directly from S {\displaystyle S} as I = ∑ i = 1 n ∑ j = 1 m s i j 2 . {\displaystyle \mathrm {I} =\sum _{i=1}^{n}\sum _{j=1}^{m}s_{ij}^{2}.} The amount of inertia covered by the i-th set of singular vectors is ι i {\displaystyle \iota _{i}} , the principal inertia. The higher the portion of inertia covered by the first few singular vectors i.e. the larger the sum of the principal inertiae in comparison to the total inertia, the more successful a CA is. Therefore, all principal inertia values are expressed as portion ϵ i {\displaystyle \epsilon _{i}} of the total inertia ϵ i = σ i 2 / ∑ i = 1 p σ i 2 {\displaystyle \epsilon _{i}=\sigma _{i}^{2}/\sum _{i=1}^{p}\sigma _{i}^{2}} and are presented in the form of a scree plot. In fact a scree plot is just a bar plot of all principal inertia portions ϵ i {\displaystyle \epsilon _{i}} . === Coordinates === To transform the singular vectors to coordinates which preserve the chi-square distances between rows or columns an additional weighting step is necessary. The resulting coordinates are called principal coordinates in CA text books. If principal coordinates are used for Mobile Passport Control
Correspondence analysis