The Data Reference Model (DRM) is one of the five reference models of the Federal Enterprise Architecture. == Overview == The DRM is a framework whose primary purpose is to enable information sharing and reuse across the United States federal government via the standard description and discovery of common data and the promotion of uniform data management practices. The DRM describes artifacts which can be generated from the data architectures of federal government agencies. The DRM provides a flexible and standards-based approach to accomplish its purpose. The scope of the DRM is broad, as it may be applied within a single agency, within a community of interest, or across communities of interest. == Data Reference Model topics == === DRM structure === The DRM provides a standard means by which data may be described, categorized, and shared. These are reflected within each of the DRM's three standardization areas: Data Description: Provides a means to uniformly describe data, thereby supporting its discovery and sharing. Data Context: Facilitates discovery of data through an approach to the categorization of data according to taxonomies. Additionally, enables the definition of authoritative data assets within a community of interest. Data Sharing: Supports the access and exchange of data, where access consists of ad hoc requests (such as a query of a data asset) and exchange consists of fixed, recurring transactions between parties. Data Sharing is enabled by capabilities provided by both the Data Context and Data Description standardization areas. === DRM Version 2 === The Data Reference Model version 2, released in November 2005, is a 114-page document with detailed architectural diagrams and an extensive glossary of terms. The DRM also makes many references to ISO standards, specifically the ISO/IEC 11179 metadata registry standard.
=== DRM usage === Although the DRM is not a published technical interoperability standard in the way that web services standards are, it is an excellent starting point for data architects within federal and state agencies. Any federal or state agency that exchanges information with other agencies or that is involved in data warehousing efforts should use this document as a guide.
NumPy
NumPy (pronounced NUM-py) is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The predecessor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. NumPy is open-source software and has many contributors. NumPy is fiscally sponsored by NumFOCUS. == History == === matrix-sig === The Python programming language was not originally designed for numerical computing, but attracted the attention of the scientific and engineering community early on. In 1995 the special interest group (SIG) matrix-sig was founded with the aim of defining an array computing package; among its members was Python designer and maintainer Guido van Rossum, who extended Python's syntax (in particular the indexing syntax) to make array computing easier. === Numeric === An implementation of a matrix package was completed by Jim Fulton, then expanded to support multi-dimensional arrays by Jim Hugunin and called Numeric (also variously known as the "Numerical Python extensions" or "NumPy"), with influences from the APL family of languages, Basis, MATLAB, FORTRAN, S and S+, and others. Hugunin, a graduate student at the Massachusetts Institute of Technology (MIT), joined the Corporation for National Research Initiatives (CNRI) in 1997 to work on JPython, leaving Paul Dubois of Lawrence Livermore National Laboratory (LLNL) to take over as maintainer. Other early contributors include David Ascher, Konrad Hinsen and Travis Oliphant. === Numarray === A new package called Numarray was written as a more flexible replacement for Numeric. Like Numeric, it too is now deprecated. 
Numarray had faster operations for large arrays, but was slower than Numeric on small ones, so for a time both packages were used in parallel for different use cases. The last version of Numeric (v24.2) was released on 11 November 2005, while the last version of numarray (v1.5.2) was released on 24 August 2006. There was a desire to get Numeric into the Python standard library, but Guido van Rossum decided that the code was not maintainable in its state then. === NumPy === In early 2005, NumPy developer Travis Oliphant wanted to unify the community around a single array package and ported Numarray's features to Numeric, releasing the result as NumPy 1.0 in 2006. This new project was part of SciPy. To avoid installing the large SciPy package just to get an array object, this new package was separated and called NumPy. Support for Python 3 was added in 2011 with NumPy version 1.5.0. In 2011, PyPy started development on an implementation of the NumPy API for PyPy. As of 2023, it is not yet fully compatible with NumPy. == Features == NumPy targets the CPython reference implementation of Python, which is a non-optimizing bytecode interpreter. Mathematical algorithms written for this version of Python often run much slower than compiled equivalents due to the absence of compiler optimization. NumPy addresses the slowness problem partly by providing multidimensional arrays and functions and operators that operate efficiently on arrays; using these requires rewriting some code, mostly inner loops, using NumPy. Using NumPy in Python gives functionality comparable to MATLAB since they are both interpreted, and they both allow the user to write fast programs as long as most operations work on arrays or matrices instead of scalars. In comparison, MATLAB boasts a large number of additional toolboxes, notably Simulink, whereas NumPy is intrinsically integrated with Python, a more modern and complete programming language. 
Moreover, complementary Python packages are available; SciPy is a library that adds more MATLAB-like functionality and Matplotlib is a plotting package that provides MATLAB-like plotting functionality. Although MATLAB can perform sparse matrix operations, NumPy alone cannot perform such operations and requires the use of the scipy.sparse library. Internally, both MATLAB and NumPy rely on BLAS and LAPACK for efficient linear algebra computations. Python bindings of the widely used computer vision library OpenCV utilize NumPy arrays to store and operate on data. Since images with multiple channels are simply represented as three-dimensional arrays, indexing, slicing or masking with other arrays are very efficient ways to access specific pixels of an image. The NumPy array as universal data structure in OpenCV for images, extracted feature points, filter kernels and many more vastly simplifies the programming workflow and debugging. Importantly, many NumPy operations release the global interpreter lock, which allows for multithreaded processing. NumPy also provides a C API, which allows Python code to interoperate with external libraries written in low-level languages. === The ndarray data structure === The core functionality of NumPy is its "ndarray", for n-dimensional array, data structure. These arrays are strided views on memory. In contrast to Python's built-in list data structure, these arrays are homogeneously typed: all elements of a single array must be of the same type. Such arrays can also be views into memory buffers allocated by C/C++, Python, and Fortran extensions to the CPython interpreter without the need to copy data around, giving a degree of compatibility with existing numerical libraries. This functionality is exploited by the SciPy package, which wraps a number of such libraries (notably BLAS and LAPACK). NumPy has built-in support for memory-mapped ndarrays. 
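The properties of the ndarray described above (one element type per array, strided views, and zero-copy wrapping of existing buffers) can be illustrated with a minimal sketch. The array shapes and the use of a Python bytearray as a stand-in for memory owned by an external library are assumptions made for illustration only.

```python
import numpy as np

# A 3x4 array of 64-bit integers; every element shares one dtype.
a = np.arange(12, dtype=np.int64).reshape(3, 4)
print(a.dtype)      # int64
print(a.strides)    # (32, 8): bytes to step to the next row / next column

# Slicing returns a strided view on the same memory, not a copy.
every_other_col = a[:, ::2]
print(every_other_col.strides)                # (32, 16)
print(np.shares_memory(a, every_other_col))   # True

# Writing through the view changes the original array.
every_other_col[0, 0] = -1
print(a[0, 0])      # -1

# Wrapping an existing buffer without copying: a bytearray stands in
# here for memory allocated by an external C or Fortran library.
buf = bytearray(16)
view = np.frombuffer(buf, dtype=np.uint8)
buf[3] = 255
print(view[3])      # 255
```

Because the slice only changes the strides, no element data is moved; this is what makes indexing and slicing cheap compared with copying.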
=== Limitations === Inserting or appending entries to an array is not as trivially possible as it is with Python's lists. The np.pad(...) routine to extend arrays actually creates new arrays of the desired shape and padding values, copies the given array into the new one and returns it. NumPy's np.concatenate([a1,a2]) operation does not actually link the two arrays but returns a new one, filled with the entries from both given arrays in sequence. Reshaping the dimensionality of an array with np.reshape(...) is only possible as long as the number of elements in the array does not change. These circumstances originate from the fact that NumPy's arrays must be views on contiguous memory buffers. Algorithms that are not expressible as a vectorized operation will typically run slowly because they must be implemented in "pure Python", while vectorization may increase memory complexity of some operations from constant to linear, because temporary arrays must be created that are as large as the inputs. Runtime compilation of numerical code has been implemented by several groups to avoid these problems; open source solutions that interoperate with NumPy include numexpr and Numba. Cython and Pythran are static-compiling alternatives to these. Many modern large-scale scientific computing applications have requirements that exceed the capabilities of NumPy arrays. For example, NumPy arrays are usually loaded into a computer's memory, which might have insufficient capacity for the analysis of large datasets. Further, NumPy operations are executed on a single CPU. However, many linear algebra operations can be accelerated by executing them on clusters of CPUs or on specialized hardware, such as GPUs and TPUs, which many deep learning applications rely on. As a result, several alternative array implementations have arisen in the scientific Python ecosystem in recent years, such as Dask for distributed arrays and TensorFlow or JAX for computations on GPUs.
Because of NumPy's popularity, these often implement a subset of NumPy's API or mimic it, so that users can change their array implementation with minimal changes to their code. A library named CuPy, accelerated by Nvidia's CUDA framework, has also shown potential for faster computing, being a 'drop-in replacement' of NumPy. == Examples == NumPy is conventionally imported as np. === Basic operations === === Universal functions === === Linear algebra === === Multidimensional arrays === === Incorporation with OpenCV === === Nearest-neighbor search === Functional Python and vectorized NumPy version. === F2PY === Quickly wrap native code for faster scripts.
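Returning to the basic operations, universal functions, linear algebra, and nearest-neighbor search topics listed above, the following is a minimal sketch of each. The sample values and the helper names nearest_loop and nearest_vectorized are hypothetical and chosen only for illustration; they are not taken from the original examples.

```python
import numpy as np

# Basic operations: arithmetic acts element-wise on whole arrays.
x = np.array([1.0, 2.0, 3.0])
y = np.linspace(0.0, 1.0, 3)          # 0.0, 0.5, 1.0
z = x + 2 * y                          # 1.0, 3.0, 5.0

# Universal functions are applied element by element.
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))   # 1.0, 2.0, 3.0

# Linear algebra: solve A @ s = b.
A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([2.0, 8.0])
s = np.linalg.solve(A, b)              # 1.0, 2.0

# Nearest-neighbor search: pure-Python loop versus vectorized NumPy.
points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
query = np.array([0.9, 1.2])

def nearest_loop(points, query):
    """Index of the closest point, using explicit Python loops."""
    best_i, best_d = -1, float("inf")
    for i, p in enumerate(points):
        d = sum((pi - qi) ** 2 for pi, qi in zip(p, query))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

def nearest_vectorized(points, query):
    """Same result, with the inner loops replaced by array operations."""
    d2 = np.sum((points - query) ** 2, axis=1)   # squared distances
    return int(np.argmin(d2))

print(nearest_loop(points, query), nearest_vectorized(points, query))
```

Both functions return the same index; the vectorized version moves the per-element work out of the interpreter and into NumPy's compiled routines.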
Word2vec
Word2vec is a technique in natural language processing for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov, Kai Chen, Greg Corrado, Ilya Sutskever and Jeff Dean at Google, and published in 2013. Word2vec represents a word as a high-dimensional vector of numbers which captures relationships between words. In particular, words which appear in similar contexts are mapped to vectors which are nearby as measured by cosine similarity. This indicates the level of semantic similarity between the words, so for example the vectors for "walk" and "ran" are nearby, as are those for "but" and "however", and "Berlin" and "Germany". == Approach == Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a mapping of the set of words to a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a vector in the space. Word2vec can use either of two model architectures to produce these distributed representations of words: continuous bag of words (CBOW) or continuously sliding skip-gram. In both architectures, word2vec considers both individual words and a sliding context window as it iterates over the corpus. The CBOW can be viewed as a 'fill in the blank' task, where the word embedding represents the way the word influences the relative probabilities of other words in the context window.
Words which are semantically similar should influence these probabilities in similar ways, because semantically similar words should be used in similar contexts. The order of context words does not influence prediction (bag of words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors' note, CBOW is faster while skip-gram does a better job for infrequent words. After the model is trained, the learned word embeddings are positioned in the vector space such that words that share common contexts in the corpus (that is, words that are semantically and syntactically similar) are located close to one another in the space. More dissimilar words are located farther from one another in the space. == Mathematical details == This section is based on expositions. A corpus is a sequence of words. Both CBOW and skip-gram are methods to learn one vector per word appearing in the corpus. Let $V$ ("vocabulary") be the set of all words appearing in the corpus $C$. Our goal is to learn one vector $v_{w}\in \mathbb{R}^{d}$ for each word $w\in V$. The idea of skip-gram is that the vector of a word should be close to the vector of each of its neighbors. The idea of CBOW is that the vector-sum of a word's neighbors should be close to the vector of the word. === Continuous bag-of-words (CBOW) === The idea of CBOW is to represent each word with a vector, such that it is possible to predict a word using the sum of the vectors of its neighbors. Specifically, for each word $w_{i}$ in the corpus, the one-hot encoding of the word is used as the input to the neural network.
The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of $w_{i}$. The objective of training is to maximize $\sum_{i}\ln \Pr(w_{i}\mid w_{i+j}\colon j\in N)$ where $N$ is a set of (non-zero) indices representing the relative locations of nearby words considered to be in $w_{i}$'s neighborhood. For example, if we want each word in the corpus to be predicted by every other word in a small span of 4 words, the set of relative indexes of neighbor words will be $N=\{-2,-1,+1,+2\}$, and the objective is to maximize $\sum_{i}\ln \Pr(w_{i}\mid w_{i-2},w_{i-1},w_{i+1},w_{i+2})$. In standard bag-of-words, a word's context is represented by a word-count (aka a word histogram) of its neighboring words. For example, the "sat" in "the cat sat on the mat" is represented as {"the": 2, "cat": 1, "on": 1}. Note that the last word "mat" is not used to represent "sat", because it is outside the neighborhood $N=\{-2,-1,+1,+2\}$. In continuous bag-of-words, the histogram is multiplied by a matrix $V$ to obtain a continuous representation of the word's context. The matrix $V$ is also called a dictionary. Its columns are the word vectors. It has $D$ columns, where $D$ is the size of the dictionary. Let $d$ be the length of each word vector. We have $V\in \mathbb{R}^{d\times D}$. For example, multiplying the word histogram {"the": 2, "cat": 1, "on": 1} with $V$, we obtain $2v_{\text{the}}+v_{\text{cat}}+v_{\text{on}}$.
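The histogram-times-dictionary step can be sketched in a few lines of NumPy. The five-word vocabulary, the vector length of 3, and the randomly initialised matrix V are assumptions for illustration; a trained model would have learned V from a corpus.

```python
import numpy as np

# Hypothetical toy dictionary and randomly initialised word vectors.
vocab = ["the", "cat", "sat", "on", "mat"]
index = {w: k for k, w in enumerate(vocab)}
D, d = len(vocab), 3
rng = np.random.default_rng(0)
V = rng.normal(size=(d, D))           # columns are the word vectors

corpus = "the cat sat on the mat".split()
N = (-2, -1, +1, +2)                  # relative positions of neighbors
i = corpus.index("sat")

# Word histogram of the neighbors of "sat" as a length-D count vector.
hist = np.zeros(D)
for j in N:
    if 0 <= i + j < len(corpus):
        hist[index[corpus[i + j]]] += 1

# Multiplying by V gives the continuous representation of the context.
context = V @ hist
expected = 2 * V[:, index["the"]] + V[:, index["cat"]] + V[:, index["on"]]
print(np.allclose(context, expected))
```

The vector `context` is the weighted sum of neighbor vectors from the worked example, here with length $d = 3$.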
This is then multiplied with another matrix $V'$ of shape $\mathbb{R}^{D\times d}$. Each row of it is a word vector $v'$. This results in a vector of length $D$, one entry per dictionary entry. Then, apply the softmax to obtain a probability distribution over the dictionary. This system can be visualized as a neural network, similar in spirit to an autoencoder, of architecture linear-linear-softmax, as depicted in the diagram. The system is trained by gradient descent to minimize the cross-entropy loss. In full formula, the cross-entropy loss is: $-\sum_{i}\ln \frac{e^{v'_{w_{i}}\cdot (\sum_{j\in N}v_{w_{j+i}})}}{\sum_{w'}e^{v'_{w'}\cdot (\sum_{j\in N}v_{w_{j+i}})}}$ where the outer summation $\sum_{i}$ is over the words in a corpus, the quantity $\sum_{j\in N}v_{w_{j+i}}$ is the sum of a word's neighbors' vectors, etc. Once such a system is trained, we have two trained matrices $V,V'$. Either the column vectors of $V$ or the row vectors of $V'$ can serve as the dictionary. For example, the word "sat" can be represented as either the "sat"-th column of $V$ or the "sat"-th row of $V'$. It is also possible to simply define $V'=V^{\top}$, in which case there would no longer be a choice. === Skip-gram === The idea of skip-gram is to represent each word with a vector, such that it is possible to predict the vectors of its neighbors using the vector of a word. The architecture is still linear-linear-softmax, the same as CBOW, but the input and the output are switched. Specifically, for each word $w_{i}$ in the corpus, the one-hot encoding of the word is used as the input to the neural network.
The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of $w_{i}$. The objective of training is to maximize $\sum_{i}\sum_{j\in N}\ln \Pr(w_{j+i}\mid w_{i})$. In full formula, the loss function is $-\sum_{i}\sum_{j\in N}\ln \frac{e^{v'_{w_{j+i}}\cdot v_{w_{i}}}}{\sum_{w'}e^{v'_{w'}\cdot v_{w_{i}}}}$. Same as CBOW, once such a system is trained, we have two trained matrices $V,V'$. Either the column vectors of $V$ or the row vectors of $V'$ can serve as the dictionary. It is also possible to simply define $V'=V^{\top}$, in which case there would no longer be a choice. Essentially, skip-gram and CBOW are exactly the same in architecture. They only differ in the objective function during training. == History == During the 1980s, there were some early attempts at using neural networks to represent words and concepts as vectors. In 2010, Tomáš Mikolov (then at Brno University of Technology) with co-authors applied a simple recurrent neural network with a single hidden
Sum of absolute differences
In digital image processing, the sum of absolute differences (SAD) is a measure of the similarity between image blocks. It is calculated by taking the absolute difference between each pixel in the original block and the corresponding pixel in the block being used for comparison. These differences are summed to create a simple metric of block similarity, the L1 norm of the difference image or Manhattan distance between two image blocks. The sum of absolute differences may be used for a variety of purposes, such as object recognition, the generation of disparity maps for stereo images, and motion estimation for video compression. == Example == This example uses the sum of absolute differences to identify which part of a search image is most similar to a template image. In this example, the template image is 3 by 3 pixels in size, while the search image is 3 by 5 pixels in size. Each pixel is represented by a single integer from 0 to 9.

Template    Search image
2 5 5       2 7 5 8 6
4 0 7       1 7 4 2 7
7 5 9       8 4 6 8 5

There are exactly three unique locations within the search image where the template may fit: the left side of the image, the center of the image, and the right side of the image. To calculate the SAD values, the absolute value of the difference between each corresponding pair of pixels is used: the difference between 2 and 2 is 0, 4 and 1 is 3, 7 and 8 is 1, and so forth. Calculating the values of the absolute differences for each pixel, for the three possible template locations, gives the following:

Left     Center   Right
0 2 0    5 0 3    3 3 1
3 7 3    3 4 5    0 2 0
1 1 3    3 1 1    1 3 4

For each of these three image patches, the 9 absolute differences are added together, giving SAD values of 20, 25, and 17, respectively. From these SAD values, it could be asserted that the right side of the search image is the most similar to the template image, because it has the lowest sum of absolute differences as compared to the other two locations.
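The worked example can be reproduced with a short NumPy sketch. The function name sad_scores is hypothetical, and the sketch assumes, as in the example, that the template and the search image have the same height, so the template only slides horizontally.

```python
import numpy as np

template = np.array([[2, 5, 5],
                     [4, 0, 7],
                     [7, 5, 9]])
search = np.array([[2, 7, 5, 8, 6],
                   [1, 7, 4, 2, 7],
                   [8, 4, 6, 8, 5]])

def sad_scores(search, template):
    """SAD of the template at every horizontal offset in the search image."""
    h, w = template.shape
    offsets = range(search.shape[1] - w + 1)
    return [int(np.abs(search[:h, c:c + w] - template).sum()) for c in offsets]

scores = sad_scores(search, template)
best = int(np.argmin(scores))          # offset with the lowest SAD
print(scores, best)                    # [20, 25, 17] 2
```

Offset 2 corresponds to the right side of the search image, matching the conclusion of the example.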
== Comparison to other metrics == === Object recognition === The sum of absolute differences provides a simple way to automate the search for objects inside an image, but may be unreliable due to the effects of contextual factors such as changes in lighting, color, viewing direction, size, or shape. The SAD may be used in conjunction with other object recognition methods, such as edge detection, to improve the reliability of results. === Video compression === SAD is an extremely fast metric due to its simplicity; it is effectively the simplest possible metric that takes into account every pixel in a block. Therefore, it is very effective for a wide motion search of many different blocks. SAD is also easily parallelizable since it analyzes each pixel separately, making it easy to implement with instruction sets such as ARM NEON or x86 SSE2. For example, SSE has a packed sum of absolute differences instruction (PSADBW) specifically for this purpose. Once candidate blocks are found, the final refinement of the motion estimation process is often done with other slower but more accurate metrics, which better take into account human perception. These include the sum of absolute transformed differences (SATD), the sum of squared differences (SSD), and rate–distortion optimization.
Quadratic unconstrained binary optimization
Quadratic unconstrained binary optimization (QUBO), also known as unconstrained binary quadratic programming (UBQP), is a combinatorial optimization problem with a wide range of applications from finance and economics to machine learning. QUBO is an NP-hard problem, and for many classical problems from theoretical computer science, like maximum cut, graph coloring and the partition problem, embeddings into QUBO have been formulated. Embeddings for machine learning models include support-vector machines, clustering and probabilistic graphical models. Moreover, due to its close connection to Ising models, QUBO constitutes a central problem class for adiabatic quantum computation, where it is solved through a physical process called quantum annealing. == Definition == Let $\mathbb{B}=\lbrace 0,1\rbrace$ be the set of binary digits (or bits); then $\mathbb{B}^{n}$ is the set of binary vectors of fixed length $n\in \mathbb{N}$. Given a symmetric or upper triangular matrix $\boldsymbol{Q}\in \mathbb{R}^{n\times n}$, whose entries $Q_{ij}$ define a weight for each pair of indices $i,j\in \lbrace 1,\dots ,n\rbrace$, we can define the function $f_{\boldsymbol{Q}}:\mathbb{B}^{n}\rightarrow \mathbb{R}$ that assigns a value to each binary vector $\boldsymbol{x}$ through $f_{\boldsymbol{Q}}(\boldsymbol{x})=\boldsymbol{x}^{\intercal}\boldsymbol{Q}\boldsymbol{x}=\sum_{i=1}^{n}\sum_{j=1}^{n}Q_{ij}x_{i}x_{j}.$ Alternatively, the linear and quadratic parts can be separated as $f_{\boldsymbol{Q}',\boldsymbol{q}}(\boldsymbol{x})=\boldsymbol{x}^{\intercal}\boldsymbol{Q}'\boldsymbol{x}+\boldsymbol{q}^{\intercal}\boldsymbol{x},$ where $\boldsymbol{Q}'\in \mathbb{R}^{n\times n}$ and $\boldsymbol{q}\in \mathbb{R}^{n}$. This is equivalent to the previous definition through $\boldsymbol{Q}=\boldsymbol{Q}'+\operatorname{diag}[\boldsymbol{q}]$ using the diag operator, exploiting that $x=x\cdot x$ for all binary values $x$. Intuitively, the weight $Q_{ij}$ is added if both $x_{i}=1$ and $x_{j}=1$. The QUBO problem consists of finding a binary vector $\boldsymbol{x}^{*}$ that minimizes $f_{\boldsymbol{Q}}$, i.e., $\forall \boldsymbol{x}\in \mathbb{B}^{n}:~f_{\boldsymbol{Q}}(\boldsymbol{x}^{*})\leq f_{\boldsymbol{Q}}(\boldsymbol{x})$. In general, $\boldsymbol{x}^{*}$ is not unique, meaning there may be a set of minimizing vectors with equal value w.r.t. $f_{\boldsymbol{Q}}$. The complexity of QUBO arises from the number of candidate binary vectors to be evaluated, as $\left|\mathbb{B}^{n}\right|=2^{n}$ grows exponentially in $n$. Sometimes, QUBO is defined as the problem of maximizing $f_{\boldsymbol{Q}}$, which is equivalent to minimizing $f_{-\boldsymbol{Q}}=-f_{\boldsymbol{Q}}$.
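As an illustration of the definition, the following minimal sketch evaluates $f_{\boldsymbol{Q}}$ and finds a minimizer by enumerating all $2^{n}$ binary vectors, which is only practical for very small $n$. The function names and the 3 by 3 matrix are hypothetical and chosen arbitrarily for illustration.

```python
import itertools
import numpy as np

def qubo_value(Q, x):
    """Evaluate f_Q(x) = x^T Q x for a binary vector x."""
    return float(x @ Q @ x)

def solve_qubo_brute_force(Q):
    """Return a minimizer and its value by checking all 2^n binary vectors."""
    n = Q.shape[0]
    best_x, best_val = None, float("inf")
    for bits in itertools.product((0, 1), repeat=n):
        x = np.array(bits)
        val = qubo_value(Q, x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Illustrative upper triangular matrix (values chosen arbitrarily).
Q = np.array([[-1.0,  2.0,  0.0],
              [ 0.0, -1.0,  2.0],
              [ 0.0,  0.0, -1.0]])
x_star, f_star = solve_qubo_brute_force(Q)
print(x_star, f_star)

# Separating the linear part: Q = Q' + diag[q] gives the same value,
# because x_i * x_i == x_i for binary x_i.
Q_prime = Q - np.diag(np.diag(Q))
q = np.diag(Q)
same = qubo_value(Q, x_star) == float(x_star @ Q_prime @ x_star + q @ x_star)
```

For this matrix the diagonal rewards setting each bit while the off-diagonal entries penalize adjacent pairs, so the minimizer sets the first and third bits only.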
== Properties == QUBO is scale invariant for positive factors $\alpha >0$, which leave the optimum $\boldsymbol{x}^{*}$ unchanged: $f_{\alpha \boldsymbol{Q}}(\boldsymbol{x})=\boldsymbol{x}^{\intercal}(\alpha \boldsymbol{Q})\boldsymbol{x}=\alpha (\boldsymbol{x}^{\intercal}\boldsymbol{Q}\boldsymbol{x})=\alpha f_{\boldsymbol{Q}}(\boldsymbol{x})$. In its general form, QUBO is NP-hard, and no polynomial-time algorithm for it is known. However, there are polynomially-solvable special cases, where $\boldsymbol{Q}$ has certain properties, for example: If all coefficients are positive, the optimum is trivially $\boldsymbol{x}^{*}=(0,\dots ,0)^{\intercal}$. Similarly, if all coefficients are negative, the optimum is $\boldsymbol{x}^{*}=(1,\dots ,1)^{\intercal}$. If $\boldsymbol{Q}$ is diagonal, the bits can be optimized independently, and the problem is solvable in $\mathcal{O}(n)$. The optimal variable assignments are simply $x_{i}^{*}=1$ if $Q_{ii}<0$, and $x_{i}^{*}=0$ otherwise. If all off-diagonal elements of $\boldsymbol{Q}$ are non-positive, the corresponding QUBO problem is solvable in polynomial time. QUBO can be solved using integer linear programming solvers like CPLEX or Gurobi Optimizer. This is possible since QUBO can be reformulated as a linear constrained binary optimization problem.
To achieve this, substitute the product $x_{i}x_{j}$ by an additional binary variable $z_{ij}\in \mathbb{B}$ and add the constraints $x_{i}\geq z_{ij}$, $x_{j}\geq z_{ij}$ and $x_{i}+x_{j}-1\leq z_{ij}$. Note that $z_{ij}$ can also be relaxed to continuous variables within the bounds zero and one. == Applications == QUBO is a structurally simple, yet computationally hard optimization problem. It can be used to encode a wide range of optimization problems from various scientific areas. === Maximum Cut === Given a graph $G=(V,E)$ with vertex set $V=\lbrace 1,\dots ,n\rbrace$ and edges $E\subseteq V\times V$, the maximum cut (max-cut) problem consists of finding two subsets $S,T\subseteq V$ with $T=V\setminus S$, such that the number of edges between $S$ and $T$ is maximized. The more general weighted max-cut problem assumes edge weights $w_{ij}\geq 0~\forall i,j\in V$, with $(i,j)\notin E\Rightarrow w_{ij}=0$, and asks for a partition $S,T\subseteq V$ that maximizes the sum of edge weights between $S$ and $T$, i.e., $\max_{S\subseteq V}\sum_{i\in S,j\notin S}w_{ij}.$ By setting $w_{ij}=1$ for all $(i,j)\in E$ this becomes equivalent to the original max-cut problem above, which is why we focus on this more general form in the following.
For every vertex $i\in V$ we introduce a binary variable $x_{i}$ with the interpretation $x_{i}=0$ if $i\in S$ and $x_{i}=1$ if $i\in T$. As $T=V\setminus S$, every $i$ is in exactly one set, meaning there is a 1:1 correspondence between binary vectors $\boldsymbol{x}\in \mathbb{B}^{n}$ and partitions of $V$ into two subsets. We observe that, for any $i,j\in V$, the expression $x_{i}(1-x_{j})+(1-x_{i})x_{j}$ evaluates to 1 if and only if $i$ and $j$ are in different subsets, equivalent to logical XOR. Let $\boldsymbol{W}\in \mathbb{R}_{+}^{n\times n}$ with $W_{ij}=w_{ij}~\forall i,j\in V$. By extending the above expression to matrix-vector form we find that $\boldsymbol{x}^{\intercal}\boldsymbol{W}(\boldsymbol{1}-\boldsymbol{x})+(\boldsymbol{1}-\boldsymbol{x})^{\intercal}\boldsymbol{W}\boldsymbol{x}=-2\boldsymbol{x}^{\intercal}\boldsymbol{W}\boldsymbol{x}+(\boldsymbol{W}\boldsymbol{1}+\boldsymbol{W}^{\intercal}\boldsymbol{1})^{\intercal}\boldsymbol{x}$ is the sum of weights of all edges between $S$ and $T$, where $\boldsymbol{1}=(1,1,\dots ,1)^{\intercal}\in \mathbb{R}^{n}$. As this is a quadratic function over $\boldsymbol{x}$, it is a QUBO problem whose parameter matrix we can read from the above expression as $\boldsymbol{Q}=2\boldsymbol{W}-\operatorname{diag}[\boldsymbol{W}\boldsymbol{1}+\boldsymbol{W}^{\intercal}\boldsymbol{1}],$
Phase congruency
Phase congruency is a measure of feature significance in computer images, a method of edge detection that is particularly robust against changes in illumination and contrast. == Foundations == Phase congruency reflects the behaviour of the image in the frequency domain. It has been noted that edgelike features have many of their frequency components in the same phase. The concept is similar to coherence, except that it applies to functions of different wavelength. For example, the Fourier decomposition of a square wave consists of sine functions, whose frequencies are odd multiples of the fundamental frequency. At the rising edges of the square wave, each sinusoidal component has a rising phase; the phases have maximal congruency at the edges. This corresponds to the human-perceived edges in an image where there are sharp changes between light and dark. == Definition == Phase congruency compares the weighted alignment of the Fourier components of a signal, with amplitudes $A_{n}$, with the sum of the Fourier components: $PC(t)=\max_{\bar{\phi}}\frac{\sum_{n}A_{n}\cos(\phi_{n}(t)-\bar{\phi})}{\sum_{n}A_{n}}$ where $\phi_{n}$ is the local or instantaneous phase, as can be calculated using the Hilbert transform, and $A_{n}$ is the local amplitude, or energy, of the signal. When all the phases are aligned, this is equal to 1. Several ways of implementing phase congruency have been developed, of which two versions are available in open source, one written for MATLAB and the other written in Java as a plugin for the ImageJ software. Because different notations have been used for its formulation, a unified version has recently been presented, together with a methodology for parameter tuning. == Advantages == The square-wave example is naive in that most edge detection methods deal with it equally well.
For example, the first derivative has a maximal magnitude at the edges. However, there are cases where the perceived edge does not have a sharp step or a large derivative. The method of phase congruency applies to many cases where other methods fail. A notable example is an image feature consisting of a single line, such as the letter "l". Many edge-detection algorithms will pick up two adjacent edges: the transitions from white to black and from black to white. By contrast, the phase congruency map has a single line. A simple Fourier analogy of this case is a triangle wave. At each of its crests there is a congruency of crests from different sinusoidal functions. == Disadvantages == Calculating the phase congruency map of an image is computationally intensive and sensitive to image noise. Techniques of noise reduction are usually applied prior to the calculation.
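The definition of phase congruency can be checked numerically on the square-wave example. The following Python sketch is an illustration only, not one of the published implementations: the function name `phase_congruency` is hypothetical, the series is truncated to 50 odd harmonics, and each component sin(2πnt) is written in a cosine convention so that its phase is 2πnt − π/2. It also assumes the identity that the maximum over the mean phase of Σ Aₙ cos(φₙ − φ̄) equals |Σ Aₙ exp(iφₙ)|, which avoids an explicit search.

```python
import numpy as np


def phase_congruency(amplitudes, phases):
    """Evaluate PC for components with given amplitudes and phases.

    amplitudes: shape (N,), non-negative component amplitudes A_n.
    phases: shape (N, T), phase phi_n(t) of each component at T time points.
    Uses max over phi_bar of sum A_n cos(phi_n - phi_bar) = |sum A_n exp(i phi_n)|.
    """
    amplitudes = np.asarray(amplitudes, dtype=float)
    phases = np.asarray(phases, dtype=float)
    resultant = np.abs(np.sum(amplitudes[:, None] * np.exp(1j * phases), axis=0))
    return resultant / amplitudes.sum()


# Truncated Fourier series of a unit square wave: odd harmonics 1, 3, ..., 99.
harmonics = np.arange(1, 100, 2)
amplitudes = 4.0 / (np.pi * harmonics)

# Time in units of the period: rising edge, middle of the plateau, falling edge.
t = np.array([0.0, 0.25, 0.5])

# sin(2*pi*n*t) = cos(2*pi*n*t - pi/2), so the phase of component n is:
phases = 2.0 * np.pi * harmonics[:, None] * t[None, :] - np.pi / 2.0

pc = phase_congruency(amplitudes, phases)
print(pc)
```

Under these assumptions the value at both edges is 1 up to rounding, while the value in the middle of the plateau is much lower, because the phases of successive harmonics alternate between 0 and π there.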
Sufficient dimension reduction
In statistics, sufficient dimension reduction (SDR) is a paradigm for analyzing data that combines the ideas of dimension reduction with the concept of sufficiency. Dimension reduction has long been a primary goal of regression analysis. Given a response variable $y$ and a $p$-dimensional predictor vector $\textbf{x}$, regression analysis aims to study the distribution of $y\mid\textbf{x}$, the conditional distribution of $y$ given $\textbf{x}$. A dimension reduction is a function $R(\textbf{x})$ that maps $\textbf{x}$ to a subset of $\mathbb{R}^{k}$, $k<p$, thereby reducing the dimension of $\textbf{x}$. For example, $R(\textbf{x})$ may be one or more linear combinations of $\textbf{x}$. A dimension reduction $R(\textbf{x})$ is said to be sufficient if the distribution of $y\mid R(\textbf{x})$ is the same as that of $y\mid\textbf{x}$. In other words, no information about the regression is lost in reducing the dimension of $\textbf{x}$ if the reduction is sufficient. == Graphical motivation == In a regression setting, it is often useful to summarize the distribution of $y\mid\textbf{x}$ graphically. For instance, one may consider a scatterplot of $y$ versus one or more of the predictors or a linear combination of the predictors. A scatterplot that contains all available regression information is called a sufficient summary plot. When $\textbf{x}$ is high-dimensional, particularly when $p\geq 3$, it becomes increasingly challenging to construct and visually interpret sufficient summary plots without reducing the data.
Even three-dimensional scatter plots must be viewed via a computer program, and the third dimension can only be visualized by rotating the coordinate axes. However, if there exists a sufficient dimension reduction $R(\textbf{x})$ with small enough dimension, a sufficient summary plot of $y$ versus $R(\textbf{x})$ may be constructed and visually interpreted with relative ease. Hence sufficient dimension reduction allows for graphical intuition about the distribution of $y\mid\textbf{x}$, which might not have otherwise been available for high-dimensional data. Most graphical methodology focuses primarily on dimension reduction involving linear combinations of $\textbf{x}$. The rest of this article deals only with such reductions. == Dimension reduction subspace == Suppose $R(\textbf{x})=A^{T}\textbf{x}$ is a sufficient dimension reduction, where $A$ is a $p\times k$ matrix with rank $k\leq p$. Then the regression information for $y\mid\textbf{x}$ can be inferred by studying the distribution of $y\mid A^{T}\textbf{x}$, and the plot of $y$ versus $A^{T}\textbf{x}$ is a sufficient summary plot. Without loss of generality, only the space spanned by the columns of $A$ need be considered. Let $\eta$ be a basis for the column space of $A$, and let the space spanned by $\eta$ be denoted by $\mathcal{S}(\eta)$. It follows from the definition of a sufficient dimension reduction that $$F_{y\mid x}=F_{y\mid\eta^{T}x},$$ where $F$ denotes the appropriate distribution function.
Another way to express this property is $$y\perp\!\!\!\perp\textbf{x}\mid\eta^{T}\textbf{x},$$ that is, $y$ is conditionally independent of $\textbf{x}$, given $\eta^{T}\textbf{x}$. Then the subspace $\mathcal{S}(\eta)$ is defined to be a dimension reduction subspace (DRS). === Structural dimensionality === For a regression $y\mid\textbf{x}$, the structural dimension, $d$, is the smallest number of distinct linear combinations of $\textbf{x}$ necessary to preserve the conditional distribution of $y\mid\textbf{x}$. In other words, the smallest dimension reduction that is still sufficient maps $\textbf{x}$ to a subset of $\mathbb{R}^{d}$. The corresponding DRS will be $d$-dimensional. === Minimum dimension reduction subspace === A subspace $\mathcal{S}$ is said to be a minimum DRS for $y\mid\textbf{x}$ if it is a DRS and its dimension is less than or equal to that of all other DRSs for $y\mid\textbf{x}$. A minimum DRS $\mathcal{S}$ is not necessarily unique, but its dimension is equal to the structural dimension $d$ of $y\mid\textbf{x}$, by definition. If $\mathcal{S}$ has basis $\eta$ and is a minimum DRS, then a plot of $y$ versus $\eta^{T}\textbf{x}$ is a minimal sufficient summary plot, and it is $(d+1)$-dimensional.
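As a rough numerical illustration of structural dimension, the following Python sketch simulates a regression in which $y$ depends on a five-dimensional $\textbf{x}$ only through a single linear combination, so that $d=1$, and then estimates that direction with sliced inverse regression (SIR). Everything specific here is an assumption made for the example: the function name `sir_directions`, the simulated model, the number of slices, and the sample size. It is a minimal sketch of the usual SIR recipe (standardize, slice on $y$, average within slices, eigendecompose), not a reference implementation.

```python
import numpy as np


def sir_directions(x, y, n_slices=10, n_directions=1):
    """Estimate dimension reduction directions by sliced inverse regression."""
    n, p = x.shape

    # Standardize the predictors: z = cov^{-1/2} (x - mean).
    mean = x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    z = (x - mean) @ inv_sqrt

    # Slice the sample on the sorted response and accumulate the
    # weighted covariance of the within-slice means of z.
    order = np.argsort(y)
    m = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        zbar = z[idx].mean(axis=0)
        m += (len(idx) / n) * np.outer(zbar, zbar)

    # eigh returns eigenvalues in ascending order, so reverse the columns
    # to put the leading eigenvectors first.
    _, vecs = np.linalg.eigh(m)
    top = vecs[:, ::-1][:, :n_directions]

    # Map back to the original predictor scale and normalize each column.
    eta = inv_sqrt @ top
    return eta / np.linalg.norm(eta, axis=0)


# Simulated example (assumed model): y depends on x only through beta^T x.
rng = np.random.default_rng(0)
n, p = 2000, 5
x = rng.standard_normal((n, p))
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
u = x @ beta
y = u + 0.5 * u**3 + 0.2 * rng.standard_normal(n)

eta_hat = sir_directions(x, y)[:, 0]
alignment = abs(eta_hat @ beta)  # |cosine| between estimated and true direction
print(alignment)
```

A scatterplot of `y` against `x @ eta_hat` would then play the role of a two-dimensional sufficient summary plot for this simulated regression. The monotone link is chosen deliberately, since SIR is known to miss directions when the dependence of $y$ on the linear combination is symmetric about its mean.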
== Central subspace == If a subspace $\mathcal{S}$ is a DRS for $y\mid\textbf{x}$, and if $\mathcal{S}\subset\mathcal{S}_{\text{drs}}$ for all other DRSs $\mathcal{S}_{\text{drs}}$, then it is a central dimension reduction subspace, or simply a central subspace, and it is denoted by $\mathcal{S}_{y\mid x}$. In other words, a central subspace for $y\mid\textbf{x}$ exists if and only if the intersection $\bigcap\mathcal{S}_{\text{drs}}$ of all dimension reduction subspaces is also a dimension reduction subspace, and that intersection is the central subspace $\mathcal{S}_{y\mid x}$. The central subspace $\mathcal{S}_{y\mid x}$ does not necessarily exist because the intersection $\bigcap\mathcal{S}_{\text{drs}}$ is not necessarily a DRS. However, if $\mathcal{S}_{y\mid x}$ does exist, then it is also the unique minimum dimension reduction subspace. === Existence of the central subspace === While the existence of the central subspace $\mathcal{S}_{y\mid x}$ is not guaranteed in every regression situation, there are some rather broad conditions under which its existence follows directly. For example, consider the following proposition from Cook (1998): Let $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ be dimension reduction subspaces for $y\mid\textbf{x}$. If $\textbf{x}$ has density $f(a)>0$ for all $a\in\Omega_{x}$ and $f(a)=0$ everywhere else, where $\Omega_{x}$ is convex, then the intersection $\mathcal{S}_{1}\cap\mathcal{S}_{2}$ is also a dimension reduction subspace.
It follows from this proposition that the central subspace $\mathcal{S}_{y\mid x}$ exists for such $\textbf{x}$. == Methods for dimension reduction == There are many existing methods for dimension reduction, both graphical and numeric. For example, sliced inverse regression (SIR) and sliced average variance estimation (SAVE) were introduced in the 1990s and continue to be widely used. Although SIR was originally designed to estimate an effective dimension reducing subspace, it is now understood that it estimates only the central subspace, which is generally different. More recent methods for dimension reduction include likelihood-based sufficient dimension reduction, estimating the central subspace based on the inverse third moment (or $k$th moment), estimating the central solution space, graphical regression, the envelope model, and the principal support vector machine. For more details on these and other methods, consult the statistical literature. Principal component analysis (PCA) and similar methods for dimension reduction are not based on the sufficiency principle. === Example: linear regression === Consider the regression model $$y=\alpha+\beta^{T}\textbf{x}+\varepsilon,\text{ where }\varepsilon\perp\!\!\!\perp\textbf{x}.$$ Note that the distribution of $y\mid\textbf{x}$ is the same as the distribution of $y\mid\beta^{T}\textbf{x}$