Ordination (statistics)

Ordination (statistics)

Ordination or gradient analysis, in multivariate analysis, is a method complementary to data clustering, and used mainly in exploratory data analysis (rather than in hypothesis testing). In contrast to cluster analysis, ordination orders quantities in a (usually lower-dimensional) latent space. In the ordination space, quantities that are near each other share attributes (i.e., are similar to some degree), and dissimilar objects are farther from each other. Such relationships between the objects, on each of several axes or latent variables, are then characterized numerically and/or graphically in a biplot. The first ordination method, principal components analysis, was suggested by Karl Pearson in 1901. == Methods == Ordination methods can broadly be categorized in eigenvector-, algorithm-, or model-based methods. Many classical ordination techniques, including principal components analysis, correspondence analysis (CA) and its derivatives (detrended correspondence analysis, canonical correspondence analysis, and redundancy analysis, belong to the first group). The second group includes some distance-based methods such as non-metric multidimensional scaling, and machine learning methods such as T-distributed stochastic neighbor embedding and nonlinear dimensionality reduction. The third group includes model-based ordination methods, which can be considered as multivariate extensions of Generalized Linear Models. Model-based ordination methods are more flexible in their application than classical ordination methods, so that it is for example possible to include random-effects. Unlike in the aforementioned two groups, there is no (implicit or explicit) distance measure in the ordination. Instead, a distribution needs to be specified for the responses as is typical for statistical models. These and other assumptions, such as the assumed mean-variance relationship, can be validated with the use of residual diagnostics, unlike in other ordination methods. == Applications == Ordination can be used on the analysis of any set of multivariate objects. It is frequently used in several environmental or ecological sciences, particularly plant community ecology. It is also used in genetics and systems biology for microarray data analysis and in psychometrics.

Language engineering

Language engineering involves the creation of natural language processing systems, whose cost and outputs are measurable and predictable. It is a distinct field contrasted to natural language processing and computational linguistics. A recent trend of language engineering is the use of Semantic Web technologies for the creation, archiving, processing, and retrieval of machine processable language data. Meta-Language Engineering is a proposed extension of Language Engineering first recorded in 2025, associated with the work of Delyone de Paula Canedo Filho. The term is used to designate an approach that, in addition to natural language processing, encompasses the symbolic, cognitive, and epistemological structuring of language systems.

T-distributed stochastic neighbor embedding

t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by Geoffrey Hinton and Sam Roweis, where Laurens van der Maaten and Hinton proposed the t-distributed variant. It is a nonlinear dimensionality reduction technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate. A Riemannian variant is UMAP. t-SNE has been used for visualization in a wide range of applications, including genomics, computer security research, natural language processing, music analysis, cancer research, bioinformatics, geological domain interpretation, and biomedical signal processing. For a data set with n {\displaystyle n} elements, t-SNE runs in O ( n 2 ) {\displaystyle O(n^{2})} time and requires O ( n 2 ) {\displaystyle O(n^{2})} space. == Details == Given a set of N {\displaystyle N} high-dimensional objects x 1 , … , x N {\displaystyle \mathbf {x} _{1},\dots ,\mathbf {x} _{N}} , t-SNE first computes probabilities p i j {\displaystyle p_{ij}} that are proportional to the similarity of objects x i {\displaystyle \mathbf {x} _{i}} and x j {\displaystyle \mathbf {x} _{j}} , as follows. For i ≠ j {\displaystyle i\neq j} , define p j ∣ i = exp ⁡ ( − ‖ x i − x j ‖ 2 / 2 σ i 2 ) ∑ k ≠ i exp ⁡ ( − ‖ x i − x k ‖ 2 / 2 σ i 2 ) {\displaystyle p_{j\mid i}={\frac {\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{j}\rVert ^{2}/2\sigma _{i}^{2})}{\sum _{k\neq i}\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{k}\rVert ^{2}/2\sigma _{i}^{2})}}} and set p i ∣ i = 0 {\displaystyle p_{i\mid i}=0} . Note the above denominator ensures ∑ j p j ∣ i = 1 {\displaystyle \sum _{j}p_{j\mid i}=1} for all i {\displaystyle i} . As van der Maaten and Hinton explained: "The similarity of datapoint x j {\displaystyle x_{j}} to datapoint x i {\displaystyle x_{i}} is the conditional probability, p j | i {\displaystyle p_{j|i}} , that x i {\displaystyle x_{i}} would pick x j {\displaystyle x_{j}} as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x i {\displaystyle x_{i}} ." Now define p i j = p j ∣ i + p i ∣ j 2 N {\displaystyle p_{ij}={\frac {p_{j\mid i}+p_{i\mid j}}{2N}}} This is motivated because p i {\displaystyle p_{i}} and p j {\displaystyle p_{j}} from the N samples are estimated as 1/N, so the conditional probability can be written as p i ∣ j = N p i j {\displaystyle p_{i\mid j}=Np_{ij}} and p j ∣ i = N p j i {\displaystyle p_{j\mid i}=Np_{ji}} . Since p i j = p j i {\displaystyle p_{ij}=p_{ji}} , you can obtain previous formula. Also note that p i i = 0 {\displaystyle p_{ii}=0} and ∑ i , j p i j = 1 {\displaystyle \sum _{i,j}p_{ij}=1} . The bandwidth of the Gaussian kernels σ i {\displaystyle \sigma _{i}} is set in such a way that the entropy of the conditional distribution equals a predefined entropy using the bisection method. As a result, the bandwidth is adapted to the density of the data: smaller values of σ i {\displaystyle \sigma _{i}} are used in denser parts of the data space. The entropy increases with the perplexity of this distribution P i {\displaystyle P_{i}} ; this relation is seen as P e r p ( P i ) = 2 H ( P i ) {\displaystyle Perp(P_{i})=2^{H(P_{i})}} where H ( P i ) {\displaystyle H(P_{i})} is the Shannon entropy H ( P i ) = − ∑ j p j | i log 2 ⁡ p j | i . {\displaystyle H(P_{i})=-\sum _{j}p_{j|i}\log _{2}p_{j|i}.} The perplexity is a hand-chosen parameter of t-SNE, and as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.". Since the Gaussian kernel uses the Euclidean distance ‖ x i − x j ‖ {\displaystyle \lVert x_{i}-x_{j}\rVert } , it is affected by the curse of dimensionality, and in high dimensional data when distances lose the ability to discriminate, the p i j {\displaystyle p_{ij}} become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this. t-SNE aims to learn a d {\displaystyle d} -dimensional map y 1 , … , y N {\displaystyle \mathbf {y} _{1},\dots ,\mathbf {y} _{N}} (with y i ∈ R d {\displaystyle \mathbf {y} _{i}\in \mathbb {R} ^{d}} and d {\displaystyle d} typically chosen as 2 or 3) that reflects the similarities p i j {\displaystyle p_{ij}} as well as possible. To this end, it measures similarities q i j {\displaystyle q_{ij}} between two points in the map y i {\displaystyle \mathbf {y} _{i}} and y j {\displaystyle \mathbf {y} _{j}} , using a very similar approach. Specifically, for i ≠ j {\displaystyle i\neq j} , define q i j {\displaystyle q_{ij}} as q i j = ( 1 + ‖ y i − y j ‖ 2 ) − 1 ∑ k ∑ l ≠ k ( 1 + ‖ y k − y l ‖ 2 ) − 1 {\displaystyle q_{ij}={\frac {(1+\lVert \mathbf {y} _{i}-\mathbf {y} _{j}\rVert ^{2})^{-1}}{\sum _{k}\sum _{l\neq k}(1+\lVert \mathbf {y} _{k}-\mathbf {y} _{l}\rVert ^{2})^{-1}}}} and set q i i = 0 {\displaystyle q_{ii}=0} . Herein a heavy-tailed Student t-distribution (with one-degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map. The locations of the points y i {\displaystyle \mathbf {y} _{i}} in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution P {\displaystyle P} from the distribution Q {\displaystyle Q} , that is: K L ( P ∥ Q ) = ∑ i ≠ j p i j log ⁡ p i j q i j {\displaystyle \mathrm {KL} \left(P\parallel Q\right)=\sum _{i\neq j}p_{ij}\log {\frac {p_{ij}}{q_{ij}}}} The minimization of the Kullback–Leibler divergence with respect to the points y i {\displaystyle \mathbf {y} _{i}} is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs. == Output == While t-SNE plots often seem to display clusters, the visual clusters can be strongly influenced by the chosen parameterization (especially the perplexity) and so a good understanding of the parameters for t-SNE is needed. Such "clusters" can be shown to even appear in structured data with no clear clustering, and so may be false findings. Similarly, the size of clusters produced by t-SNE is not informative, and neither is the distance between clusters. Thus, interactive exploration may be needed to choose parameters and validate results. It has been shown that t-SNE can often recover well-separated clusters, and with special parameter choices, approximates a simple form of spectral clustering. == Software == A C++ implementation of Barnes-Hut is available on the github account of one of the original authors. The R package Rtsne implements t-SNE in R. ELKI contains tSNE, also with Barnes-Hut approximation scikit-learn, a popular machine learning library in Python implements t-SNE with both exact solutions and the Barnes-Hut approximation. Tensorboard, the visualization kit associated with TensorFlow, also implements t-SNE (online version) The Julia package TSne implements t-SNE

Constructing skill trees

Constructing skill trees (CST) is a hierarchical reinforcement learning algorithm which can build skill trees from a set of sample solution trajectories obtained from demonstration. CST uses an incremental MAP (maximum a posteriori) change point detection algorithm to segment each demonstration trajectory into skills and integrate the results into a skill tree. CST was introduced by George Konidaris, Scott Kuindersma, Andrew Barto and Roderic Grupen in 2010. == Algorithm == CST consists of mainly three parts;change point detection, alignment and merging. The main focus of CST is online change-point detection. The change-point detection algorithm is used to segment data into skills and uses the sum of discounted reward R t {\displaystyle R_{t}} as the target regression variable. Each skill is assigned an appropriate abstraction. A particle filter is used to control the computational complexity of CST. The change point detection algorithm is implemented as follows. The data for times t ∈ T {\displaystyle t\in T} and models Q with prior p ( q ∈ Q ) {\displaystyle p(q\in Q)} are given. The algorithm is assumed to be able to fit a segment from time j + 1 {\displaystyle j+1} to t using model q with the fit probability P ( j , t , q ) {\displaystyle P(j,t,q)_{}^{}} . A linear regression model with Gaussian noise is used to compute P ( j , t , q ) {\displaystyle P(j,t,q)} . The Gaussian noise prior has mean zero, and variance which follows I n v e r s e G a m m a ( v 2 , u 2 ) {\displaystyle \mathrm {InverseGamma} \left({\frac {v}{2}},{\frac {u}{2}}\right)} . The prior for each weight follows N o r m a l ( 0 , σ 2 δ ) {\displaystyle \mathrm {Normal} (0,\sigma ^{2}\delta )} . The fit probability P ( j , t , q ) {\displaystyle P(j,t,q)} is computed by the following equation. P ( j , t , q ) = π − n 2 δ m | ( A + D ) − 1 | 1 2 u v 2 ( y + u ) u + v 2 Γ ( n + v 2 ) Γ ( v 2 ) {\displaystyle P(j,t,q)={\frac {\pi ^{-{\frac {n}{2}}}}{\delta ^{m}}}\left|(A+D)^{-1}\right|^{\frac {1}{2}}{\frac {u^{\frac {v}{2}}}{(y+u)^{\frac {u+v}{2}}}}{\frac {\Gamma ({\frac {n+v}{2}})}{\Gamma ({\frac {v}{2}})}}} Then, CST compute the probability of the changepoint at time j with model q, P t ( j , q ) {\displaystyle P_{t}(j,q)} and P j MAP {\displaystyle P_{j}^{\text{MAP}}} using a Viterbi algorithm. P t ( j , q ) = ( 1 − G ( t − j − 1 ) ) P ( j , t , q ) p ( q ) P j MAP {\displaystyle P_{t}(j,q)=(1-G(t-j-1))P(j,t,q)p(q)P_{j}^{\text{MAP}}} P j MAP = max i , q P j ( i , q ) g ( j − i ) 1 − G ( j − i − 1 ) , ∀ j < t {\displaystyle P_{j}^{\text{MAP}}=\max _{i,q}{\frac {P_{j}(i,q)g(j-i)}{1-G(j-i-1)}},\forall j

Multi expression programming

Multi Expression Programming (MEP) is an evolutionary algorithm for generating mathematical functions describing a given set of data. MEP is a Genetic Programming variant encoding multiple solutions in the same chromosome. MEP representation is not specific (multiple representations have been tested). In the simplest variant, MEP chromosomes are linear strings of instructions. This representation was inspired by Three-address code. MEP strength consists in the ability to encode multiple solutions, of a problem, in the same chromosome. In this way, one can explore larger zones of the search space. For most of the problems this advantage comes with no running-time penalty compared with genetic programming variants encoding a single solution in a chromosome. == Representation == MEP chromosomes are arrays of instructions represented in Three-address code format. Each instruction contains a variable, a constant, or a function. If the instruction is a function, then the arguments (given as instruction's addresses) are also present. === Example of MEP program === Here is a simple MEP chromosome (labels on the left side are not a part of the chromosome): 1: a 2: b 3: + 1, 2 4: c 5: d 6: + 4, 5 7: 3, 5 == Fitness computation == When the chromosome is evaluated it is unclear which instruction will provide the output of the program. In many cases, a set of programs is obtained, some of them being completely unrelated (they do not have common instructions). For the above chromosome, here is the list of possible programs obtained during decoding: E1 = a, E2 = b, E4 = c, E5 = d, E3 = a + b. E6 = c + d. E7 = (a + b) d. Each instruction is evaluated as a possible output of the program. The fitness (or error) is computed in a standard manner. For instance, in the case of symbolic regression, the fitness is the sum of differences (in absolute value) between the expected output (called target) and the actual output. == Fitness assignment process == Which expression will represent the chromosome? Which one will give the fitness of the chromosome? In MEP, the best of them (which has the lowest error) will represent the chromosome. This is different from other GP techniques: In Linear genetic programming the last instruction will give the output. In Cartesian Genetic Programming the gene providing the output is evolved like all other genes. Note that, for many problems, this evaluation has the same complexity as in the case of encoding a single solution in each chromosome. Thus, there is no penalty in running time compared to other techniques. == Software == === MEPX === MEPX is a cross-platform (Windows, macOS, and Linux Ubuntu) free software for the automatic generation of computer programs. It can be used for data analysis, particularly for solving symbolic regression, statistical classification and time-series problems. === libmep === Libmep is a free and open source library implementing Multi Expression Programming technique. It is written in C++. === hmep === hmep is a new open source library implementing Multi Expression Programming technique in Haskell programming language.

VOCEDplus

VOCEDplus is a free international research database about tertiary education, maintained and developed by staff at the c (NCVER) in Adelaide, South Australia. The focus of the database content is the relation of post-compulsory education and training to workforce needs, skills development, and social inclusion. == Structure == The content of the VOCEDplus database encompasses vocational education and training (VET), higher education, lifelong learning, informal learning, VET in schools, adult and community education, apprenticeships/traineeships, international education, providers of education and training, and workforce development. It is international in scope and contains over 84,000 English language records, many with links to full text documents. VOCEDplus contains extensive Australian materials and includes a wide range of international information, covering outcomes of tertiary education in the shape of published research, practice, policy, and statistics. Entries are included for the following types of publications: reports; annual reports; papers; discussion papers; occasional papers; working papers; books; book chapters; conference papers; conference proceedings; journals; journal articles; policy documents; published statistics; theses; podcasts; and teaching and training materials. Each database entry contains standard bibliographic information and an abstract. Many entries include full text access via the publisher's website or a digitised copy. == History == === 1989-1997 === In the early years VOCEDplus was known as VOCED. The original database was produced by a network of clearinghouses across Australia with the aim of sharing activities in the technical and further education (TAFE) sector. VOCED was produced in hardcopy and an electronic version was distributed on diskette. === 1997-2001 === 1997 - the first web version of VOCED was made available from the National Centre for Vocational Education Research (NCVER) organisational website 1998 - a major project to upgrade the database and expand its international coverage commenced 2001 - creation of VOCED's own website 2001 - VOCED endorsed as the UNESCO international database for technical and vocational education and training (TVET) research information === 2001-2009 === Many changes to the database and website occurred during this period with a focus on continuous improvement to meet the needs of users and utilise emerging technologies. 2006 - materials produced for two adult literacy and learning programs funded by the Australian Department of Education, Employment and Workplace Relations (DEEWR) - the Workplace English Language and Learning (WELL) Programme and the Adult Literacy National Project (ALNP) included in VOCED 2007 - the Australian clearinghouse network transferred most of the hardcopy collections to NCVER, to form a centralised repository of resources 2009 - materials produced by Reframing the Future (RTF) a vocational education and training workforce development initiative of the Australian, State and Territory Governments included in VOCED === 2009-2014 === A major rebuild of the database and website was undertaken during this period to take advantage of the potential of new technologies to provide improved services and incorporate Web 2.0 technologies (RSS feeds, and share and bookmark tools). 2009 - scope expanded to more fully encompass the higher education sector 2011 - launch of VOCEDplus with the name change representing the enhanced features and extended focus 2012 - a major retrospective digitisation project commenced and by the end of the 2012-2013 financial year a total of 9,328 publications (593,534 pages/microfiche frames) had been digitised, ensuring these publications are available electronically for free === 2014-2019 === A number of significant curated content products were released during this period. 2015 - release of a refreshed look to adopt the new NCVER branding plus a number of search enhancements (Guided search, Expert search, and Glossary search) were added 2015 - first in the series of 'Focus on...' pages released 2016 - launch of the 'Pod Network', a convenient and efficient platform that allows instant access to research and a multitude of resources on a range of subjects 2017 - completion of the 'Pod Network', consisting of 20 Pods (on broad subjects including Apprenticeships and traineeships, Foundation skills, Teaching and learning, Career development, and Students) and 74 Podlets (on narrow topics including Online learning, Social media, VET in schools, STEM skills, and Adult literacy) 2018 - launch of the 'Timeline of Australian VET Policy Initiatives' and the 'VET Knowledge Bank' which contains a suite of products capturing Australia's diverse, complex and ever-changing VET system 2019 - after an internal review, a refreshed, streamlined version of the 'Pod Network' was released, consisting of 13 Pods and 20 Podlets 2019 - launch of the 'VET Practitioner Resource' which contains a range of information to support VET practitioners in their work and is organised into three sections: (1) Teaching, training and assessment: standards, guidance, research and good practice resources to inform daily work; (2) Practitioners as researchers: information for undertaking practitioner-led research; and (3) The VET workforce: information about VET teachers and trainers, and the professional development needs of the VET workforce 2019 - VOCEDplus celebrated 30 years of providing information to the tertiary education sector and the homepage was refreshed to make it more modern and easier to use === 2020- === VOCEDplus continued to be accessible throughout the COVID-19 pandemic. 2020-2021 - the VET Knowledge Bank added a dedicated page, 'COVID-19 announcements', that showcases the measures introduced by the Australian, state and territory governments to mitigate the impact of the pandemic and promote economic recovery 2020-2024 - published research about the effects of the pandemic on education and training, providers, students, labour markets, employment and employees was collected and made permanently available in the database 2024 - VOCEDplus celebrated 35 years of providing information to the tertiary education sector. The homepage was refreshed and a number of enhancements and new features were implemented including a new My Profile feature, improvements to My Selection, accessible search history and saved searches, enhanced search functionality, and improved navigation.

Synaptic weight

In neuroscience and computer science, synaptic weight refers to the strength or amplitude of a connection between two nodes, corresponding in biology to the amount of influence the firing of one neuron has on another. The term is typically used in artificial and biological neural network research. == Computation == In a computational neural network, a vector or set of inputs x {\displaystyle {\textbf {x}}} and outputs y {\displaystyle {\textbf {y}}} , or pre- and post-synaptic neurons respectively, are interconnected with synaptic weights represented by the matrix w {\displaystyle w} , where for a linear neuron y j = ∑ i w i j x i or y = w x {\displaystyle y_{j}=\sum _{i}w_{ij}x_{i}~~{\textrm {or}}~~{\textbf {y}}=w{\textbf {x}}} . where the rows of the synaptic matrix represent the vector of synaptic weights for the output indexed by j {\displaystyle j} . The synaptic weight is changed by using a learning rule, the most basic of which is Hebb's rule, which is usually stated in biological terms as Neurons that fire together, wire together. Computationally, this means that if a large signal from one of the input neurons results in a large signal from one of the output neurons, then the synaptic weight between those two neurons will increase. The rule is unstable, however, and is typically modified using such variations as Oja's rule, radial basis functions or the backpropagation algorithm. == Biology == For biological networks, the effect of synaptic weights is not as simple as for linear neurons or Hebbian learning. However, biophysical models such as BCM theory have seen some success in mathematically describing these networks. In the mammalian central nervous system, signal transmission is carried out by interconnected networks of nerve cells, or neurons. For the basic pyramidal neuron, the input signal is carried by the axon, which releases neurotransmitter chemicals into the synapse which is picked up by the dendrites of the next neuron, which can then generate an action potential which is analogous to the output signal in the computational case. The synaptic weight in this process is determined by several variable factors: How well the input signal propagates through the axon (see myelination), The amount of neurotransmitter released into the synapse and the amount that can be absorbed in the following cell (determined by the number of AMPA and NMDA receptors on the cell membrane and the amount of intracellular calcium and other ions), The number of such connections made by the axon to the dendrites, How well the signal propagates and integrates in the postsynaptic cell. The changes in synaptic weight that occur is known as synaptic plasticity, and the process behind long-term changes (long-term potentiation and depression) is still poorly understood. Hebb's original learning rule was originally applied to biological systems, but has had to undergo many modifications as a number of theoretical and experimental problems came to light.