Unit of work

Unit of work

A unit of work is a behavioral pattern in software development. Martin Fowler has defined it as everything one does during a business transaction which can affect the database. When the unit of work is finished, it will provide everything that needs to be done to change the database as a result of the work. A unit of work encapsulates one or more code repositories[de] and a list of actions to be performed which are necessary for the successful implementation of self-contained and consistent data change. A unit of work is also responsible for handling concurrency issues, and can be used for transactions and stability patterns.[de]

Computer Graphics International

Computer Graphics International (CGI) is one of the oldest annual international conferences on computer graphics. It is organized by the Computer Graphics Society (CGS). Researchers across the whole world are invited to share their experiences and novel achievements in various fields - like computer graphics and human-computer interaction. Former conferences have been held recently in Hong Kong (China), Geneva (Switzerland), Shanghai (China), Geneva (virtually), Calgary (Canada), Bintan (Indonesia) and Yokohama (Japan). == Awards == Starting in the year of 2013, CGI has given yearly a Best Paper Award and a Career Achievement Award. == Venues ==

Semidefinite embedding

Maximum Variance Unfolding (MVU), also known as Semidefinite Embedding (SDE), is an algorithm in computer science that uses semidefinite programming to perform non-linear dimensionality reduction of high-dimensional vectorial input data. It is motivated by the observation that kernel Principal Component Analysis (kPCA) does not reduce the data dimensionality, as it leverages the Kernel trick to non-linearly map the original data into an inner-product space. == Algorithm == MVU creates a mapping from the high dimensional input vectors to some low dimensional Euclidean vector space in the following steps: A neighbourhood graph is created. Each input is connected with its k-nearest input vectors (according to Euclidean distance metric) and all k-nearest neighbors are connected with each other. If the data is sampled well enough, the resulting graph is a discrete approximation of the underlying manifold. The neighbourhood graph is "unfolded" with the help of semidefinite programming. Instead of learning the output vectors directly, the semidefinite programming aims to find an inner product matrix that maximizes the pairwise distances between any two inputs that are not connected in the neighbourhood graph while preserving the nearest neighbors distances. The low-dimensional embedding is finally obtained by application of multidimensional scaling on the learned inner product matrix. The steps of applying semidefinite programming followed by a linear dimensionality reduction step to recover a low-dimensional embedding into a Euclidean space were first proposed by Linial, London, and Rabinovich. == Optimization formulation == Let X {\displaystyle X\,\!} be the original input and Y {\displaystyle Y\,\!} be the embedding. If i , j {\displaystyle i,j\,\!} are two neighbors, then the local isometry constraint that needs to be satisfied is: | X i − X j | 2 = | Y i − Y j | 2 {\displaystyle |X_{i}-X_{j}|^{2}=|Y_{i}-Y_{j}|^{2}\,\!} Let G , K {\displaystyle G,K\,\!} be the Gram matrices of X {\displaystyle X\,\!} and Y {\displaystyle Y\,\!} (i.e.: G i j = X i ⋅ X j , K i j = Y i ⋅ Y j {\displaystyle G_{ij}=X_{i}\cdot X_{j},K_{ij}=Y_{i}\cdot Y_{j}\,\!} ). We can express the above constraint for every neighbor points i , j {\displaystyle i,j\,\!} in term of G , K {\displaystyle G,K\,\!} : G i i + G j j − G i j − G j i = K i i + K j j − K i j − K j i {\displaystyle G_{ii}+G_{jj}-G_{ij}-G_{ji}=K_{ii}+K_{jj}-K_{ij}-K_{ji}\,\!} In addition, we also want to constrain the embedding Y {\displaystyle Y\,\!} to center at the origin: 0 = | ∑ i Y i | 2 ⇔ ( ∑ i Y i ) ⋅ ( ∑ i Y i ) ⇔ ∑ i , j Y i ⋅ Y j ⇔ ∑ i , j K i j {\displaystyle 0=|\sum _{i}Y_{i}|^{2}\Leftrightarrow (\sum _{i}Y_{i})\cdot (\sum _{i}Y_{i})\Leftrightarrow \sum _{i,j}Y_{i}\cdot Y_{j}\Leftrightarrow \sum _{i,j}K_{ij}} As described above, except the distances of neighbor points are preserved, the algorithm aims to maximize the pairwise distance of every pair of points. The objective function to be maximized is: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 {\displaystyle T(Y)={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}} Intuitively, maximizing the function above is equivalent to pulling the points as far away from each other as possible and therefore "unfold" the manifold. The local isometry constraint Let τ = m a x { η i j | Y i − Y j | 2 } {\displaystyle \tau =max\{\eta _{ij}|Y_{i}-Y_{j}|^{2}\}\,\!} where η i j := { 1 if i is a neighbour of j 0 otherwise . {\displaystyle \eta _{ij}:={\begin{cases}1&{\mbox{if}}\ i{\mbox{ is a neighbour of }}j\\0&{\mbox{otherwise}}.\end{cases}}} prevents the objective function from diverging (going to infinity). Since the graph has N points, the distance between any two points | Y i − Y j | 2 ≤ N τ {\displaystyle |Y_{i}-Y_{j}|^{2}\leq N\tau \,\!} . We can then bound the objective function as follows: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 ≤ 1 2 N ∑ i , j ( N τ ) 2 = N 3 τ 2 2 {\displaystyle T(Y)={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}\leq {\dfrac {1}{2N}}\sum _{i,j}(N\tau )^{2}={\dfrac {N^{3}\tau ^{2}}{2}}\,\!} The objective function can be rewritten purely in the form of the Gram matrix: T ( Y ) = 1 2 N ∑ i , j | Y i − Y j | 2 = 1 2 N ∑ i , j ( Y i 2 + Y j 2 − Y i ⋅ Y j − Y j ⋅ Y i ) = 1 2 N ( ∑ i , j Y i 2 + ∑ i , j Y j 2 − ∑ i , j Y i ⋅ Y j − ∑ i , j Y j ⋅ Y i ) = 1 2 N ( ∑ i , j Y i 2 + ∑ i , j Y j 2 − 0 − 0 ) = 1 N ( ∑ i Y i 2 ) = 1 N ( T r ( K ) ) {\displaystyle {\begin{aligned}T(Y)&{}={\dfrac {1}{2N}}\sum _{i,j}|Y_{i}-Y_{j}|^{2}\\&{}={\dfrac {1}{2N}}\sum _{i,j}(Y_{i}^{2}+Y_{j}^{2}-Y_{i}\cdot Y_{j}-Y_{j}\cdot Y_{i})\\&{}={\dfrac {1}{2N}}(\sum _{i,j}Y_{i}^{2}+\sum _{i,j}Y_{j}^{2}-\sum _{i,j}Y_{i}\cdot Y_{j}-\sum _{i,j}Y_{j}\cdot Y_{i})\\&{}={\dfrac {1}{2N}}(\sum _{i,j}Y_{i}^{2}+\sum _{i,j}Y_{j}^{2}-0-0)\\&{}={\dfrac {1}{N}}(\sum _{i}Y_{i}^{2})={\dfrac {1}{N}}(Tr(K))\\\end{aligned}}\,\!} Finally, the optimization can be formulated as: Maximize T r ( K ) subject to K ⪰ 0 , ∑ i j K i j = 0 and G i i + G j j − G i j − G j i = K i i + K j j − K i j − K j i , ∀ i , j where η i j = 1 , {\displaystyle {\begin{aligned}&{\text{Maximize}}&&Tr(\mathbf {K} )\\&{\text{subject to}}&&\mathbf {K} \succeq 0,\sum _{ij}\mathbf {K} _{ij}=0\\&{\text{and}}&&G_{ii}+G_{jj}-G_{ij}-G_{ji}=K_{ii}+K_{jj}-K_{ij}-K_{ji},\forall i,j{\mbox{ where }}\eta _{ij}=1,\end{aligned}}} After the Gram matrix K {\displaystyle K\,\!} is learned by semidefinite programming, the output Y {\displaystyle Y\,\!} can be obtained via Cholesky decomposition. In particular, the Gram matrix can be written as K i j = ∑ α = 1 N ( λ α V α i V α j ) {\displaystyle K_{ij}=\sum _{\alpha =1}^{N}(\lambda _{\alpha }V_{\alpha i}V_{\alpha j})\,\!} where V α i {\displaystyle V_{\alpha i}\,\!} is the i-th element of eigenvector V α {\displaystyle V_{\alpha }\,\!} of the eigenvalue λ α {\displaystyle \lambda _{\alpha }\,\!} . It follows that the α {\displaystyle \alpha \,\!} -th element of the output Y i {\displaystyle Y_{i}\,\!} is λ α V α i {\displaystyle {\sqrt {\lambda _{\alpha }}}V_{\alpha i}\,\!} .

Constructing skill trees

Constructing skill trees (CST) is a hierarchical reinforcement learning algorithm which can build skill trees from a set of sample solution trajectories obtained from demonstration. CST uses an incremental MAP (maximum a posteriori) change point detection algorithm to segment each demonstration trajectory into skills and integrate the results into a skill tree. CST was introduced by George Konidaris, Scott Kuindersma, Andrew Barto and Roderic Grupen in 2010. == Algorithm == CST consists of mainly three parts;change point detection, alignment and merging. The main focus of CST is online change-point detection. The change-point detection algorithm is used to segment data into skills and uses the sum of discounted reward R t {\displaystyle R_{t}} as the target regression variable. Each skill is assigned an appropriate abstraction. A particle filter is used to control the computational complexity of CST. The change point detection algorithm is implemented as follows. The data for times t ∈ T {\displaystyle t\in T} and models Q with prior p ( q ∈ Q ) {\displaystyle p(q\in Q)} are given. The algorithm is assumed to be able to fit a segment from time j + 1 {\displaystyle j+1} to t using model q with the fit probability P ( j , t , q ) {\displaystyle P(j,t,q)_{}^{}} . A linear regression model with Gaussian noise is used to compute P ( j , t , q ) {\displaystyle P(j,t,q)} . The Gaussian noise prior has mean zero, and variance which follows I n v e r s e G a m m a ( v 2 , u 2 ) {\displaystyle \mathrm {InverseGamma} \left({\frac {v}{2}},{\frac {u}{2}}\right)} . The prior for each weight follows N o r m a l ( 0 , σ 2 δ ) {\displaystyle \mathrm {Normal} (0,\sigma ^{2}\delta )} . The fit probability P ( j , t , q ) {\displaystyle P(j,t,q)} is computed by the following equation. P ( j , t , q ) = π − n 2 δ m | ( A + D ) − 1 | 1 2 u v 2 ( y + u ) u + v 2 Γ ( n + v 2 ) Γ ( v 2 ) {\displaystyle P(j,t,q)={\frac {\pi ^{-{\frac {n}{2}}}}{\delta ^{m}}}\left|(A+D)^{-1}\right|^{\frac {1}{2}}{\frac {u^{\frac {v}{2}}}{(y+u)^{\frac {u+v}{2}}}}{\frac {\Gamma ({\frac {n+v}{2}})}{\Gamma ({\frac {v}{2}})}}} Then, CST compute the probability of the changepoint at time j with model q, P t ( j , q ) {\displaystyle P_{t}(j,q)} and P j MAP {\displaystyle P_{j}^{\text{MAP}}} using a Viterbi algorithm. P t ( j , q ) = ( 1 − G ( t − j − 1 ) ) P ( j , t , q ) p ( q ) P j MAP {\displaystyle P_{t}(j,q)=(1-G(t-j-1))P(j,t,q)p(q)P_{j}^{\text{MAP}}} P j MAP = max i , q P j ( i , q ) g ( j − i ) 1 − G ( j − i − 1 ) , ∀ j < t {\displaystyle P_{j}^{\text{MAP}}=\max _{i,q}{\frac {P_{j}(i,q)g(j-i)}{1-G(j-i-1)}},\forall j

Persian Speech Corpus

The Persian Speech Corpus is a Modern Persian speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of about 2.5 hours of Persian speech aligned with recorded speech on the phoneme level, including annotations of word boundaries. Previous spoken corpora of Persian include FARSDAT, which consists of read aloud speech from newspaper texts from 100 Persian speakers and the Telephone FARsi Spoken language DATabase (TFARSDAT) which comprises seven hours of read and spontaneous speech produced by 60 native speakers of Persian from ten regions of Iran. The Persian Speech Corpus was built using the same methodologies laid out in the doctoral project on Modern Standard Arabic of Nawar Halabi at the University of Southampton. The work was funded by MicroLinkPC, who own an exclusive license to commercialise the corpus, though the corpus is available for non-commercial use through the corpus' website. It is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The corpus was built for speech synthesis purposes, but has been used for building HMM based voices in Persian. It can also be used to automatically align other speech corpora with their phonetic transcript and could be used as part of a larger corpus for training speech recognition systems. == Contents == The corpus is downloadable from its website, and contains the following: 396 .wav files containing spoken utterances 396 .lab files containing text utterances 396 .TextGrid files containing the phoneme labels with time stamps of the boundaries where these occur in the .wav files. phonetic-transcript.txt which has the form "[wav_filename]" "[Phoneme Sequence]" in every line orthographic-transcript.txt which has the form "[wav_filename]" "[Orthographic Transcript]" in every line

Version space learning

Version space learning is a logical approach to machine learning, specifically binary classification. Version space learning algorithms search a predefined space of hypotheses, viewed as a set of logical sentences. Formally, the hypothesis space is a disjunction H 1 ∨ H 2 ∨ . . . ∨ H n {\displaystyle H_{1}\lor H_{2}\lor ...\lor H_{n}} (i.e., one or more of hypotheses 1 through n are true). A version space learning algorithm is presented with examples, which it will use to restrict its hypothesis space; for each example x, the hypotheses that are inconsistent with x are removed from the space. This iterative refining of the hypothesis space is called the candidate elimination algorithm, the hypothesis space maintained inside the algorithm, its version space. == The version space algorithm == In settings where there is a generality-ordering on hypotheses, it is possible to represent the version space by two sets of hypotheses: (1) the most specific consistent hypotheses, and (2) the most general consistent hypotheses, where "consistent" indicates agreement with observed data. The most specific hypotheses (i.e., the specific boundary SB) cover the observed positive training examples, and as little of the remaining feature space as possible. These hypotheses, if reduced any further, exclude a positive training example, and hence become inconsistent. These minimal hypotheses essentially constitute a (pessimistic) claim that the true concept is defined just by the positive data already observed: Thus, if a novel (never-before-seen) data point is observed, it should be assumed to be negative. (I.e., if data has not previously been ruled in, then it's ruled out.) The most general hypotheses (i.e., the general boundary GB) cover the observed positive training examples, but also cover as much of the remaining feature space without including any negative training examples. These, if enlarged any further, include a negative training example, and hence become inconsistent. These maximal hypotheses essentially constitute a (optimistic) claim that the true concept is defined just by the negative data already observed: Thus, if a novel (never-before-seen) data point is observed, it should be assumed to be positive. (I.e., if data has not previously been ruled out, then it's ruled in.) Thus, during learning, the version space (which itself is a set – possibly infinite – containing all consistent hypotheses) can be represented by just its lower and upper bounds (maximally general and maximally specific hypothesis sets), and learning operations can be performed just on these representative sets. After learning, classification can be performed on unseen examples by testing the hypothesis learned by the algorithm. If the example is consistent with multiple hypotheses, a majority vote rule can be applied. == Historical background == The notion of version spaces was introduced by Mitchell in the early 1980s as a framework for understanding the basic problem of supervised learning within the context of solution search. Although the basic "candidate elimination" search method that accompanies the version space framework is not a popular learning algorithm, there are some practical implementations that have been developed (e.g., Sverdlik & Reynolds 1992, Hong & Tsang 1997, Dubois & Quafafou 2002). A major drawback of version space learning is its inability to deal with noise: any pair of inconsistent examples can cause the version space to collapse, i.e., become empty, so that classification becomes impossible. One solution of this problem is proposed by Dubois and Quafafou that proposed the Rough Version Space, where rough sets based approximations are used to learn certain and possible hypothesis in the presence of inconsistent data.

List of text mining software

Text mining computer programs are available from many commercial and open source companies and sources. == Commercial == Angoss – Angoss Text Analytics provides entity and theme extraction, topic categorization, sentiment analysis and document summarization capabilities via the embedded AUTINDEX – is a commercial text mining software package based on sophisticated linguistics by IAI (Institute for Applied Information Sciences), Saarbrücken. DigitalMR – social media listening & text+image analytics tool for market research. FICO Score – leading provider of analytics. General Sentiment – Social Intelligence platform that uses natural language processing to discover affinities between the fans of brands with the fans of traditional television shows in social media. Stand alone text analytics to capture social knowledge base on billions of topics stored to 2004. IBM LanguageWare – the IBM suite for text analytics (tools and Runtime). IBM SPSS – provider of Modeler Premium (previously called IBM SPSS Modeler and IBM SPSS Text Analytics), which contains advanced NLP-based text analysis capabilities (multi-lingual sentiment, event and fact extraction), that can be used in conjunction with Predictive Modeling. Text Analytics for Surveys provides the ability to categorize survey responses using NLP-based capabilities for further analysis or reporting. Inxight – provider of text analytics, search, and unstructured visualization technologies. (Inxight was bought by Business Objects that was bought by SAP AG in 2008). Language Computer Corporation – text extraction and analysis tools, available in multiple languages. Lexalytics – provider of a text analytics engine used in Social Media Monitoring, Voice of Customer, Survey Analysis, and other applications. Salience Engine. The software provides the unique capability of merging the output of unstructured, text-based analysis with structured data to provide additional predictive variables for improved predictive models and association analysis. Linguamatics – provider of natural language processing (NLP) based enterprise text mining and text analytics software, I2E, for high-value knowledge discovery and decision support. Mathematica – provides built in tools for text alignment, pattern matching, clustering and semantic analysis. See Wolfram Language, the programming language of Mathematica. MATLAB offers Text Analytics Toolbox for importing text data, converting it to numeric form for use in machine and deep learning, sentiment analysis and classification tasks. Medallia – offers one system of record for survey, social, text, written and online feedback. NetMiner – software for network analysis and text mining. Supports social media and bibliographic data collection, NLP for english and chinese, sentiment analysis, work co-occurrence network(text network analysis) and visualization. NetOwl – suite of multilingual text and entity analytics products, including entity extraction, link and event extraction, sentiment analysis, geotagging, name translation, name matching, and identity resolution, among others. PolyAnalyst - text analytics environment. PoolParty Semantic Suite - graph-based text mining platform. RapidMiner with its Text Processing Extension – data and text mining software. SAS – SAS Text Miner and Teragram; commercial text analytics, natural language processing, and taxonomy software used for Information Management. Sketch Engine – a corpus manager and analysis software which providing creating text corpora from uploaded texts or the Web including part-of-speech tagging and lemmatization or detecting a particular website. Sysomos – provider social media analytics software platform, including text analytics and sentiment analysis on online consumer conversations. WordStat – Content analysis and text mining add-on module of QDA Miner for analyzing large amounts of text data. == Open source == Carrot2 – text and search results clustering framework. GATE – general Architecture for Text Engineering, an open-source toolbox for natural language processing and language engineering. Gensim – large-scale topic modelling and extraction of semantic information from unstructured text (Python). KH Coder – for Quantitative Content Analysis or Text Mining The KNIME Text Processing extension. Natural Language Toolkit (NLTK) – a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language. OpenNLP – natural language processing. Orange with its text mining add-on. The PLOS Text Mining Collection. The programming language R provides a framework for text mining applications in the package tm. The Natural Language Processing task view contains tm and other text mining library packages. spaCy – open-source Natural Language Processing library for Python Stanbol – an open source text mining engine targeted at semantic content management. Voyant Tools – a web-based text analysis environment, created as a scholarly project.