AI Image Generation Tools

AI Image Generation Tools — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Braina

    Braina

    Braina is a virtual assistant and speech-to-text dictation application for Microsoft Windows developed by Brainasoft. Braina uses natural language interface, speech synthesis, and speech recognition technology to interact with its users and allows them to use natural language sentences to perform various tasks on a computer. The name Braina is a short form of "Brain Artificial". Braina is marketed as a Microsoft Copilot alternative. It provides a voice interface for several locally run and cloud large language models, including the latest LLMs from providers such as OpenAI, Anthropic, Google, xAI, Meta, Mistral, etc; while improving data privacy. Braina also allows responses from its in-house large language models like Braina Swift and Braina Pinnacle. It has an "Artificial Brain" feature that provides persistent memory support for supported LLMs. == Features == Braina provides is able to carry out various tasks on a computer, including automation. Braina can take commands inputted through typing or through dictation to store reminders, find information online, perform mathematical operations, open files, generate images from text, transcribe speech, and control open windows or programs. Braina adapts to user behavior over time with a goal of better anticipating needs. === Speech-to-text dictation === Braina Pro can type spoken words into an active window at the location of a user's cursor. Its speech recognition technology supports more than 100 languages and dialects and is able to isolate the recognition of a user's voice from disturbing environmental factors such as background noise, other human voices, or external devices. Braina can also be taught to dictate uncommon legal, medical, and scientific terms. Users can also teach Braina uncommon names and vocabulary. Users can edit or correct dictated text without using a keyboard or mouse by giving built-in voice commands. === Text-to-speech === Braina can read aloud selected texts, such as e-books. === Custom commands and automation === Braina can automate computer tasks. It lets users create custom voice commands to perform tasks such as opening files, programs, websites, or emails, as well as executing keyboard or mouse macros. === Transcription === Braina can transcribe media file formats such as WAV, MP3, and MP4 into text. === Notes and reminders === Braina can store and recall notes and reminders. These can include scheduled or unscheduled commands, checklist items, alarms, chat conversations, memos, website snippets, bookmarks, contacts. === Image and Video generation === Braina can generate AI images and videos from text and image inputs using generative cloud AI models. These include Black Forest Labs' FLUX.2, Google's Veo, Imagen, and Nano Banana Pro, Kuaishou's Kling, Alibaba's Wan, ByteDance's Seedance and Seedream, MiniMax's Hailuo, OpenAI's GPT Image, and Tongyi Lab's Z Image Turbo. == Platforms == In addition to the desktop version for Windows operating systems, Braina is also available for the iOS and Android operating systems. The mobile version of Braina has a feature allowing remote management of a Windows PC connected via Wi-Fi. == Distributions == Braina is distributed in multiple modes. These include Braina Lite, a freeware version with limitations, and premium versions Braina Pro, Pro Plus, and Pro Ultra. Some additional features in the Pro version include dictation, custom vocabulary, video transcription, automation, custom voice commands, and persistent LLM memory. == Reception == TechRadar has consistently listed Braina as one of the best dictation and virtual assistant apps between 2015 and 2024.

    Read more →
  • Scale-invariant feature operator

    Scale-invariant feature operator

    In the fields of computer vision and image analysis, the scale-invariant feature operator (or SFOP) is an algorithm to detect local features in images. The algorithm was published by Förstner et al. in 2009. == Algorithm == The scale-invariant feature operator (SFOP) is based on two theoretical concepts: spiral model feature operator Desired properties of keypoint detectors: Invariance and repeatability for object recognition Accuracy to support camera calibration Interpretability: Especially corners and circles, should be part of the detected keypoints (see figure). As few control parameters as possible with clear semantics Complementarity to known detectors scale-invariant corner/circle detector. == Theory == === Maximize the weight === Maximize the weight w {\displaystyle w} = 1/variance of a point p {\displaystyle p} w ( p , α , τ , σ ) = ( N ( σ ) − 2 ) λ m i n ( M ( p , α , τ , σ ) ) Ω ( p , α , τ , σ ) {\displaystyle w(\mathbf {p} ,\alpha ,\tau ,\sigma )=\left(N(\sigma )-2\right){\frac {\lambda _{min}(M(\mathbf {p} ,\alpha ,\tau ,\sigma ))}{\Omega (\mathbf {p} ,\alpha ,\tau ,\sigma )}}} comprising: 1. the image model Ω ( p , α , τ , σ ) = ∑ n = 1 N ( σ ) [ ( q n − p ) T R α ∇ T g ( q n ) ] 2 G σ ( q n − p ) = N ( σ ) t r { R α ∇ τ ∇ τ T R α T ∗ p p T G σ ( p ) } {\displaystyle {\begin{aligned}\Omega (\mathbf {p} ,\alpha ,\tau ,\sigma )&=\sum _{n=1}^{N(\sigma )}[(\mathbf {q} _{n}-\mathbf {p} )^{T}\mathbf {R} _{\alpha }\mathbf {\nabla } _{T}g(\mathbf {q} _{n})]^{2}G_{\sigma }(\mathbf {q} _{n}-\mathbf {p} )\\&=N(\sigma )\mathbf {tr} \left\{R_{\alpha }\mathbf {\nabla } _{\tau }\mathbf {\nabla } _{\tau }^{T}R_{\alpha }^{T}\mathbf {p} \mathbf {p} ^{T}G_{\sigma }(\mathbf {p} )\right\}\end{aligned}}} 2. the smaller eigenvalue of the structure tensor M ( p , α , τ , σ ) ⏟ structure tensor = G σ ( p ) ⏟ weighted summation ∗ ( R σ ∇ τ ∇ τ T R σ T ) ⏟ squared rotated gradients {\displaystyle \underbrace {M(\mathbf {p} ,\alpha ,\tau ,\sigma )} _{\text{structure tensor}}=\underbrace {G_{\sigma }(\mathbf {p} )} _{\text{weighted summation}}\underbrace {(R_{\sigma }\nabla _{\tau }\nabla _{\tau }^{T}R_{\sigma }^{T})} _{\text{squared rotated gradients}}} === Reduce the search space === Reduce the 5-dimensional search space by linking the differentiation scale τ {\displaystyle \tau } to the integration scale τ = σ / 3 {\displaystyle \tau =\sigma /3} solving for the optimal α ^ {\displaystyle {\hat {\alpha }}} using the model Ω ( α ) = a − b cos ⁡ ( 2 α − 2 α 0 ) {\displaystyle \Omega (\alpha )=a-b\cos(2\alpha -2\alpha _{0})} and determining the parameters from three angles, e. g. Ω ( 0 ∘ ) , Ω ( 60 ∘ ) , Ω ( 120 ∘ ) → a , b , α 0 → α ^ {\displaystyle \Omega (0^{\circ }),\Omega (60^{\circ }),\Omega (120^{\circ })\quad \rightarrow \quad a,b,\alpha _{0}\quad \rightarrow \quad {\hat {\alpha }}} pre-selection possible: α = 0 ∘ → junctions , α = 90 ∘ → circular features {\displaystyle \alpha =0^{\circ }\,\rightarrow \,{\mbox{junctions}},\quad \alpha =90^{\circ }\,\rightarrow \,{\mbox{circular features}}} === Filter potential keypoints === non-maxima suppression over scale, space and angle thresholding the isotropy λ 2 ( M ) {\displaystyle \lambda _{2(M)}} :eigenvalues characterize the shape of the keypoint, smallest eigenvalue has to be larger than threshold T λ {\displaystyle T_{\lambda }} derived from noise variance V ( n ) {\displaystyle V(n)} and significance level S {\displaystyle S} : T λ ( V ( n ) , τ , σ , S ) = N ( σ ) 16 π τ 4 V ( n ) χ 2 , S 2 {\displaystyle T_{\lambda }(V(n),\tau ,\sigma ,S)={\frac {N(\sigma )}{16\pi \tau ^{4}}}V(n)\chi _{2,S}^{2}} == Algorithm == == Results == === Interpretability of SFOP keypoints ===

    Read more →
  • Abess

    Abess

    abess (Adaptive Best Subset Selection, also ABESS) is a machine learning method designed to address the problem of best subset selection. It aims to determine which features or variables are crucial for optimal model performance when provided with a dataset and a prediction task. abess was introduced by Zhu in 2020 and it dynamically selects the appropriate model size adaptively, eliminating the need for selecting regularization parameters. abess is applicable in various statistical and machine learning tasks, including linear regression, the Single-index model, and other common predictive models. abess can also be applied in biostatistics. == Basic Form == The basic form of abess is employed to address the optimal subset selection problem in general linear regression. abess is an l 0 {\displaystyle l_{0}} method, it is characterized by its polynomial time complexity and the property of providing both unbiased and consistent estimates. In the context of linear regression, assuming we have knowledge of n {\displaystyle n} independent samples ( x i , y i ) , i = 1 , … , n {\displaystyle (x_{i},y_{i}),i=1,\ldots ,n} , where x i ∈ R p × 1 {\displaystyle x_{i}\in \mathbb {R} ^{p\times 1}} and y i ∈ R {\displaystyle y_{i}\in \mathbb {R} } , we define X = ( x 1 , … , x n ) ⊤ {\displaystyle X=(x_{1},\ldots ,x_{n})^{\top }} and y = ( y 1 , … , y n ) ⊤ {\displaystyle y=(y_{1},\ldots ,y_{n})^{\top }} . The following equation represents the general linear regression model: y = X β + ε . {\displaystyle y=X\beta +\varepsilon .} To obtain appropriate parameters β {\displaystyle \beta } , one can consider the loss function for linear regression: L n LR ( β ; X , y ) = 1 2 n ‖ y − X β ‖ 2 2 . {\displaystyle {\mathcal {L}}_{n}^{\text{LR}}(\beta ;X,y)={\frac {1}{2n}}\|y-X\beta \|_{2}^{2}.} In abess, the initial focus is on optimizing the loss function under the l 0 {\displaystyle l_{0}} constraint. That is, we consider the following problem: min β ∈ R p × 1 L n LR ( β ; X , y ) , subject to ‖ β ‖ 0 ≤ s , {\displaystyle \min _{\beta \in \mathbb {R} ^{p\times 1}}{\mathcal {L}}_{n}^{\text{LR}}(\beta ;X,y),{\text{ subject to }}\|\beta \|_{0}\leq s,} where s {\displaystyle s} represents the desired size of the support set, and ‖ β ‖ 0 = ∑ i = 1 p I ( β i ≠ 0 ) {\displaystyle \|\beta \|_{0}=\sum _{i=1}^{p}{\mathcal {I}}_{(\beta _{i}\neq 0)}} is the l 0 {\displaystyle l_{0}} norm of the vector. To address the optimization problem described above, abess iteratively exchanges an equal number of variables between the active set and the inactive set. In each iteration, the concept of sacrifice is introduced as follows: For j in the active set ( j ∈ A ^ {\displaystyle j\in {\hat {\mathcal {A}}}} ): ξ j = L n LR ( β ^ A ∖ { j } ) − L n LR ( β ^ A ) = X j ⊤ X j 2 n ( β ^ j ) 2 {\displaystyle \xi _{j}={\mathcal {L}}_{n}^{\text{LR}}\left({\hat {\boldsymbol {\beta }}}^{{\mathcal {A}}\backslash \{j\}}\right)-{\mathcal {L}}_{n}^{\text{LR}}\left({\hat {\boldsymbol {\beta }}}^{\mathcal {A}}\right)={\frac {{\boldsymbol {X}}_{j}^{\top }{\boldsymbol {X}}_{j}}{2n}}\left({\hat {\beta }}_{j}\right)^{2}} For j in the inactive set ( j ∉ A ^ {\displaystyle j\notin {\hat {\mathcal {A}}}} ): ξ j = L n LR ( β ^ A ) − L n LR ( β ^ A + t ^ { j } ) = X j ⊤ X j 2 n ( d ^ j X j ⊤ X j / n ) 2 {\displaystyle \xi _{j}={\mathcal {L}}_{n}^{\text{LR}}\left({\hat {\boldsymbol {\beta }}}^{\mathcal {A}}\right)-{\mathcal {L}}_{n}^{\text{LR}}\left({\hat {\boldsymbol {\beta }}}^{\mathcal {A}}+{\hat {\boldsymbol {t}}}^{\{j\}}\right)={\frac {{\boldsymbol {X}}_{j}^{\top }{\boldsymbol {X}}_{j}}{2n}}\left({\frac {{\hat {\mathrm {d} }}_{j}}{{\boldsymbol {X}}_{j}^{\top }{\boldsymbol {X}}_{j}/n}}\right)^{2}} Here are the key elements in the above equations: β ^ A {\displaystyle {\hat {\beta }}^{\mathcal {A}}} : This represents the estimate of β {\displaystyle \beta } obtained in the previous iteration. A ^ {\displaystyle {\hat {\mathcal {A}}}} : It denotes the estimated active set from the previous iteration. β ^ A ∖ { j } {\displaystyle {\hat {\boldsymbol {\beta }}}^{{\mathcal {A}}\backslash \{j\}}} : This is a vector where the j-th element is set to 0, while the other elements are the same as β ^ A {\displaystyle {\hat {\beta }}^{\mathcal {A}}} . t ^ { j } = arg ⁡ min t L n LR ( β ^ A + t { j } ) {\displaystyle {\hat {\boldsymbol {t}}}^{\{j\}}=\arg \min _{t}{\mathcal {L}}_{n}^{\text{LR}}\left({\hat {\boldsymbol {\beta }}}^{\mathcal {A}}+{\boldsymbol {t}}^{\{j\}}\right)} : Here, t { j } {\displaystyle t^{\{j\}}} represents a vector where all elements are 0 except the j-th element. d ^ j = X j ⊤ ( y − X β ^ ) / n {\displaystyle {\hat {d}}_{j}={\boldsymbol {X}}_{j}^{\top }({\boldsymbol {y}}-{\boldsymbol {X}}{\hat {\boldsymbol {\beta }}})/n} : This is calculated based on the equation mentioned. The iterative process involves exchanging variables, with the aim of minimizing the sacrifices in the active set while maximizing the sacrifices in the inactive set during each iteration. This approach allows abess to efficiently search for the optimal feature subset. In abess, select an appropriate s max {\displaystyle s_{\max }} and optimize the above problem for active sets size s = 1 , … , s max {\displaystyle s=1,\ldots ,s_{\max }} using the information criterion GIC = n log ⁡ L n LR + s log ⁡ p log ⁡ log ⁡ n , {\displaystyle {\text{GIC}}=n\log {\mathcal {L}}_{n}^{\text{LR}}+s\log p\log \log n,} to adaptively choose the appropriate active set size s {\displaystyle s} and obtain its corresponding abess estimator. == Generalizations == The splicing algorithm in abess can be employed for subset selection in other models. === Distribution-Free Location-Scale Regression === In 2023, Siegfried extends abess to the case of Distribution-Free and Location-Scale. Specifically, it considers the optimization problem max ϑ ∈ R P , β ∈ R J , γ ∈ R J ∑ i = 1 N ℓ i ( ϑ , x i ⊤ β , exp ⁡ ( x i ⊤ γ ) − 1 ) , {\displaystyle \max _{{\boldsymbol {\vartheta }}\in \mathbb {R} ^{P},{\boldsymbol {\beta }}\in \mathbb {R} ^{J},{\boldsymbol {\gamma }}\in \mathbb {R} ^{J}}\sum _{i=1}^{N}\ell _{i}\left({\boldsymbol {\vartheta }},{\boldsymbol {x}}_{i}^{\top }{\boldsymbol {\beta }},{\sqrt {\exp \left({\boldsymbol {x}}_{i}^{\top }{\boldsymbol {\gamma }}\right)}}^{-1}\right),} subject to ‖ ( β ⊤ , γ ⊤ ) ⊤ ‖ 0 ≤ s , {\displaystyle \left\|\left({\boldsymbol {\beta }}^{\top },{\boldsymbol {\gamma }}^{\top }\right)^{\top }\right\|_{0}\leq s,} where ℓ i {\displaystyle \ell _{i}} is a loss function, ϑ {\displaystyle {\boldsymbol {\vartheta }}} is a parameter vector, β {\displaystyle {\boldsymbol {\beta }}} and γ {\displaystyle {\boldsymbol {\gamma }}} are vectors, and x i {\displaystyle {\boldsymbol {x}}_{i}} is a data vector. This approach, demonstrated across various applications, enables parsimonious regression modeling for arbitrary outcomes while maintaining interpretability through innovative subset selection procedures. === Groups Selection === In 2023, Zhang applied the splicing algorithm to group selection, optimizing the following model: min β ∈ R p L n LR ( β ; X , y ) subject to ∑ j = 1 J I ( ‖ β G j ‖ 2 ≠ 0 ) ≤ s {\displaystyle \min _{{\boldsymbol {\beta }}\in \mathbb {R} ^{p}}{\mathcal {L}}_{n}^{\text{LR}}(\beta ;X,y){\text{ subject to }}\sum _{j=1}^{J}I\left(\|{\boldsymbol {\beta }}_{G_{j}}\|_{2}\neq 0\right)\leq s} Here are the symbols involved: J {\displaystyle J} : Total number of feature groups, representing the existence of J {\displaystyle J} non-overlapping feature groups in the dataset. G j {\displaystyle G_{j}} : Index set for the j {\displaystyle j} -th feature group, where j {\displaystyle j} ranges from 1 to J {\displaystyle J} , representing the feature grouping structure in the data. s {\displaystyle s} : Model size, a positive integer determined from the data, limiting the number of selected feature groups. === Regression with Corrupted Data === Zhang applied the splicing algorithm to handle corrupted data. Corrupted data refers to information that has been disrupted or contains errors during the data collection or recording process. This interference may include sensor inaccuracies, recording errors, communication issues, or other external disturbances, leading to inaccurate or distorted observations within the dataset. === Single Index Models === In 2023, Tang applied the splicing algorithm to optimal subset selection in the Single-index model. The form of the Single Index Model (SIM) is given by y i = g ( b ⊤ x i , e i ) , i = 1 , … , n , {\displaystyle y_{i}=g({\boldsymbol {b}}^{\top }{\boldsymbol {x}}_{i},e_{i}),\quad i=1,\ldots ,n,} where b {\displaystyle {\boldsymbol {b}}} is the parameter vector, e i {\displaystyle e_{i}} is the error term. The corresponding loss function is defined as l n ( β ) = ∑ i = 1 n ( r i n − 1 2 − x i ⊤ β ) 2 , {\displaystyle l_{n}({\boldsymbol {\beta }})=\sum _{i=1}^{n}\left({\frac {r_{i}}{n}}-{\frac {1}{2}}-{\boldsymbol {x}}_{i}^{\top }{\boldsymbol {\beta }}\right)^{2},} where r {\disp

    Read more →
  • Word2vec

    Word2vec

    Word2vec is a technique in natural language processing for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. Word2vec was developed by Tomáš Mikolov, Kai Chen, Greg Corrado, Ilya Sutskever and Jeff Dean at Google, and published in 2013. Word2vec represents a word as a high-dimension vector of numbers which capture relationships between words. In particular, words which appear in similar contexts are mapped to vectors which are nearby as measured by cosine similarity. This indicates the level of semantic similarity between the words, so for example the vectors for walk and ran are nearby, as are those for "but" and "however", and "Berlin" and "Germany". == Approach == Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a mapping of the set of words to a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a vector in the space. Word2vec can use either of two model architectures to produce these distributed representations of words: continuous bag of words (CBOW) or continuously sliding skip-gram. In both architectures, word2vec considers both individual words and a sliding context window as it iterates over the corpus. The CBOW can be viewed as a 'fill in the blank' task, where the word embedding represents the way the word influences the relative probabilities of other words in the context window. Words which are semantically similar should influence these probabilities in similar ways, because semantically similar words should be used in similar contexts. The order of context words does not influence prediction (bag of words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors' note, CBOW is faster while skip-gram does a better job for infrequent words. After the model is trained, the learned word embeddings are positioned in the vector space such that words that share common contexts in the corpus — that is, words that are semantically and syntactically similar — are located close to one another in the space. More dissimilar words are located farther from one another in the space. == Mathematical details == This section is based on expositions. A corpus is a sequence of words. Both CBOW and skip-gram are methods to learn one vector per word appearing in the corpus. Let V {\displaystyle V} ("vocabulary") be the set of all words appearing in the corpus C {\displaystyle C} . Our goal is to learn one vector v w ∈ R d {\displaystyle v_{w}\in \mathbb {R} ^{d}} for each word w ∈ V {\displaystyle w\in V} . The idea of skip-gram is that the vector of a word should be close to the vector of each of its neighbors. The idea of CBOW is that the vector-sum of a word's neighbors should be close to the vector of the word. === Continuous bag-of-words (CBOW) === The idea of CBOW is to represent each word with a vector, such that it is possible to predict a word using the sum of the vectors of its neighbors. Specifically, for each word w i {\displaystyle w_{i}} in the corpus, the one-hot encoding of the word is used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of w i {\displaystyle w_{i}} . The objective of training is to maximize ∑ i ln ⁡ Pr ( w i ∣ w i + j : j ∈ N ) {\displaystyle \sum _{i}\ln \Pr(w_{i}\mid w_{i+j}\colon j\in N)} where N {\displaystyle N} is a set of (non-zero) indices representing the relative locations of nearby words considered to be in w i {\displaystyle w_{i}} 's neighborhood. For example, if we want each word in the corpus to be predicted by every other word in a small span of 4 words. The set of relative indexes of neighbor words will be: N = { − 2 , − 1 , + 1 , + 2 } {\displaystyle N=\{-2,-1,+1,+2\}} , and the objective is to maximize ∑ i ln ⁡ Pr ( w i ∣ w i − 2 , w i − 1 , w i + 1 , w i + 2 ) {\displaystyle \sum _{i}\ln \Pr(w_{i}\mid w_{i-2},w_{i-1},w_{i+1},w_{i+2})} . In standard bag-of-words, a word's context is represented by a word-count (aka a word histogram) of its neighboring words. For example, the "sat" in "the cat sat on the mat" is represented as {"the": 2, "cat": 1, "on": 1}. Note that the last word "mat" is not used to represent "sat", because it is outside the neighborhood N = { − 2 , − 1 , + 1 , + 2 } {\displaystyle N=\{-2,-1,+1,+2\}} . In continuous bag-of-words, the histogram is multiplied by a matrix V {\displaystyle V} to obtain a continuous representation of the word's context. The matrix V {\displaystyle V} is also called a dictionary. Its columns are the word vectors. It has D {\displaystyle D} columns, where D {\displaystyle D} is the size of the dictionary. Let d {\displaystyle d} be the length of each word vector. We have V ∈ R d × D {\displaystyle V\in \mathbb {R} ^{d\times D}} . For example, multiplying the word histogram {"the": 2, "cat": 1, "on": 1} with V {\displaystyle V} , we obtain 2 v the + v cat + v on {\displaystyle 2v_{\text{the}}+v_{\text{cat}}+v_{\text{on}}} . This is then multiplied with another matrix V ′ {\displaystyle V'} of shape R D × d {\displaystyle \mathbb {R} ^{D\times d}} . Each row of it is a word vector v ′ {\displaystyle v'} . This results in a vector of length D {\displaystyle D} , one entry per dictionary entry. Then, apply the softmax to obtain a probability distribution over the dictionary. This system can be visualized as a neural network, similar in spirit to an autoencoder, of architecture linear-linear-softmax, as depicted in the diagram. The system is trained by gradient descent to minimize the cross-entropy loss. In full formula, the cross-entropy loss is: − ∑ i ln ⁡ e v w i ′ ⋅ ( ∑ j ∈ N v w j + i ) ∑ w ′ e v w ′ ′ ⋅ ( ∑ j ∈ N v w j + i ) {\displaystyle -\sum _{i}\ln {\frac {e^{v_{w_{i}}'\cdot (\sum _{j\in N}v_{w_{j+i}})}}{\sum _{w'}e^{v_{w'}'\cdot (\sum _{j\in N}v_{w_{j+i}})}}}} where the outer summation ∑ i {\displaystyle \sum _{i}} is over the words in a corpus, the quantity ∑ j ∈ N v w j + i {\displaystyle \sum _{j\in N}v_{w_{j+i}}} is the sum of a word's neighbors' vectors, etc. Once such a system is trained, we have two trained matrices V , V ′ {\displaystyle V,V'} . Either the column vectors of V {\displaystyle V} or the row vectors of V ′ {\displaystyle V'} can serve as the dictionary. For example, the word "sat" can be represented as either the "sat"-th column of V {\displaystyle V} or the "sat"-th row of V ′ {\displaystyle V'} . It is also possible to simply define V ′ = V ⊤ {\displaystyle V'=V^{\top }} , in which case there would no longer be a choice. === Skip-gram === The idea of skip-gram is to represent each word with a vector, such that it is possible to predict the vectors of its neighbors using the vector of a word. The architecture is still linear-linear-softmax, the same as CBOW, but the input and the output are switched. Specifically, for each word w i {\displaystyle w_{i}} in the corpus, the one-hot encoding of the word is used as the input to the neural network. The output of the neural network is a probability distribution over the dictionary, representing a prediction of individual words in the neighborhood of w i {\displaystyle w_{i}} . The objective of training is to maximize ∑ i ∑ j ∈ N ln ⁡ Pr ( w j + i ∣ w i ) {\displaystyle \sum _{i}\sum _{j\in N}\ln \Pr(w_{j+i}\mid w_{i})} . In full formula, the loss function is − ∑ i ∑ j ∈ N ln ⁡ e v w j + i ′ ⋅ v w i ∑ w ′ e v w ′ ′ ⋅ v w i {\displaystyle -\sum _{i}\sum _{j\in N}\ln {\frac {e^{v_{w_{j+i}}'\cdot v_{w_{i}}}}{\sum _{w'}e^{v_{w'}'\cdot v_{w_{i}}}}}} Same as CBOW, once such a system is trained, we have two trained matrices V , V ′ {\displaystyle V,V'} . Either the column vectors of V {\displaystyle V} or the row vectors of V ′ {\displaystyle V'} can serve as the dictionary. It is also possible to simply define V ′ = V ⊤ {\displaystyle V'=V^{\top }} , in which case there would no longer be a choice. Essentially, skip-gram and CBOW are exactly the same in architecture. They only differ in the objective function during training. == History == During the 1980s, there were some early attempts at using neural networks to represent words and concepts as vectors. In 2010, Tomáš Mikolov (then at Brno University of Technology) with co-authors applied a simple recurrent neural network with a single hidden

    Read more →
  • Language Server Protocol

    Language Server Protocol

    The Language Server Protocol (LSP) is an open, JSON-RPC-based protocol for use between source-code editors or integrated development environments (IDEs) and servers that provide "language intelligence tools": programming language-specific features like code completion, syntax highlighting and marking of warnings and errors, as well as refactoring routines. The goal of the protocol is to allow programming language support to be implemented and distributed independently of any given editor or IDE. In the early 2020s, LSP quickly became a "norm" for language intelligence tools providers. == History == LSP was originally developed for Microsoft Visual Studio Code and is now an open standard. On June 27, 2016, Microsoft announced a collaboration with Red Hat and Codenvy to standardize the protocol's specification. Its specification is hosted and developed on GitHub. == Background == Modern IDEs provide programmers with sophisticated features like code completion, refactoring, navigating to a symbol's definition, syntax highlighting, and error and warning markers. For example, in a text-based programming language, a programmer might want to rename a method read. The programmer could either manually edit the respective source code files and change the appropriate occurrences of the old method name into the new name, or instead use an IDE's refactoring capabilities to make all the necessary changes automatically. To be able to support this style of refactoring, an IDE needs a sophisticated understanding of the programming language that the program's source is written in. A programming tool without such an understanding—for example, one that performs a naive search-and-replace instead—could introduce errors. When renaming a read method, for example, the tool should not replace the partial match in a variable that might be called readyState, nor should it replace the portion of a code comment containing the word "already". Neither should renaming a local variable read, for example, end up altering identically-named variables in other scopes. Conventional compilers or interpreters for a specific programming language are typically unable to provide these language services, because they are written with the goal of either transforming the source code into object code or immediately executing the code. Additionally, language services must be able to handle source code that is not well-formed, e.g. because the programmer is in the middle of editing and has not yet finished typing a statement, procedure, or other construct. Additionally, small changes to a source code file which are done during typing usually change the semantics of the program. In order to provide instant feedback to the user, the editing tool must be able to very quickly evaluate the syntactical and semantical consequences of a specific modification. Compilers and interpreters therefore provide a poor candidate for producing the information needed for an editing tool to consume. Prior to the design and implementation of the Language Server Protocol for the development of Visual Studio Code, most language services were generally tied to a given IDE or other editor. In the absence of the Language Server Protocol, language services are typically implemented by using a tool-specific extension API. Providing the same language service to another editing tool requires effort to adapt the existing code so that the service may target the second editor's extension interfaces. The Language Server Protocol allows for decoupling language services from the editor so that the services may be contained within a general-purpose language server. Any editor can inherit sophisticated support for many different languages by making use of existing language servers. Similarly, a programmer involved with the development of a new programming language can make services for that language available to existing editing tools. Making use of language servers via the Language Server Protocol thus also reduces the burden on vendors of editing tools, because vendors do not need to develop language services of their own for the languages the vendor intends to support, as long as the language servers have already been implemented. The Language Server Protocol also enables the distribution and development of servers contributed by an interested third party, such as end users, without additional involvement by either the vendor of the compiler for the programming language in use or the vendor of the editor to which the language support is being added. LSP is not restricted to programming languages. It can be used for any kind of text-based language, like specifications or domain-specific languages (DSL). == Technical overview == When a user edits one or more source code files using a language server protocol-enabled tool, the tool acts as a client that consumes the language services provided by a language server. The tool may be a text editor or IDE and the language services could be refactoring, code completion, etc. The client informs the server about what the user is doing, e.g., opening a file or inserting a character at a specific text position. The client can also request the server to perform a language service, e.g. to format a specified range in the text document. The server answers a client's request with an appropriate response. For example, the formatting request is answered either by a response that transfers the formatted text to the client or by an error response containing details about the error. The Language Server Protocol defines the messages to be exchanged between client and language server. They are JSON-RPC preceded by headers similar to HTTP. Messages may originate from the server or client. The protocol does not make any provisions about how requests, responses and notifications are transferred between client and server. For example, client and server could be components within the same process exchanging JSON strings via method calls. They could also be different processes on the same or on different machines communicating via network sockets. == Registry == There are lists of LSP-compatible implementations, maintained by the community-driven Langserver.org or Microsoft.

    Read more →
  • List of text mining software

    List of text mining software

    Text mining computer programs are available from many commercial and open source companies and sources. == Commercial == Angoss – Angoss Text Analytics provides entity and theme extraction, topic categorization, sentiment analysis and document summarization capabilities via the embedded AUTINDEX – is a commercial text mining software package based on sophisticated linguistics by IAI (Institute for Applied Information Sciences), Saarbrücken. DigitalMR – social media listening & text+image analytics tool for market research. FICO Score – leading provider of analytics. General Sentiment – Social Intelligence platform that uses natural language processing to discover affinities between the fans of brands with the fans of traditional television shows in social media. Stand alone text analytics to capture social knowledge base on billions of topics stored to 2004. IBM LanguageWare – the IBM suite for text analytics (tools and Runtime). IBM SPSS – provider of Modeler Premium (previously called IBM SPSS Modeler and IBM SPSS Text Analytics), which contains advanced NLP-based text analysis capabilities (multi-lingual sentiment, event and fact extraction), that can be used in conjunction with Predictive Modeling. Text Analytics for Surveys provides the ability to categorize survey responses using NLP-based capabilities for further analysis or reporting. Inxight – provider of text analytics, search, and unstructured visualization technologies. (Inxight was bought by Business Objects that was bought by SAP AG in 2008). Language Computer Corporation – text extraction and analysis tools, available in multiple languages. Lexalytics – provider of a text analytics engine used in Social Media Monitoring, Voice of Customer, Survey Analysis, and other applications. Salience Engine. The software provides the unique capability of merging the output of unstructured, text-based analysis with structured data to provide additional predictive variables for improved predictive models and association analysis. Linguamatics – provider of natural language processing (NLP) based enterprise text mining and text analytics software, I2E, for high-value knowledge discovery and decision support. Mathematica – provides built in tools for text alignment, pattern matching, clustering and semantic analysis. See Wolfram Language, the programming language of Mathematica. MATLAB offers Text Analytics Toolbox for importing text data, converting it to numeric form for use in machine and deep learning, sentiment analysis and classification tasks. Medallia – offers one system of record for survey, social, text, written and online feedback. NetMiner – software for network analysis and text mining. Supports social media and bibliographic data collection, NLP for english and chinese, sentiment analysis, work co-occurrence network(text network analysis) and visualization. NetOwl – suite of multilingual text and entity analytics products, including entity extraction, link and event extraction, sentiment analysis, geotagging, name translation, name matching, and identity resolution, among others. PolyAnalyst - text analytics environment. PoolParty Semantic Suite - graph-based text mining platform. RapidMiner with its Text Processing Extension – data and text mining software. SAS – SAS Text Miner and Teragram; commercial text analytics, natural language processing, and taxonomy software used for Information Management. Sketch Engine – a corpus manager and analysis software which providing creating text corpora from uploaded texts or the Web including part-of-speech tagging and lemmatization or detecting a particular website. Sysomos – provider social media analytics software platform, including text analytics and sentiment analysis on online consumer conversations. WordStat – Content analysis and text mining add-on module of QDA Miner for analyzing large amounts of text data. == Open source == Carrot2 – text and search results clustering framework. GATE – general Architecture for Text Engineering, an open-source toolbox for natural language processing and language engineering. Gensim – large-scale topic modelling and extraction of semantic information from unstructured text (Python). KH Coder – for Quantitative Content Analysis or Text Mining The KNIME Text Processing extension. Natural Language Toolkit (NLTK) – a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language. OpenNLP – natural language processing. Orange with its text mining add-on. The PLOS Text Mining Collection. The programming language R provides a framework for text mining applications in the package tm. The Natural Language Processing task view contains tm and other text mining library packages. spaCy – open-source Natural Language Processing library for Python Stanbol – an open source text mining engine targeted at semantic content management. Voyant Tools – a web-based text analysis environment, created as a scholarly project.

    Read more →
  • State–action–reward–state–action

    State–action–reward–state–action

    State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was proposed by Rummery and Niranjan in a technical note with the name "Modified Connectionist Q-Learning" (MCQ-L). The alternative name SARSA, proposed by Rich Sutton, was only mentioned as a footnote. This name reflects the fact that the main function for updating the Q-value depends on the current state of the agent "S1", the action the agent chooses "A1", the reward "R2" the agent gets for choosing this action, the state "S2" that the agent enters after taking that action, and finally the next action "A2" the agent chooses in its new state. The acronym for the quintuple (St, At, Rt+1, St+1, At+1) is SARSA. Some authors use a slightly different convention and write the quintuple (St, At, Rt, St+1, At+1), depending on which time step the reward is formally assigned. The rest of the article uses the former convention. == Algorithm == Q new ( S t , A t ) ← ( 1 − α ) Q ( S t , A t ) + α [ R t + 1 + γ Q ( S t + 1 , A t + 1 ) ] {\displaystyle Q^{\textrm {new}}(S_{t},A_{t})\leftarrow (1-\alpha )Q(S_{t},A_{t})+\alpha \,[R_{t+1}+\gamma \,Q(S_{t+1},A_{t+1})]} A SARSA agent interacts with the environment and updates the policy based on actions taken, hence this is known as an on-policy learning algorithm. The Q value for a state-action is updated by an error, adjusted by the learning rate α. Q values represent the possible reward received in the next time step for taking action a in state s, plus the discounted future reward received from the next state-action observation. Watkin's Q-learning updates an estimate of the optimal state-action value function Q ∗ {\displaystyle Q^{}} based on the maximum reward of available actions. While SARSA learns the Q values associated with taking the policy it follows itself, Watkin's Q-learning learns the Q values associated with taking the optimal policy while following an exploration/exploitation policy. Some optimizations of Watkin's Q-learning may be applied to SARSA. == Hyperparameters == === Learning rate (alpha) === The learning rate determines to what extent newly acquired information overrides old information. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information. === Discount factor (gamma) === The discount factor determines the importance of future rewards. A discount factor of 0 makes the agent "opportunistic", or "myopic", e.g., by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the Q {\displaystyle Q} values may diverge. === Initial conditions (Q(S0, A0)) === Since SARSA is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. A high (infinite) initial value, also known as "optimistic initial conditions", can encourage exploration: no matter what action takes place, the update rule causes it to have higher values than the other alternative, thus increasing their choice probability. In 2013 it was suggested that the first reward r {\displaystyle r} could be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value of Q {\displaystyle Q} . This allows immediate learning in case of fixed deterministic rewards. This resetting-of-initial-conditions (RIC) approach seems to be consistent with human behavior in repeated binary choice experiments.

    Read more →
  • Evolutionary attractor

    Evolutionary attractor

    An evolutionary attractor is a point in an evolutionary space where a selection process will always drive trait values towards that point from the region around it. Because of the importance of evolution through natural selection, often such an evolutionary space will be defined by genetic or phenotypic traits, or possibly both. In this case the selection process will be a form of natural selection. The existence of an evolutionary attractor in a biological evolutionary space does not always imply that it can be reached from all points in that evolutionary space, nor does it identify what will happen when the evolutionary attractor is reached. While an evolutionary attractor may represent a point in evolutionary space that is resistant to further selection, such as an evolutionarily stable strategy, other possibilities are available. Because identification of an evolutionary attractor on its own does not describe everything about the evolutionary space in which it lies, this has led to interest in the evolutionary dynamics surrounding evolutionary attractors and in evolutionary spaces in general. (Theoretical biologists and mathematicians working in the area may prefer the terms adaptive dynamics or evolutionary invasion analysis to evolutionary dynamics.) These fields use differential equations which allows a more complete understanding of the dynamics in evolutionary spaces including the existence or otherwise of evolutionary attractors. Advances in the study of molecular evolution have also led to the identification of evolutionary attractors at a molecular level. Because biological evolutionary processes have been studied using evolutionary game theory, a technique inspired by game theory originally derived to address economic problems, not only can evolutionary attractors be found in biology but economists studying evolutionary economic models have also identified evolutionary attractors. Evolution in biology has also inspired evolutionary computation in computer science. Many algorithms in this field use a form of selection inspired by natural selection to generate results through evolutionary algorithms. This is therefore another area in which evolutionary attractors have been identified. == Evolutionary attractors in biology == It is not probably not surprising that biology is the field where most examples of evolutionary attractors have been identified, given the importance of evolution through natural selection. === Evolutionary attractors in adaptive landscapes === An evolutionary attractor is a point in genetic and/or phenotypic trait space, that evolution will always drive trait values towards via a selection process. The concept of an evolutionary attractor arose in population genetics following the origin of the adaptive landscape originally proposed by Sewall Wright in 1932. The height of a point in an adaptive landscape is a measure of evolutionary fitness. If a point in an adaptive landscape is a peak, then selection will always drive traits towards it and it will be an evolutionary attractor. While population genetics deals with discrete genetic traits, quantitative genetics extended such concepts to deal with continuous genetic traits, where the concept of evolutionary attractor is also valid. === Evolutionary attractors in evolutionary game models === Evolutionary game theory introduced into evolutionary biology concepts originally used in economics, with the advantage that evolution could be studied in relation to strategic choices made in animal conflicts. This is of particular interest because of the concept of the evolutionarily stable strategy or ESS, a strategy that once established is resistant to invasion by other strategies. ESSs will not always be evolutionary attractors, but if they are they will persist over evolutionary time. === Dynamics around evolutionary attractors in biology === Evolutionary attractors in biology do not exist in isolation. By definition they must exist in an evolutionary trait space where selection drives all traits towards them from a region immediately around them. That is, they must be convergence stable. Eshel (1983) modified the definition of an ESS by considering individually advantageous reduction from a majority deviation: he created the term continuous stability. A continuously stable ESS can be shown to be convergence stable, therefore it will act as an evolutionary attractor. But the nature of evolutionary trait spaces in biology means that it is not possible to guarantee that the region of convergence to the evolutionary attractor covers the whole of the trait space, nor that there is only one evolutionary attractor in a particular trait space. These issues have led to the emergence of the related fields of evolutionary dynamics, adaptive dynamics and evolutionary invasion analysis, all of which use differential equations to understand the dynamics in evolutionary trait spaces. Hence, if one or more evolutionary attractor exists in an evolutionary trait space, they provide techniques to understand the dynamics in that trait space around the evolutionary attractor. === Evolutionary attractors in an ecological context === Evolution in biology does not take place in single species in isolation. Ecological interaction of species leads to coevolution. Important examples of this are host-parasite or host-pathogen interaction, which can make both the dynamics around evolutionary attractors more complex, and the occurrence and number of evolutionary attractors more diverse. Evolutionary attractors have been identified in the analysis of evolutionary epidemiology of plant pathogens. In the above study working on plant populations the authors were able to identify evolutionary attractors using methods from adaptive dynamics. A model applied to the analysis of a maize (Zea mays L.) virus identified convergence stable equilibria through simulation modelling. A related model identified evolutionary attractors in the interaction of plants with fungal pathogens. === Evolutionary attractors in molecular genetics === As mentioned above much of the consideration of evolutionary attractors in biology has been through investigation of selection at a genetic or phenotypic level or both, in a single species or in coevolving species. Advances in the study of molecular genetics now allow the study of evolutionary attractors to be taken to a molecular genetic level. Wilson et. al (2019) studied the evolution of gene regulatory networks and identified the emergence of evolutionary attractors. == Evolutionary attractors in economics == Evolutionary game theory as applied in biology was inspired by game theory originally devised for applications in economics. Game theory remains an active field of research outside of biology, and thus it is not surprising that researchers in evolutionary economics use evolutionary game theory. Evolutionary attractors have been demonstrated by economists studying the evolutionary dynamics of market entry with market dynamics based on the replicator dynamics of biological evolutionary games. == Evolutionary attractors in computing == Evolutionary computation is a branch of computer science inspired by biological evolution. Many algorithms in evolutionary computation use a form of selection. Thus evolutionary attractors have been identified in computer science as well as in biology and economics. Evolutionary algorithms have generated evolutionary attractors, probably because of the similarity between adaptive hill-climbing in evolutionary heuristics and the adaptive landscape originated to explain evolution through natural selection.

    Read more →
  • Kdan Mobile

    Kdan Mobile

    Kdan Mobile Software Limited is a software application development company based in Tainan City, Taiwan. Kdan also has branches in Taipei, Changsha, Irvine, California, Japan, and South Korea. The company was founded in 2009 by Kenny Su, the company's CEO. == History == Kdan Mobile was founded in 2009 by Kenny Su (蘇柏州) and develops an application for PDF documents. Su previously worked at the Industrial Technology Research Institute (ITRI) . In 2018, the company completed its Series B round of fundraising, in which it raised 16 million USD in total. Four global firms, Dattoz Partners (South Korea), WI Harper Group (U.S.), Taiwania Capital (Taiwan), and Golden Asia Fund Mitsubishi UFJ Capital (Japan), made up the Series B investment. Kdan previously raised 5 million USD in its Series A round in 2018.

    Read more →
  • Distributional Soft Actor Critic

    Distributional Soft Actor Critic

    Distributional Soft Actor Critic (DSAC) is a suite of model-free off-policy reinforcement learning algorithms, tailored for learning decision-making or control policies in complex systems with continuous action spaces. Distinct from traditional methods that focus solely on expected returns, DSAC algorithms are designed to learn a Gaussian distribution over stochastic returns, called value distribution. This focus on Gaussian value distribution learning notably diminishes value overestimations, which in turn boosts policy performance. Additionally, the value distribution learned by DSAC can also be used for risk-aware policy learning. From a technical standpoint, DSAC is essentially a distributional adaptation of the well-established soft actor-critic (SAC) method. To date, the DSAC family comprises two iterations: the original DSAC-v1 and its successor, DSAC-T (also known as DSAC-v2), with the latter demonstrating superior capabilities over the Soft Actor-Critic (SAC) in Mujoco benchmark tasks. The source code for DSAC-T can be found at the following URL: Jingliang-Duan/DSAC-T. Both iterations have been integrated into an advanced, Pytorch-powered reinforcement learning toolkit named GOPS: GOPS (General Optimal control Problem Solver).

    Read more →
  • Sum of absolute differences

    Sum of absolute differences

    In digital image processing, the sum of absolute differences (SAD) is a measure of the similarity between image blocks. It is calculated by taking the absolute difference between each pixel in the original block and the corresponding pixel in the block being used for comparison. These differences are summed to create a simple metric of block similarity, the L1 norm of the difference image or Manhattan distance between two image blocks. The sum of absolute differences may be used for a variety of purposes, such as object recognition, the generation of disparity maps for stereo images, and motion estimation for video compression. == Example == This example uses the sum of absolute differences to identify which part of a search image is most similar to a template image. In this example, the template image is 3 by 3 pixels in size, while the search image is 3 by 5 pixels in size. Each pixel is represented by a single integer from 0 to 9. Template Search image 2 5 5 2 7 5 8 6 4 0 7 1 7 4 2 7 7 5 9 8 4 6 8 5 There are exactly three unique locations within the search image where the template may fit: the left side of the image, the center of the image, and the right side of the image. To calculate the SAD values, the absolute value of the difference between each corresponding pair of pixels is used: the difference between 2 and 2 is 0, 4 and 1 is 3, 7 and 8 is 1, and so forth. Calculating the values of the absolute differences for each pixel, for the three possible template locations, gives the following: Left Center Right 0 2 0 5 0 3 3 3 1 3 7 3 3 4 5 0 2 0 1 1 3 3 1 1 1 3 4 For each of these three image patches, the 9 absolute differences are added together, giving SAD values of 20, 25, and 17, respectively. From these SAD values, it could be asserted that the right side of the search image is the most similar to the template image, because it has the lowest sum of absolute differences as compared to the other two locations. == Comparison to other metrics == === Object recognition === The sum of absolute differences provides a simple way to automate the searching for objects inside an image, but may be unreliable due to the effects of contextual factors such as changes in lighting, color, viewing direction, size, or shape. The SAD may be used in conjunction with other object recognition methods, such as edge detection, to improve the reliability of results. === Video compression === SAD is an extremely fast metric due to its simplicity; it is effectively the simplest possible metric that takes into account every pixel in a block. Therefore, it is very effective for a wide motion search of many different blocks. SAD is also easily parallelizable since it analyzes each pixel separately, making it easily implementable with such instructions as ARM NEON or x86 SSE2. For example, SSE has packed sum of absolute differences instruction (PSADBW) specifically for this purpose. Once candidate blocks are found, the final refinement of the motion estimation process is often done with other slower but more accurate metrics, which better take into account human perception. These include the sum of absolute transformed differences (SATD), the sum of squared differences (SSD), and rate–distortion optimization.

    Read more →
  • Error tolerance (PAC learning)

    Error tolerance (PAC learning)

    In PAC learning, error tolerance refers to the ability of an algorithm to learn when the examples received have been corrupted in some way. In fact, this is a very common and important issue since in many applications it is not possible to access noise-free data. Noise can interfere with the learning process at different levels: the algorithm may receive data that have been occasionally mislabeled, or the inputs may have some false information, or the classification of the examples may have been maliciously adulterated. == Notation and the Valiant learning model == In the following, let X {\displaystyle X} be our n {\displaystyle n} -dimensional input space. Let H {\displaystyle {\mathcal {H}}} be a class of functions that we wish to use in order to learn a { 0 , 1 } {\displaystyle \{0,1\}} -valued target function f {\displaystyle f} defined over X {\displaystyle X} . Let D {\displaystyle {\mathcal {D}}} be the distribution of the inputs over X {\displaystyle X} . The goal of a learning algorithm A {\displaystyle {\mathcal {A}}} is to choose the best function h ∈ H {\displaystyle h\in {\mathcal {H}}} such that it minimizes e r r o r ( h ) = P x ∼ D ( h ( x ) ≠ f ( x ) ) {\displaystyle error(h)=P_{x\sim {\mathcal {D}}}(h(x)\neq f(x))} . Let us suppose we have a function s i z e ( f ) {\displaystyle size(f)} that can measure the complexity of f {\displaystyle f} . Let Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} be an oracle that, whenever called, returns an example x {\displaystyle x} and its correct label f ( x ) {\displaystyle f(x)} . When no noise corrupts the data, we can define learning in the Valiant setting: Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the Valiant setting if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} and a polynomial p ( ⋅ , ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot ,\cdot )} such that for any 0 < ε ≤ 1 {\displaystyle 0<\varepsilon \leq 1} and 0 < δ ≤ 1 {\displaystyle 0<\delta \leq 1} it outputs, in a number of calls to the oracle bounded by p ( 1 ε , 1 δ , n , size ( f ) ) {\displaystyle p\left({\frac {1}{\varepsilon }},{\frac {1}{\delta }},n,{\text{size}}(f)\right)} , a function h ∈ H {\displaystyle h\in {\mathcal {H}}} that satisfies with probability at least 1 − δ {\displaystyle 1-\delta } the condition error ( h ) ≤ ε {\displaystyle {\text{error}}(h)\leq \varepsilon } . In the following we will define learnability of f {\displaystyle f} when data have suffered some modification. == Classification noise == In the classification noise model a noise rate 0 ≤ η < 1 2 {\displaystyle 0\leq \eta <{\frac {1}{2}}} is introduced. Then, instead of Oracle ( x ) {\displaystyle {\text{Oracle}}(x)} that returns always the correct label of example x {\displaystyle x} , algorithm A {\displaystyle {\mathcal {A}}} can only call a faulty oracle Oracle ( x , η ) {\displaystyle {\text{Oracle}}(x,\eta )} that will flip the label of x {\displaystyle x} with probability η {\displaystyle \eta } . As in the Valiant case, the goal of a learning algorithm A {\displaystyle {\mathcal {A}}} is to choose the best function h ∈ H {\displaystyle h\in {\mathcal {H}}} such that it minimizes e r r o r ( h ) = P x ∼ D ( h ( x ) ≠ f ( x ) ) {\displaystyle error(h)=P_{x\sim {\mathcal {D}}}(h(x)\neq f(x))} . In applications it is difficult to have access to the real value of η {\displaystyle \eta } , but we assume we have access to its upperbound η B {\displaystyle \eta _{B}} . Note that if we allow the noise rate to be 1 / 2 {\displaystyle 1/2} , then learning becomes impossible in any amount of computation time, because every label conveys no information about the target function. Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the classification noise model if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , η ) {\displaystyle {\text{Oracle}}(x,\eta )} and a polynomial p ( ⋅ , ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot ,\cdot )} such that for any 0 ≤ η ≤ 1 2 {\displaystyle 0\leq \eta \leq {\frac {1}{2}}} , 0 ≤ ε ≤ 1 {\displaystyle 0\leq \varepsilon \leq 1} and 0 ≤ δ ≤ 1 {\displaystyle 0\leq \delta \leq 1} it outputs, in a number of calls to the oracle bounded by p ( 1 1 − 2 η B , 1 ε , 1 δ , n , s i z e ( f ) ) {\displaystyle p\left({\frac {1}{1-2\eta _{B}}},{\frac {1}{\varepsilon }},{\frac {1}{\delta }},n,size(f)\right)} , a function h ∈ H {\displaystyle h\in {\mathcal {H}}} that satisfies with probability at least 1 − δ {\displaystyle 1-\delta } the condition e r r o r ( h ) ≤ ε {\displaystyle error(h)\leq \varepsilon } . == Statistical query learning == Statistical Query Learning is a kind of active learning problem in which the learning algorithm A {\displaystyle {\mathcal {A}}} can decide if to request information about the likelihood P f ( x ) {\displaystyle P_{f(x)}} that a function f {\displaystyle f} correctly labels example x {\displaystyle x} , and receives an answer accurate within a tolerance α {\displaystyle \alpha } . Formally, whenever the learning algorithm A {\displaystyle {\mathcal {A}}} calls the oracle Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} , it receives as feedback probability Q f ( x ) {\displaystyle Q_{f(x)}} , such that Q f ( x ) − α ≤ P f ( x ) ≤ Q f ( x ) + α {\displaystyle Q_{f(x)}-\alpha \leq P_{f(x)}\leq Q_{f(x)}+\alpha } . Definition: We say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the statistical query learning model if there exists a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} and polynomials p ( ⋅ , ⋅ , ⋅ ) {\displaystyle p(\cdot ,\cdot ,\cdot )} , q ( ⋅ , ⋅ , ⋅ ) {\displaystyle q(\cdot ,\cdot ,\cdot )} , and r ( ⋅ , ⋅ , ⋅ ) {\displaystyle r(\cdot ,\cdot ,\cdot )} such that for any 0 < ε ≤ 1 {\displaystyle 0<\varepsilon \leq 1} the following hold: Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} can evaluate P f ( x ) {\displaystyle P_{f(x)}} in time q ( 1 ε , n , s i z e ( f ) ) {\displaystyle q\left({\frac {1}{\varepsilon }},n,size(f)\right)} ; 1 α {\displaystyle {\frac {1}{\alpha }}} is bounded by r ( 1 ε , n , s i z e ( f ) ) {\displaystyle r\left({\frac {1}{\varepsilon }},n,size(f)\right)} A {\displaystyle {\mathcal {A}}} outputs a model h {\displaystyle h} such that e r r ( h ) < ε {\displaystyle err(h)<\varepsilon } , in a number of calls to the oracle bounded by p ( 1 ε , n , s i z e ( f ) ) {\displaystyle p\left({\frac {1}{\varepsilon }},n,size(f)\right)} . Note that the confidence parameter δ {\displaystyle \delta } does not appear in the definition of learning. This is because the main purpose of δ {\displaystyle \delta } is to allow the learning algorithm a small probability of failure due to an unrepresentative sample. Since now Oracle ( x , α ) {\displaystyle {\text{Oracle}}(x,\alpha )} always guarantees to meet the approximation criterion Q f ( x ) − α ≤ P f ( x ) ≤ Q f ( x ) + α {\displaystyle Q_{f(x)}-\alpha \leq P_{f(x)}\leq Q_{f(x)}+\alpha } , the failure probability is no longer needed. The statistical query model is strictly weaker than the PAC model: any efficiently SQ-learnable class is efficiently PAC learnable in the presence of classification noise, but there exist efficient PAC-learnable problems such as parity that are not efficiently SQ-learnable. == Malicious classification == In the malicious classification model an adversary generates errors to foil the learning algorithm. This setting describes situations of error burst, which may occur when for a limited time transmission equipment malfunctions repeatedly. Formally, algorithm A {\displaystyle {\mathcal {A}}} calls an oracle Oracle ( x , β ) {\displaystyle {\text{Oracle}}(x,\beta )} that returns a correctly labeled example x {\displaystyle x} drawn, as usual, from distribution D {\displaystyle {\mathcal {D}}} over the input space with probability 1 − β {\displaystyle 1-\beta } , but it returns with probability β {\displaystyle \beta } an example drawn from a distribution that is not related to D {\displaystyle {\mathcal {D}}} . Moreover, this maliciously chosen example may strategically selected by an adversary who has knowledge of f {\displaystyle f} , β {\displaystyle \beta } , D {\displaystyle {\mathcal {D}}} , or the current progress of the learning algorithm. Definition: Given a bound β B < 1 2 {\displaystyle \beta _{B}<{\frac {1}{2}}} for 0 ≤ β < 1 2 {\displaystyle 0\leq \beta <{\frac {1}{2}}} , we say that f {\displaystyle f} is efficiently learnable using H {\displaystyle {\mathcal {H}}} in the malicious classification model, if there exist a learning algorithm A {\displaystyle {\mathcal {A}}} that has access to Oracle ( x , β ) {\displaystyle {\text{Oracle}}(x,\beta )}

    Read more →
  • Hit-testing

    Hit-testing

    In computer graphics programming, hit-testing (hit detection, picking, or pick correlation) is the process of determining whether a user-controlled cursor (such as a mouse cursor or touch-point on a touch-screen interface) intersects a given graphical object (such as a shape, line, or curve) drawn on the screen. Hit-testing may be performed on the movement or activation of a mouse or other pointing device. Hit-testing is used by GUI environments to respond to user actions, such as selecting a menu item or a target in a game based on its visual location. In web programming languages such as HTML, SVG, and CSS, this is associated with the concept of pointer-events (e.g. user-initiated cursor movement or object selection). Collision detection is a related concept for detecting intersections of two or more different graphical objects, rather than intersection of a cursor with one or more graphical objects. == Algorithm == There are many different algorithms that may be used to perform hit-testing, with different performance or accuracy outcomes. One common hit-test algorithm for axis aligned bounding boxes. A key idea is that the box being tested must be either entirely above, entirely below, entirely to the right or left of the current box. If this is not possible, they are colliding. Example logic is presented in the pseudo-code below: In Python:

    Read more →
  • European Conference on Computer Vision

    European Conference on Computer Vision

    The European Conference on Computer Vision (ECCV) is a biennial research conference with the proceedings published by Springer Science+Business Media. Similar to ICCV in scope and quality, it is held those years which ICCV is not. It is considered to be one of the top conferences in computer vision, alongside CVPR and ICCV, with an 'A' rating from the Australian Ranking of ICT Conferences and an 'A1' rating from the Brazilian ministry of education. The acceptance rate for ECCV 2010 was 24.4% for posters and 3.3% for oral presentations. Like other top computer vision conferences, ECCV has tutorial talks, technical sessions, and poster sessions. The conference is usually spread over five to six days with the main technical program occupying three days in the middle, and tutorial and workshops, focused on specific topics, being held in the beginning and at the end. The ECCV presents the Koenderink Prize annually to recognize fundamental contributions in computer vision. == Location == The conference is usually held in autumn in Europe.

    Read more →
  • ID3 algorithm

    ID3 algorithm

    In decision tree learning, ID3 (Iterative Dichotomiser 3) is a greedy algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm. The 3 in the name is meant to signify that this was Quinlan's third attempt at a model based on entropy-based splitting, and the term dichotimser is a misnomer as it implies a binary split, but the ID3 algorithm can split on multi-valued attributes. == Algorithm == The ID3 algorithm begins with the original set S {\displaystyle S} as the root node. On each iteration of the algorithm, it iterates through every unused attribute of the set S {\displaystyle S} and calculates the entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} or the information gain I G ( S ) {\displaystyle IG(S)} of that attribute. It then selects the attribute which has the smallest entropy (or largest information gain) value. The set S {\displaystyle S} is then split or partitioned by the selected attribute to produce subsets of the data. (For example, a node can be split into child nodes based upon the subsets of the population whose ages are less than 50, between 50 and 100, and greater than 100.) The algorithm continues to recurse on each subset, considering only attributes never selected before. Recursion on a subset may stop in one of these cases: every element in the subset belongs to the same class; in which case the node is turned into a leaf node and labelled with the class of the examples. there are no more attributes to be selected, but the examples still do not belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset. there are no examples in the subset, which happens when no example in the parent set was found to match a specific value of the selected attribute. An example could be the absence of a person among the population with age over 100 years. Then a leaf node is created and labelled with the most common class of the examples in the parent node's set. Throughout the algorithm, the decision tree is constructed with each non-terminal node (internal node) representing the selected attribute on which the data was split, and terminal nodes (leaf nodes) representing the class label of the final subset of this branch. === Summary === Calculate the entropy of every attribute a {\displaystyle a} of the data set S {\displaystyle S} . Partition ("split") the set S {\displaystyle S} into subsets using the attribute for which the resulting entropy after splitting is minimized; or, equivalently, information gain is maximum. Make a decision tree node containing that attribute. Recurse on subsets using the remaining attributes. === Properties === ID3 does not guarantee an optimal solution. It can converge upon local optima. It uses a greedy strategy by selecting the locally best attribute to split the dataset on each iteration. The algorithm's optimality can be improved by using backtracking during the search for the optimal decision tree at the cost of possibly taking longer. ID3 can overfit the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. This algorithm usually produces small trees, but it does not always produce the smallest possible decision tree. ID3 is harder to use on continuous data than on factored data (factored data has a discrete number of possible values, thus reducing the possible branch points). If the values of any given attribute are continuous, then there are many more places to split the data on this attribute, and searching for the best value to split by can be time-consuming. === Usage === The ID3 algorithm is used by training on a data set S {\displaystyle S} to produce a decision tree which is stored in memory. At runtime, this decision tree is used to classify new test cases (feature vectors) by traversing the decision tree using the features of the datum to arrive at a leaf node. == The ID3 metrics == === Entropy === Entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} is a measure of the amount of uncertainty in the (data) set S {\displaystyle S} (i.e. entropy characterizes the (data) set S {\displaystyle S} ). H ( S ) = ∑ x ∈ X − p ( x ) log 2 ⁡ p ( x ) {\displaystyle \mathrm {H} {(S)}=\sum _{x\in X}{-p(x)\log _{2}p(x)}} Where, S {\displaystyle S} – The current dataset for which entropy is being calculated This changes at each step of the ID3 algorithm, either to a subset of the previous set in the case of splitting on an attribute or to a "sibling" partition of the parent in case the recursion terminated previously. X {\displaystyle X} – The set of classes in S {\displaystyle S} p ( x ) {\displaystyle p(x)} – The proportion of the number of elements in class x {\displaystyle x} to the number of elements in set S {\displaystyle S} When H ( S ) = 0 {\displaystyle \mathrm {H} {(S)}=0} , the set S {\displaystyle S} is perfectly classified (i.e. all elements in S {\displaystyle S} are of the same class). In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest entropy is used to split the set S {\displaystyle S} on this iteration. Entropy in information theory measures how much information is expected to be gained upon measuring a random variable; as such, it can also be used to quantify the amount to which the distribution of the quantity's values is unknown. A constant quantity has zero entropy, as its distribution is perfectly known. In contrast, a uniformly distributed random variable (discretely or continuously uniform) maximizes entropy. Therefore, the greater the entropy at a node, the less information is known about the classification of data at this stage of the tree; and therefore, the greater the potential to improve the classification here. As such, ID3 is a greedy heuristic performing a best-first search for locally optimal entropy values. Its accuracy can be improved by preprocessing the data. === Information gain === Information gain I G ( A ) {\displaystyle IG(A)} is the measure of the difference in entropy from before to after the set S {\displaystyle S} is split on an attribute A {\displaystyle A} . In other words, how much uncertainty in S {\displaystyle S} was reduced after splitting set S {\displaystyle S} on attribute A {\displaystyle A} . I G ( S , A ) = H ( S ) − ∑ t ∈ T p ( t ) H ( t ) = H ( S ) − H ( S | A ) . {\displaystyle IG(S,A)=\mathrm {H} {(S)}-\sum _{t\in T}p(t)\mathrm {H} {(t)}=\mathrm {H} {(S)}-\mathrm {H} {(S|A)}.} Where, H ( S ) {\displaystyle \mathrm {H} (S)} – Entropy of set S {\displaystyle S} T {\displaystyle T} – The subsets created from splitting set S {\displaystyle S} by attribute A {\displaystyle A} such that S = ⋃ t ∈ T t {\displaystyle S=\bigcup _{t\in T}t} p ( t ) {\displaystyle p(t)} – The proportion of the number of elements in t {\displaystyle t} to the number of elements in set S {\displaystyle S} H ( t ) {\displaystyle \mathrm {H} (t)} – Entropy of subset t {\displaystyle t} In ID3, information gain can be calculated (instead of entropy) for each remaining attribute. The attribute with the largest information gain is used to split the set S {\displaystyle S} on this iteration.

    Read more →