AI Face Combiner

AI Face Combiner — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • WeChat

    WeChat

    WeChat, or Weixin in Chinese (Chinese: 微信; pinyin: Wēixìn; lit. 'micro-message'), is an instant messaging, social media, and mobile payment app developed by Tencent. First released in 2011, it became the world's largest standalone mobile app in 2018 with over 1 billion monthly active users. The Chinese version of WeChat, Weixin, has been described as China's "app for everything" and a super-app because of its wide range of functions. WeChat provides text messaging, hold-to-talk voice messaging, broadcast (one-to-many) messaging, video conferencing, video games, mobile payment, sharing of photographs and videos, and location sharing. It has been described as "an almost indispensable part of life in China". Accounts registered using Chinese phone numbers are managed under the Weixin brand; their data is stored in mainland China and is subject to Weixin's terms of service and privacy policy. Accounts registered using non-Chinese numbers are managed under the WeChat brand; these users are subject to more liberal terms of service and a better privacy policy, and their data is stored in the Netherlands for users in the European Union and in Singapore for other users. User activity on Weixin, the Chinese version of the app, is analyzed, tracked and shared with Chinese authorities upon request as part of the mass surveillance network in China. Chinese-registered Weixin accounts censor politically sensitive topics, and the software license agreement for Weixin (but not WeChat) explicitly forbids content which "[en]danger[s] national security, divulge[s] state secrets, subvert[s] state power and undermine[s] national unity", as well as other types of content such as content that "[u]ndermine[s] national religious policies" and content that is "[i]nciting illegal assembly, association, procession, demonstrations and gatherings disrupting the social order".
Because WeChat is so central to life in China, having a WeChat account banned can significantly disrupt a Chinese user's life. Any interactions between Weixin and WeChat users are subject to the terms of service and privacy policies of both services.

== History ==

By 2010, Tencent had already attained a massive user base with its desktop messenger app QQ. Recognizing that smartphones were likely to disrupt this status quo, CEO Pony Ma sought to proactively invest in alternatives to the company's own QQ messenger app. WeChat began as a project at Tencent Guangzhou Research and Project center in October 2010. The original version of the app was created by Allen Zhang, named "Weixin" (微信) by Pony Ma, and launched in 2011. User adoption was initially very slow, with users wondering why key features were missing; however, after the release of the walkie-talkie-like voice messaging feature in May of that year, growth surged. By 2012, when the number of users reached 100 million, Weixin was re-branded "WeChat" by President Martin Lau for the international market.

During a period of government support for e-commerce development (for example, in the 12th five-year plan, 2011–2015), WeChat added features enabling payments and commerce in 2013, which saw massive adoption after its virtual red envelope promotion for Chinese New Year 2014. After the launch of WeChat payment in 2013, its users reached 400 million the next year, 90 percent of whom were in China. WeChat had over 889 million monthly active users by 2016, and by 2019 its monthly active users had risen to an estimated one billion. As of January 2022, WeChat was reported to have more than 1.2 billion users. By comparison, Facebook Messenger and WhatsApp had about one billion monthly active users in 2016 but did not offer most of the other services available on WeChat.
For example, in Q2 2017, WeChat's revenues from social media advertising were about US$0.9 billion (RMB6 billion), compared with Facebook's total revenues of US$9.3 billion, 98% of which were from social media advertising. WeChat's revenues from its value-added services were US$5.5 billion. By 2018, WeChat was used by 93.5% of Chinese internet users. In that year, it became the world's largest standalone mobile app, with over 1 billion monthly active users.

In response to a border dispute between India and China, WeChat was banned in India in June 2020 along with several other Chinese apps, including TikTok. U.S. president Donald Trump sought to ban U.S. "transactions" with WeChat through an executive order but was blocked by a preliminary injunction issued in the United States District Court for the Northern District of California in September 2020. Joe Biden officially dropped Trump's efforts to ban WeChat in the U.S. in June 2021.

== Features ==

WeChat has been described as China's "app for everything" and a super-app because of its wide range of functions. WeChat provides text messaging, hold-to-talk voice messaging, broadcast (one-to-many) messaging, video conferencing, video games, mobile payment, sharing of photographs and videos, and location sharing. It has been described as "an almost indispensable part of life in China". Because WeChat is so central to life in China, having a WeChat account banned can significantly disrupt a Chinese user's life.

=== Messaging ===

WeChat provides a variety of features including text messaging, hold-to-talk voice messaging, broadcast (one-to-many) messaging, video calls and conferencing, video games, photograph and video sharing, and location sharing. WeChat also allows users to exchange contacts with people nearby via Bluetooth, and provides various features for contacting people at random, if those people are open to it.
It can also integrate with other social networking services such as Facebook and Tencent QQ. Photographs may be embellished with filters and captions, and an automatic translation service is available that can translate conversations during messaging. WeChat supports different instant messaging methods, including text messages, voice messages, walkie-talkie, and stickers. Users can send previously saved or live pictures and videos, profiles of other users, coupons, lucky money packages, or current GPS locations to friends, either individually or in a group chat. WeChat also provides a message recall feature that lets users withdraw content (e.g. images, documents) sent within the last 2 minutes of a conversation. A voice-to-text feature is useful when it is not convenient to listen to voice messages, and WeChat has a basic ability to recognize emojis based on different tones of voice.

WeChat also implements a distance sensing feature. It can activate the receiver's hold-to-talk function when the phone is brought close to the ear. Once the receiver is held at a certain distance from the ear, the sensor automatically disables the phone speakers. This feature eliminates the risk of the user's voice messages being inadvertently broadcast to the general public.

=== Public accounts ===

WeChat users can register as a public account (公众号), which enables them to push feeds to subscribers, interact with subscribers, and provide subscribers with services. Users can also create an official account, which falls under one of three types: service, subscription, or enterprise. Once users, whether individuals or organizations, set up one type of account, they cannot change it to another type. By the end of 2014, the number of WeChat official accounts had reached 8 million. Official accounts of organizations can apply to be verified (at a cost of 300 RMB, or about US$45).
Official accounts can be used as a platform for services such as hospital pre-registrations or credit card services. To create an official account, the applicant must register with Chinese authorities, which discourages "foreign companies". In April 2022, WeChat announced that it would start displaying the location of users in China every time they post on a public account. Meanwhile, for overseas users, public accounts would display the country based on the user's IP address.

=== Moments ===

"Moments" (朋友圈) is WeChat's brand name for its social feed of friends' updates. "Moments" is an interactive platform that allows users to post images, text, and short videos they have taken. It also allows users to share articles and music (associated with QQ Music or other web-based music services). Friends in the contact list can like the content and leave comments, so it functions similarly to a private social network. In 2017, WeChat had a policy of a maximum of two advertisements per day per Moments user.

Privacy in WeChat works by groups of friends: only friends in the user's contact list are able to view their Moments' contents and comments. A friend of the user can see likes and comments from other users only if they are in a mutual friend group. For example, friends from high school are not able to

    Read more →
  • Neural cryptography

    Neural cryptography

    Neural cryptography is a branch of cryptography dedicated to analyzing the application of stochastic algorithms, especially artificial neural network algorithms, for use in encryption and cryptanalysis.

== Definition ==

Artificial neural networks are well known for their ability to selectively explore the solution space of a given problem. This feature finds a natural niche of application in the field of cryptanalysis. At the same time, neural networks offer a new approach to attacking ciphering algorithms, based on the principle that a neural network can reproduce any function, which in principle makes it a computational tool for finding the inverse function of a cryptographic algorithm.

The ideas of mutual learning, self-learning, and stochastic behavior of neural networks and similar algorithms can be used for different aspects of cryptography, such as public-key cryptography, solving the key distribution problem using neural network mutual synchronization, hashing, or generation of pseudo-random numbers.

Another idea is the ability of a neural network to separate space into non-linear pieces using "bias". It gives different probabilities of activating the neural network or not. This is very useful in cryptanalysis.

Two names are used to designate the same domain of research: Neuro-Cryptography and Neural Cryptography. The first known work on this topic can be traced back to 1995, in an IT master's thesis.

== Applications ==

In 1995, Sebastien Dourlens applied neural networks to cryptanalyze DES by allowing the networks to learn how to invert the S-tables of DES. The bias in DES studied through differential cryptanalysis by Adi Shamir is highlighted. The experiment shows that about 50% of the key bits can be found, allowing the complete key to be found in a short time. Hardware implementations using multiple microcontrollers have been proposed, because multilayer neural networks are easy to implement in hardware.
One example of a public-key protocol is given by Khalil Shihab. He describes a decryption scheme and public key creation based on a backpropagation neural network. The encryption scheme and the private key creation process are based on Boolean algebra. This technique has the advantage of small time and memory complexities. A disadvantage is a property of backpropagation algorithms: because of huge training sets, the learning phase of a neural network is very long. Therefore, the use of this protocol is only theoretical so far.

== Neural key exchange protocol ==

The most widely used protocol for key exchange between two parties A and B in practice is the Diffie–Hellman key exchange protocol. Neural key exchange, which is based on the synchronization of two tree parity machines, is intended to be a secure replacement for this method. Synchronizing these two machines is similar to synchronizing two chaotic oscillators in chaos communications.

=== Tree parity machine ===

The tree parity machine is a special type of multi-layer feedforward neural network. It consists of one output neuron, K hidden neurons and K×N input neurons. Inputs to the network take three values:

$$x_{ij} \in \{-1, 0, +1\}$$

The weights between input and hidden neurons take the values:

$$w_{ij} \in \{-L, \ldots, 0, \ldots, +L\}$$

The output value of each hidden neuron is calculated as the sum of the products of its input neurons and the corresponding weights:

$$\sigma_i = \operatorname{sgn}\left(\sum_{j=1}^{N} w_{ij} x_{ij}\right)$$

Signum is a simple function that returns −1, 0 or 1:

$$\operatorname{sgn}(x) = \begin{cases} -1 & \text{if } x < 0, \\ 0 & \text{if } x = 0, \\ 1 & \text{if } x > 0. \end{cases}$$

If the scalar product is 0, the output of the hidden neuron is mapped to −1 in order to ensure a binary output value. The output of the neural network is then computed as the product of all values produced by the hidden elements:

$$\tau = \prod_{i=1}^{K} \sigma_i$$

The output of the tree parity machine is binary.

=== Protocol ===

Each party (A and B) uses its own tree parity machine. Synchronization of the tree parity machines is achieved in these steps:

1. Initialize random weight values.
2. Execute these steps until full synchronization is achieved:
   2.1. Generate a random input vector X.
   2.2. Compute the values of the hidden neurons.
   2.3. Compute the value of the output neuron.
   2.4. Compare the values of both tree parity machines:
        • Outputs are the same: one of the suitable learning rules is applied to the weights.
        • Outputs are different: go to 2.1.

After full synchronization is achieved (the weights $w_{ij}$ of both tree parity machines are the same), A and B can use their weights as keys. This method is known as bidirectional learning.
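The synchronization steps above can be sketched in a few lines of Python. This is a minimal illustration and not a secure implementation: the class and function names are hypothetical, the parameters K = 3, N = 10, L = 3 are chosen only to keep the run short, inputs are drawn from {−1, +1} (a subset of the allowed input values), and the Hebbian rule is used as the learning rule, with clipping to [−L, L] standing in for the function g.

```python
import random


def sgn(x):
    # A zero local field is mapped to -1 so that every hidden output is binary.
    return 1 if x > 0 else -1


class TreeParityMachine:
    """Hypothetical helper: K hidden neurons, N inputs each, weights in [-L, L]."""

    def __init__(self, k, n, l, rng):
        self.k, self.n, self.l = k, n, l
        self.w = [[rng.randint(-l, l) for _ in range(n)] for _ in range(k)]
        self.sigma = [1] * k
        self.tau = 1

    def output(self, x):
        # sigma_i = sgn(sum_j w_ij * x_ij); tau = product of all sigma_i.
        self.sigma = [
            sgn(sum(w * v for w, v in zip(self.w[i], x[i]))) for i in range(self.k)
        ]
        self.tau = 1
        for s in self.sigma:
            self.tau *= s
        return self.tau

    def hebbian_update(self, x):
        # Only hidden units that agree with the machine's own output move,
        # and every weight is clipped back into [-L, L].
        for i in range(self.k):
            if self.sigma[i] == self.tau:
                for j in range(self.n):
                    moved = self.w[i][j] + self.sigma[i] * x[i][j]
                    self.w[i][j] = max(-self.l, min(self.l, moved))


def synchronize(k=3, n=10, l=3, seed=0, max_steps=100_000):
    rng = random.Random(seed)
    a = TreeParityMachine(k, n, l, rng)
    b = TreeParityMachine(k, n, l, rng)
    for step in range(1, max_steps + 1):
        # Both parties see the same public random input.
        x = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
        if a.output(x) == b.output(x):
            a.hebbian_update(x)
            b.hebbian_update(x)
        if a.w == b.w:
            return a, b, step
    return a, b, None  # did not synchronize within max_steps


if __name__ == "__main__":
    a, b, steps = synchronize()
    print("synchronized:", steps is not None, "after", steps, "steps")
```

In a real exchange the two parties would only share the inputs and the outputs τ; comparing `a.w == b.w` directly is a shortcut that works here because both machines live in one process.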
One of the following learning rules can be used for the synchronization:

    • Hebbian learning rule: $w_i^{+} = g\left(w_i + \sigma_i x_i \,\Theta(\sigma_i, \tau)\,\Theta(\tau^A, \tau^B)\right)$
    • Anti-Hebbian learning rule: $w_i^{+} = g\left(w_i - \sigma_i x_i \,\Theta(\sigma_i, \tau)\,\Theta(\tau^A, \tau^B)\right)$
    • Random walk: $w_i^{+} = g\left(w_i + x_i \,\Theta(\sigma_i, \tau)\,\Theta(\tau^A, \tau^B)\right)$

Where $\Theta(a,b)=0$ if $a\neq b$, and $\Theta(a,b)=1$ otherwise; and $g(x)$ is a function that keeps the $w_i$ in the range $\{-L, -L+1, \ldots, 0, \ldots, L-1, L\}$.

=== Attacks and security of this protocol ===

Every attack considered here assumes that the attacker E can eavesdrop on messages between the parties A and B, but does not have an opportunity to change them.

==== Brute force ====

To mount a brute force attack, an attacker has to test all possible keys (all possible values of the weights $w_{ij}$). With K hidden neurons, K×N input neurons and weight boundary L, this gives $(2L+1)^{KN}$ possibilities. For example, the configuration K = 3, L = 3 and N = 100 gives $7^{300} \approx 3 \cdot 10^{253}$ key possibilities, making the attack infeasible with today's computer power.

==== Learning with own tree parity machine ====

One of the basic attacks can be mounted by an attacker who owns the same tree parity machine as the parties A and B and wants to synchronize it with these two parties. In each step there are three possible situations:

    • Output(A) ≠ Output(B): None of the parties updates its weights.
    • Output(A) = Output(B) = Output(E): All three parties update the weights in their tree parity machines.
    • Output(A) = Output(B) ≠ Output(E): Parties A and B update their tree parity machines, but the attacker cannot. Because of this situation, the attacker's learning is slower than the synchronization of parties A and B.

It has been proven that the synchronization of two parties is faster than the learning of an attacker. The advantage can be improved by increasing the synaptic depth L of the neural network. That gives this protocol enough security, and an attacker can find out the key only with small probability.

==== Other attacks ====

For conventional cryptographic systems, the security of the protocol can be improved by increasing the key length. In the case of neural cryptography, it is improved by increasing the synaptic depth L of the neural networks. Changing this parameter increases the cost of a successful attack exponentially, while the effort for the users grows polynomially. Therefore, breaking the security of neural key exchange belongs to the complexity class NP.

Alexander Klimov, Anton Mityaguine, and Adi Shamir say that the original neural synchronization scheme can be broken by at least three different attacks: geometric, probabilistic analysis, and using genetic algorithms. Even though this particular implementation is insecure, the ideas behind chaotic synchronization could potentially lead to a secure implementation.

=== Permutation parity machine ===

The permutation parity machine is a binary variant of the tree parity machine. It consists of one input layer, one hidden layer and one output layer. The number of neurons in the output layer depends on the number of hidden units K.
Each hidden neuron has N binary input neurons:

$$x_{ij} \in \{0, 1\}$$

The weights between input and hidden neurons are also binary:

$$w_{ij} \in \{0, 1\}$$

The output value of each hidden neuron is calculated as a sum of all exclusive disjunctions (exclusive or) of input neurons and these weights:

$$\sigma_i = \theta_N\left(\sum_{j=1}^{N} w_{ij} \oplus x_{ij}\right)$$

(⊕ means XOR). Th

    Read more →
  • Bayesian network

    Bayesian network

    A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are well suited to taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

Efficient algorithms can perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables (e.g. speech signals or protein sequences) are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams.

== Graphical model ==

Formally, Bayesian networks are directed acyclic graphs (DAGs) whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Each edge represents a direct conditional dependency. Any pair of nodes that are not connected (i.e. no path connects one node to the other) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node.
For example, if $m$ parent nodes represent $m$ Boolean variables, then the probability function could be represented by a table of $2^{m}$ entries, one entry for each of the $2^{m}$ possible parent combinations. Similar ideas may be applied to undirected, and possibly cyclic, graphs such as Markov networks.

== Example ==

Suppose we want to model the dependencies between three variables: the sprinkler (or more appropriately, its state: whether it is on or not), the presence or absence of rain, and whether the grass is wet or not. Observe that two events can cause the grass to become wet: an active sprinkler or rain. Rain has a direct effect on the use of the sprinkler (namely, when it rains, the sprinkler usually is not active). This situation can be modeled with a Bayesian network. Each variable has two possible values, T (for true) and F (for false).

The joint probability function is, by the chain rule of probability,

$$\Pr(G,S,R)=\Pr(G\mid S,R)\Pr(S\mid R)\Pr(R)$$

where G = "Grass wet (true/false)", S = "Sprinkler turned on (true/false)", and R = "Raining (true/false)".

The model can answer questions about the presence of a cause given the presence of an effect (so-called inverse probability), like "What is the probability that it is raining, given the grass is wet?", by using the conditional probability formula and summing over all nuisance variables:

$$\Pr(R=T\mid G=T)=\frac{\Pr(G=T,R=T)}{\Pr(G=T)}=\frac{\sum_{x\in\{T,F\}}\Pr(G=T,S=x,R=T)}{\sum_{x,y\in\{T,F\}}\Pr(G=T,S=x,R=y)}$$

Using the expansion for the joint probability function $\Pr(G,S,R)$ and the conditional probabilities from the conditional probability tables (CPTs) stated in the diagram, one can evaluate each term in the sums in the numerator and denominator. For example,

$$\begin{aligned}\Pr(G=T,S=T,R=T)&=\Pr(G=T\mid S=T,R=T)\Pr(S=T\mid R=T)\Pr(R=T)\\&=0.99\times 0.01\times 0.2\\&=0.00198.\end{aligned}$$

Then the numerical results (subscripted by the associated variable values, in the order G, S, R) are

$$\Pr(R=T\mid G=T)=\frac{0.00198_{TTT}+0.1584_{TFT}}{0.00198_{TTT}+0.288_{TTF}+0.1584_{TFT}+0.0_{TFF}}=\frac{891}{2491}\approx 35.77\%.$$

To answer an interventional question, such as "What is the probability that it would rain, given that we wet the grass?", the answer is governed by the post-intervention joint distribution function

$$\Pr(S,R\mid \text{do}(G=T))=\Pr(S\mid R)\Pr(R)$$

obtained by removing the factor $\Pr(G\mid S,R)$ from the pre-intervention distribution. The do operator forces the value of G to be true. The probability of rain is unaffected by the action:

$$\Pr(R\mid \text{do}(G=T))=\Pr(R).$$

To predict the impact of turning the sprinkler on:

$$\Pr(R,G\mid \text{do}(S=T))=\Pr(R)\Pr(G\mid R,S=T)$$

with the term $\Pr(S=T\mid R)$ removed, showing that the action affects the grass but not the rain.

These predictions may not be feasible given unobserved variables, as in most policy evaluation problems. The effect of the action $\text{do}(x)$ can still be predicted, however, whenever the back-door criterion is satisfied. It states that, if a set Z of nodes can be observed that d-separates (or blocks) all back-door paths from X to Y, then

$$\Pr(Y,Z\mid \text{do}(x))=\frac{\Pr(Y,Z,X=x)}{\Pr(X=x\mid Z)}.$$

A back-door path is one that ends with an arrow into X. Sets that satisfy the back-door criterion are called "sufficient" or "admissible." For example, the set Z = R is admissible for predicting the effect of S = T on G, because R d-separates the (only) back-door path S ← R → G. However, if S is not observed, no other set d-separates this path and the effect of turning the sprinkler on (S = T) on the grass (G) cannot be predicted from passive observations. In that case P(G | do(S = T)) is not "identified". This reflects the fact that, lacking interventional data, one cannot tell whether the observed dependence between S and G is due to a causal connection or is spurious (apparent dependence arising from a common cause, R); see Simpson's paradox.

To determine whether a causal relation is identified from an arbitrary Bayesian network with unobserved variables, one can use the three rules of "do-calculus" and test whether all do terms can be removed from the expression of that relation, thus confirming that the desired quantity is estimable from frequency data.
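The sprinkler calculation above can be reproduced with exact fractions. The sketch below is illustrative only. The values 0.99, 0.01 and 0.2 are quoted in the text; Pr(G=T | S=F, R=T) = 0.8 and Pr(G=T | S=F, R=F) = 0 follow from the quoted terms 0.1584 and 0.0; the split of the 0.288 term into Pr(G=T | S=T, R=F) = 0.9 and Pr(S=T | R=F) = 0.4 is an assumption, since the diagram holding the CPTs is not part of this text. All function names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

# Conditional probability tables (partly assumed, see the note above).
P_R = {True: F(2, 10), False: F(8, 10)}            # Pr(R = r)
P_S_GIVEN_R = {True: F(1, 100), False: F(4, 10)}   # Pr(S = T | R = r)
P_G_GIVEN_SR = {                                   # Pr(G = T | S = s, R = r)
    (True, True): F(99, 100),
    (True, False): F(9, 10),
    (False, True): F(8, 10),
    (False, False): F(0),
}


def joint(g, s, r):
    """Pr(G=g, S=s, R=r) = Pr(G | S, R) * Pr(S | R) * Pr(R)."""
    pg = P_G_GIVEN_SR[(s, r)]
    ps = P_S_GIVEN_R[r]
    return (pg if g else 1 - pg) * (ps if s else 1 - ps) * P_R[r]


def prob_rain_given_wet():
    """Pr(R=T | G=T), summing the nuisance variable S out of the joint."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    denominator = sum(
        joint(True, s, r) for s, r in product((True, False), repeat=2)
    )
    return numerator / denominator


def prob_wet_do_sprinkler_on():
    """Pr(G=T | do(S=T)): the factor Pr(S | R) is dropped and S is fixed to T."""
    return sum(P_R[r] * P_G_GIVEN_SR[(True, r)] for r in (True, False))


if __name__ == "__main__":
    print(prob_rain_given_wet())        # 891/2491
    print(prob_wet_do_sprinkler_on())   # 459/500
```

Using `Fraction` instead of floats keeps the result identical to the closed form 891/2491 rather than a rounded decimal.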
Using a Bayesian network can save considerable amounts of memory over exhaustive probability tables, if the dependencies in the joint distribution are sparse. For example, a naive way of storing the conditional probabilities of 10 two-valued variables as a table requires storage space for $2^{10}=1024$ values. If no variable's local distribution depends on more than three parent variables, the Bayesian network representation stores at most $10\cdot 2^{3}=80$ values.

One advantage of Bayesian networks is that it is intuitively easier for a human to understand (a sparse set of) direct dependencies and local distributions than complete joint distributions.

== Inference and learning ==

Bayesian networks perform three main inference tasks:

    • Inferring unobserved variables
    • Parameter learning for the probability distributions of each node in the network
    • Structure learning of the graphical network

=== Inferring unobserved variables ===

Because a Bayesian network is a complete model for its variables and their relationships, it can be used to answer probabilistic queries about them. For example, the network can be used to update knowledge of the state of a subset of variables when other variables (the evidence variables) are observed. This process of computing the posterior distribution of variables given evidence is called probabilistic inference. The posterior gives a universal sufficient statistic for detection applications, when choosing values for the variable subset that minimize some expected loss function, for instance the probability of decision error. A Bayesian network can thus be considered a mechanism for automatically applying Bayes' theorem to complex problems.

The most common exact inference methods are: variable elimination, which eliminates (by integration or summation) the non-observed non-query variables one by one by distributing the sum over the prod

    Read more →
  • Locality-sensitive hashing

    Locality-sensitive hashing

    In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. The number of buckets is much smaller than the universe of possible input items. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

Hashing-based approximate nearest-neighbor search algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as locality-preserving hashing (LPH).

Locality-preserving hashing was initially devised as a way to facilitate data pipelining in implementations of massively parallel algorithms that use randomized routing and universal hashing to reduce memory contention and network congestion.

== Definitions ==

A finite family $\mathcal{F}$ of functions $h\colon M\to S$ is defined to be an LSH family for a metric space $\mathcal{M}=(M,d)$, a threshold $r>0$, an approximation factor $c>1$, and probabilities $p_{1}>p_{2}$ if it satisfies the following condition.
For any two points $a,b\in M$ and a hash function $h$ chosen uniformly at random from $\mathcal{F}$:

    • If $d(a,b)\leq r$, then $h(a)=h(b)$ (i.e., a and b collide) with probability at least $p_{1}$,
    • If $d(a,b)\geq cr$, then $h(a)=h(b)$ with probability at most $p_{2}$.

Such a family $\mathcal{F}$ is called $(r,cr,p_{1},p_{2})$-sensitive.

=== LSH with respect to a similarity measure ===

Alternatively, it is possible to define an LSH family on a universe of items U endowed with a similarity function $\phi\colon U\times U\to[0,1]$. In this setting, an LSH scheme is a family of hash functions H coupled with a probability distribution D over H such that a function $h\in H$ chosen according to D satisfies $\Pr[h(a)=h(b)]=\phi(a,b)$ for each $a,b\in U$.

=== Amplification ===

Given a $(d_{1},d_{2},p_{1},p_{2})$-sensitive family $\mathcal{F}$, we can construct new families $\mathcal{G}$ by either the AND-construction or OR-construction of $\mathcal{F}$.

To create an AND-construction, we define a new family $\mathcal{G}$ of hash functions g, where each function g is constructed from k random functions $h_{1},\ldots,h_{k}$ from $\mathcal{F}$. We then say that for a hash function $g\in\mathcal{G}$, $g(x)=g(y)$ if and only if all $h_{i}(x)=h_{i}(y)$ for $i=1,2,\ldots,k$.
Since the members of $\mathcal{F}$ are independently chosen for any $g\in\mathcal{G}$, $\mathcal{G}$ is a $(d_{1},d_{2},p_{1}^{k},p_{2}^{k})$-sensitive family.

To create an OR-construction, we define a new family $\mathcal{G}$ of hash functions g, where each function g is constructed from k random functions $h_{1},\ldots,h_{k}$ from $\mathcal{F}$. We then say that for a hash function $g\in\mathcal{G}$, $g(x)=g(y)$ if and only if $h_{i}(x)=h_{i}(y)$ for one or more values of i. Since the members of $\mathcal{F}$ are independently chosen for any $g\in\mathcal{G}$, $\mathcal{G}$ is a $(d_{1},d_{2},1-(1-p_{1})^{k},1-(1-p_{2})^{k})$-sensitive family.

== Applications ==

LSH has been applied to several problem domains, including:

    • Near-duplicate detection
    • Hierarchical clustering
    • Genome-wide association study
    • Image similarity identification (VisualRank)
    • Gene expression similarity identification
    • Audio similarity identification
    • Nearest neighbor search
    • Audio fingerprint
    • Digital video fingerprinting
    • Shared memory organization in parallel computing
    • Physical data organization in database management systems
    • Training fully connected neural networks
    • Computer security
    • Machine learning

== Methods ==

=== Bit sampling for Hamming distance ===

One of the easiest ways to construct an LSH family is by bit sampling. This approach works for the Hamming distance over d-dimensional vectors $\{0,1\}^{d}$.
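As a numerical check on the AND- and OR-constructions from the Amplification section, the following sketch computes the collision probabilities of the amplified families and shows how chaining the two widens the gap between p1 and p2. The function names and the sample values 0.8 and 0.4 are illustrative assumptions, not taken from the source.

```python
def and_construction(p1, p2, k):
    """All k independent hash functions must collide: p -> p**k."""
    return p1 ** k, p2 ** k


def or_construction(p1, p2, k):
    """At least one of k independent hash functions collides: p -> 1 - (1 - p)**k."""
    return 1 - (1 - p1) ** k, 1 - (1 - p2) ** k


def and_then_or(p1, p2, k, b):
    """AND over k functions, then OR over b such groups (a common banding layout)."""
    q1, q2 = and_construction(p1, p2, k)
    return or_construction(q1, q2, b)


if __name__ == "__main__":
    p1, p2 = 0.8, 0.4
    print("AND, k=2:", and_construction(p1, p2, 2))
    print("OR,  k=2:", or_construction(p1, p2, 2))
    print("AND(4) then OR(4):", and_then_or(p1, p2, 4, 4))
```

The AND step lowers both probabilities and the OR step raises both; applied in sequence, the probability for near points stays high while the probability for far points drops, which is why the two constructions are usually combined.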
Here, the family $\mathcal{F}$ of hash functions is simply the family of all the projections of points on one of the $d$ coordinates, i.e., $\mathcal{F} = \{h \colon \{0,1\}^d \to \{0,1\} \mid h(x) = x_i \text{ for some } i \in \{1, \ldots, d\}\}$, where $x_i$ is the $i$th coordinate of $x$. A random function $h$ from $\mathcal{F}$ simply selects a random bit from the input point. This family has the following parameters: $P_1 = 1 - R/d$, $P_2 = 1 - cR/d$. That is, any two vectors $x, y$ with Hamming distance at most $R$ collide under a random $h$ with probability at least $P_1$. Any $x, y$ with Hamming distance at least $cR$ collide with probability at most $P_2$.

=== Min-wise independent permutations ===

Suppose $U$ is composed of subsets of some ground set of enumerable items $S$ and the similarity function of interest is the Jaccard index $J$. If $\pi$ is a permutation on the indices of $S$, for $A \subseteq S$ let $h(A) = \min_{a \in A} \{\pi(a)\}$. Each possible choice of $\pi$ defines a single hash function $h$ mapping input sets to elements of $S$. Define the function family $H$ to be the set of all such functions and let $D$ be the uniform distribution. Given two sets $A, B \subseteq S$, the event that $h(A) = h(B)$ corresponds exactly to the event that the minimizer of $\pi$ over $A \cup B$ lies inside $A \cap B$.
As $h$ was chosen uniformly at random, $\Pr[h(A) = h(B)] = J(A,B)$ and $(H, D)$ define an LSH scheme for the Jaccard index.

Because the symmetric group on $n$ elements has size $n!$, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized $n$. For this reason, there has been significant work on finding a family of permutations that is "min-wise independent": a permutation family for which each element of the domain has equal probability of being the minimum under a randomly chosen $\pi$. It has been established that a min-wise independent family of permutations is at least of size $\operatorname{lcm}\{1, 2, \ldots, n\} \geq e^{n - o(n)}$, and that this bound is tight.

Because min-wise independent families are too big for practical applications, two variant notions of min-wise independence are introduced: restricted min-wise independent permutation families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most $k$. Approximate min-wise independence differs from the property by at most a fixed $\epsilon$.

=== Open source methods ===

==== Nilsimsa Hash ====

Nilsimsa is a locality-sensitive hashing algorithm used in anti-spam efforts. The goal of Nilsimsa is to generate a hash digest of an email message such that the digests of two similar messages are similar to each other. The paper suggests that Nilsimsa satisfies three requirements: The digest identifying each message should not
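
The bit-sampling family and the AND/OR amplification described earlier in this excerpt can be illustrated with a small Python sketch. The function names, the toy vectors and the trial count are assumptions made for illustration only; the sketch simply draws random coordinates and counts collisions, then evaluates the amplified probabilities $p^k$ and $1-(1-p)^k$.

```python
import random

def bit_sampling_hash(d, rng):
    """Draw h from the bit-sampling family: h(x) = x[i] for a random coordinate i."""
    i = rng.randrange(d)
    return lambda x: x[i]

def and_construction(hashes):
    """g(x) is the tuple of all h_i(x), so g(x) == g(y) iff every h_i agrees."""
    return lambda x: tuple(h(x) for h in hashes)

def or_collides(hashes, x, y):
    """OR-construction: x and y collide iff at least one h_i agrees."""
    return any(h(x) == h(y) for h in hashes)

def amplified_probabilities(p, k):
    """Collision probabilities after combining k independent functions: (AND, OR)."""
    return p ** k, 1 - (1 - p) ** k

def estimate_collision(x, y, trials=20000, seed=0):
    """Monte Carlo estimate of Pr[h(x) = h(y)] for a random bit-sampling h."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = bit_sampling_hash(len(x), rng)
        hits += h(x) == h(y)
    return hits / trials

# Toy vectors with d = 8 and Hamming distance 2, so a single h collides
# with probability 1 - 2/8 = 0.75.
x = [0, 1, 1, 0, 1, 0, 0, 1]
y = [0, 1, 0, 0, 1, 0, 1, 1]
single = estimate_collision(x, y)
and_p, or_p = amplified_probabilities(0.75, 3)
```

The AND-construction lowers both collision probabilities and the OR-construction raises both, which is why the two are typically combined to widen the gap between $p_1$ and $p_2$.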

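The min-wise hashing scheme above can be checked empirically on a toy ground set. The following Python sketch is an illustration under stated assumptions: it draws full random permutations (which the text notes is infeasible for large $n$, but is fine for a ground set of 20 items), and all function names and parameters are hypothetical.

```python
import random

def jaccard(a, b):
    """Exact Jaccard index of two non-empty sets."""
    return len(a & b) / len(a | b)

def random_permutation(ground_set, rng):
    """Return pi as a dict mapping each item of S to its rank under a random shuffle."""
    items = sorted(ground_set)
    rng.shuffle(items)
    return {item: rank for rank, item in enumerate(items)}

def minhash(subset, pi):
    """h(A) = min over a in A of pi(a)."""
    return min(pi[a] for a in subset)

def estimate_jaccard(a, b, ground_set, num_perms=5000, seed=0):
    """Fraction of random permutations under which h(A) == h(B)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_perms):
        pi = random_permutation(ground_set, rng)
        hits += minhash(a, pi) == minhash(b, pi)
    return hits / num_perms

S = set(range(20))
A = set(range(0, 10))
B = set(range(5, 15))
exact = jaccard(A, B)                 # |A ∩ B| / |A ∪ B| = 5 / 15
estimate = estimate_jaccard(A, B, S)  # should be close to the exact value
```

The estimate converges to the Jaccard index because the minimizer of $\pi$ over $A \cup B$ is equally likely to be any element of the union.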
  • Supper (Spotify)

    Supper (Spotify)

    Supper is a web-based application on the Spotify digital music streaming platform. The Supper app was born from a group of friends who had backgrounds in the music and gastronomy industries. Digital music solutions company Artisan Council later executed it. The app now sits in the top 40 applications on Spotify.

== About ==

The Supper Spotify application matches recipes for all occasions and skill levels with a playlist for both preparation and presentation, as envisioned by the chefs themselves. Supper is credited with being one of the first apps to pair music with food. Playing on the social nature of music and food culture, users can seamlessly experience both for the first time with real-time music streaming.

== Supper.mx ==

In May 2014 Supper was launched outside of the Spotify streaming platform. Though still in partnership with Spotify, supper.mx allows users to view Supper's music + food collaborations on mobile, tablet and desktop, without the need to download Spotify directly.

== Curators ==

All of the recipes and playlists featured on the Supper app come straight from a growing network of tastemakers, including chefs, musicians and institutions around the world. Each month the recipes and playlists are updated in conjunction with current holidays, events and seasons.

=== Launch ===

Launching in October 2013, the first edition of Supper featured content from a range of eating institutions and culture makers from the US and Australia.
    • Brooklyn Bowl (Brooklyn)
    • Roberta's Pizza (Brooklyn)
    • Fancy Hanks (Melbourne)
    • The Foresters/Queenies Upstairs (Sydney)
    • Hipstamatic
    • Panama House (Bondi)
    • Sweetwater Inn (Melbourne)
    • Soul Clap (Syd record label)
    • Yellow Birds (Melbourne)

=== November 2013 ===

    • Yardbird (Hong Kong)
    • Sonoma Bakery (Sydney)
    • Do or Dine (Brooklyn)
    • Cameo Gallery (Brooklyn)
    • Hypertrak (Blog)
    • Blue Smoke (NYC)
    • The Crepes of Wrath (Blog)
    • Willin Low // Wild Rocket - Wild Oats - Relish

=== December 2013 ===

    • The Copper Mill (Sydney)
    • Thug Kitchen
    • Mamak (Sydney)
    • Tutu's (Brooklyn)
    • Chin Chin (Melbourne)
    • Flat Iron Steak (London)
    • Greasy Spoon (Copenhagen)

=== January 2014 ===

    • Mexicali Taco & Co. (LA)
    • Church & State (LA)
    • Salts Cure (LA)
    • Nopa (SF)
    • L & E Oyster (LA)
    • 4100 bar (LA)
    • Golden Gopher (LA)
    • The Pie Hole (LA)
    • State Bird Provisions (SF)

=== Momofuku ===

In February 2014 Supper teamed up with restaurant heavyweight Momofuku. The recipes featured came from their iconic New York, Toronto and Sydney restaurants. Head office also got involved with an instructional from Brand Director Sue Chan on how to paint Momofuku vibes onto any party.

=== SXSW ===

In March 2014 the Supper team migrated to Austin, Texas for SXSW, bringing together the best eateries the city has to offer as well as the music that has influenced them. Restaurants and eateries on board in 2014 included:

    • The Backspace
    • Kelis
    • Swifts Attic
    • Uchi
    • Jackalope
    • Paul Qui/East Side King
    • Thai Kun
    • Wonderland
    • Hole in the Wall
    • Justine's Brasserie
    • The Liberty

=== Kelis ===

In April 2014 Kelis presented five of her recipes paired with a personal playlist for Supper, including apple farro, jerk ribs and New York vanilla bean cheesecake. The Kelis/Supper collaboration coincided with the release of Kelis' 2014 album titled 'Food'.

=== Roberta's Pizza ===

In May 2014 Bushwick's Roberta's Pizza was guest curator on the Supper app and website.
Included in their selections were restaurants and bars from across New York including Bun-ker Vietnamese, Old Stanley's Bar, St. Anselm, Chuko, Frank's Cocktail Lounge, Junior's Cheesecake, Xi'an Famous Foods, Xe Lua, 124 Old Rabbit and Yuji Ramen.

  • Modes of variation

    Modes of variation

    In statistics, modes of variation are a continuously indexed set of vectors or functions that are centered at a mean and are used to depict the variation in a population or sample. Typically, variation patterns in the data can be decomposed in descending order of eigenvalues with the directions represented by the corresponding eigenvectors or eigenfunctions. Modes of variation provide a visualization of this decomposition and an efficient description of variation around the mean. Both in principal component analysis (PCA) and in functional principal component analysis (FPCA), modes of variation play an important role in visualizing and describing the variation in the data contributed by each eigencomponent. In real-world applications, the eigencomponents and associated modes of variation help to interpret complex data, especially in exploratory data analysis (EDA).

== Formulation ==

Modes of variation are a natural extension of PCA and FPCA.

=== Modes of variation in PCA ===

If a random vector $\mathbf{X} = (X_1, X_2, \cdots, X_p)^T$ has the mean vector $\boldsymbol{\mu}_p$ and the covariance matrix $\mathbf{\Sigma}_{p \times p}$ with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0$ and corresponding orthonormal eigenvectors $\mathbf{e}_1, \mathbf{e}_2, \cdots, \mathbf{e}_p$, then by the eigendecomposition of a real symmetric matrix, the covariance matrix $\mathbf{\Sigma}$ can be decomposed as

$$\mathbf{\Sigma} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^T,$$

where $\mathbf{Q}$ is an orthogonal matrix whose columns are the eigenvectors of $\mathbf{\Sigma}$, and $\mathbf{\Lambda}$ is a diagonal matrix whose entries are the eigenvalues of $\mathbf{\Sigma}$. By the Karhunen–Loève expansion for random vectors, one can express the centered random vector in the eigenbasis

$$\mathbf{X} - \boldsymbol{\mu} = \sum_{k=1}^{p} \xi_k \mathbf{e}_k,$$

where $\xi_k = \mathbf{e}_k^T (\mathbf{X} - \boldsymbol{\mu})$ is the principal component associated with the $k$-th eigenvector $\mathbf{e}_k$, with the properties $\operatorname{E}(\xi_k) = 0$, $\operatorname{Var}(\xi_k) = \lambda_k$, and $\operatorname{E}(\xi_k \xi_l) = 0$ for $l \neq k$. Then the $k$-th mode of variation of $\mathbf{X}$ is the set of vectors, indexed by $\alpha$,

$$\mathbf{m}_{k,\alpha} = \boldsymbol{\mu} \pm \alpha \sqrt{\lambda_k}\, \mathbf{e}_k, \quad \alpha \in [-A, A],$$

where $A$ is typically selected as 2 or 3.

=== Modes of variation in FPCA ===

For a square-integrable random function $X(t)$, $t \in \mathcal{T} \subset \mathbb{R}^p$, where typically $p = 1$ and $\mathcal{T}$ is an interval, denote the mean function by $\mu(t) = \operatorname{E}(X(t))$, and the covariance function by

$$G(s,t) = \operatorname{Cov}(X(s), X(t)) = \sum_{k=1}^{\infty} \lambda_k \varphi_k(s) \varphi_k(t),$$

where $\lambda_1 \geq \lambda_2 \geq \cdots \geq 0$ are the eigenvalues and $\{\varphi_1, \varphi_2, \cdots\}$ are the orthonormal eigenfunctions of the linear Hilbert–Schmidt operator

$$G \colon L^2(\mathcal{T}) \rightarrow L^2(\mathcal{T}), \quad G(f) = \int_{\mathcal{T}} G(s,t) f(s)\, ds.$$

By the Karhunen–Loève theorem, one can express the centered function in the eigenbasis,

$$X(t) - \mu(t) = \sum_{k=1}^{\infty} \xi_k \varphi_k(t),$$

where $\xi_k = \int_{\mathcal{T}} (X(t) - \mu(t)) \varphi_k(t)\, dt$ is the $k$-th principal component with the properties $\operatorname{E}(\xi_k) = 0$, $\operatorname{Var}(\xi_k) = \lambda_k$, and $\operatorname{E}(\xi_k \xi_l) = 0$ for $l \neq k$. Then the $k$-th mode of variation of $X(t)$ is the set of functions, indexed by $\alpha$,

$$m_{k,\alpha}(t) = \mu(t) \pm \alpha \sqrt{\lambda_k}\, \varphi_k(t), \quad t \in \mathcal{T},\ \alpha \in [-A, A],$$

that are viewed simultaneously over the range of $\alpha$, usually for $A = 2$ or $3$.

== Estimation ==

The formulation above is derived from properties of the population. Estimation is needed in real-world applications. The key idea is to estimate the mean and covariance.

=== Modes of variation in PCA ===

Suppose the data $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_n$ represent $n$ independent drawings from some $p$-dimensional population $\mathbf{X}$ with mean vector $\boldsymbol{\mu}$ and covariance matrix $\mathbf{\Sigma}$. These data yield the sample mean vector $\overline{\mathbf{x}}$ and the sample covariance matrix $\mathbf{S}$ with eigenvalue-eigenvector pairs $(\hat{\lambda}_1, \hat{\mathbf{e}}_1), (\hat{\lambda}_2, \hat{\mathbf{e}}_2), \cdots, (\hat{\lambda}_p, \hat{\mathbf{e}}_p)$. Then the $k$-th mode of variation of $\mathbf{X}$ can be estimated by

$$\hat{\mathbf{m}}_{k,\alpha} = \overline{\mathbf{x}} \pm \alpha \sqrt{\hat{\lambda}_k}\, \hat{\mathbf{e}}_k, \quad \alpha \in [-A, A].$$

=== Modes of variation in FPCA ===

Consider $n$ realizations $X_1(t), X_2(t), \cdots, X_n(t)$ of a square-integrable random function $X(t)$, $t \in \mathcal{T}$, with the mean function $\mu(t) = \operatorname{E}(X(t))$ and the covariance function $G(s,t) = \operatorname{Cov}(X(s), X(t))$. Functional principal component analysis provides methods for the estimation of $\mu(t)$ and $G(s,t)$ in detail, often involving pointwise estimation and interpolation. Substituting estimates for the unknown quantities, the $k$-th mode of variation of $X(t)$ can be estimated by

$$\hat{m}_{k,\alpha}(t) = \hat{\mu}(t) \pm \alpha \sqrt{\hat{\lambda}_k}\, \hat{\varphi}_k(t), \quad t \in \mathcal{T},\ \alpha \in [-A, A].$$

== Applications ==

Modes of variation are useful to visualize and describe the variation patterns in the data sorted by the eigenvalues. In real-world applications, modes of variation associated with eigencomponents make it possible to interpret complex data, such as the evolution of function traits and other infinite-dimensional data. To illustrate how modes of variation work in practice, two examples are shown in the graphs to the right, which display the first two modes of variation. The solid curve represents the sample mean function. The dashed, dot-dashed, and dotted curves correspond to modes of variation with $\alpha = \pm 1, \pm 2,$ and $\pm 3$, respectively.
The first graph displays the first two modes of variation of female mortality data from 41 countries in 2003. The object of interest is the log hazard function between ages 0 and 100 years. The first mode of variation suggests that the variation of female mortality is smaller for ages around 0 or 100, and larger for ages around 25. An appropriate and intuitive interpretation is that mortality around 25 is driven by accidental death, while around 0 or 100, mortality is related to congenital disease or natural death. Compared to female mortality
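
The PCA estimation step described under Estimation can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes power iteration as a stand-in for a full eigendecomposition, so it recovers only the first mode, and it can fail if the starting vector happens to be orthogonal to the leading eigenvector. All function names and the toy data are hypothetical. Since $\alpha$ already ranges over $[-A, A]$, the sketch computes $\overline{\mathbf{x}} + \alpha \sqrt{\hat{\lambda}_1}\, \hat{\mathbf{e}}_1$ for each signed $\alpha$.

```python
import math

def sample_mean(data):
    n, p = len(data), len(data[0])
    return [sum(row[j] for row in data) / n for j in range(p)]

def sample_covariance(data, mean):
    """Unbiased sample covariance matrix S (divides by n - 1)."""
    n, p = len(data), len(mean)
    centered = [[row[j] - mean[j] for j in range(p)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_eigenpair(S, iters=500):
    """Power iteration: approximates the largest eigenvalue and a unit eigenvector."""
    p = len(S)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(S[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam, v

def first_mode_of_variation(data, alphas=(-2, -1, 0, 1, 2)):
    """Estimated first mode: mean + alpha * sqrt(lambda_1) * e_1 for each alpha."""
    mean = sample_mean(data)
    lam, e = leading_eigenpair(sample_covariance(data, mean))
    scale = math.sqrt(lam)
    return {a: [m + a * scale * ei for m, ei in zip(mean, e)] for a in alphas}

# Hypothetical two-dimensional sample.
data = [[2.0, 1.0], [4.0, 3.0], [6.0, 4.0], [8.0, 7.0]]
modes = first_mode_of_variation(data)
```

Plotting the vectors in `modes` for increasing $\alpha$ traces how the data vary along the dominant direction around the sample mean.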

  • Randomized weighted majority algorithm

    Randomized weighted majority algorithm

    The randomized weighted majority algorithm is an algorithm in machine learning theory for aggregating expert predictions to a series of decision problems. It is a simple and effective method based on weighted voting which improves on the mistake bound of the deterministic weighted majority algorithm. In fact, in the limit, its prediction rate can be arbitrarily close to that of the best-predicting expert.

== Example ==

Imagine that every morning before the stock market opens, we get a prediction from each of our "experts" about whether the stock market will go up or down. Our goal is to somehow combine this set of predictions into a single prediction that we then use to make a buy or sell decision for the day. The principal challenge is that we do not know which experts will give better or worse predictions. The RWMA gives us a way to do this combination such that our prediction record will be nearly as good as that of the single expert which, in hindsight, gave the most accurate predictions.

== Motivation ==

In machine learning, the weighted majority algorithm (WMA) is a deterministic meta-learning algorithm for aggregating expert predictions. In pseudocode, the WMA is as follows:

    initialize all experts to weight 1
    for each round:
        add each expert's weight to the option they predicted
        predict the option with the largest weighted sum
        multiply the weights of all experts who predicted wrongly by $\frac{1}{2}$

Suppose there are $n$ experts and the best expert makes $m$ mistakes. Then, the weighted majority algorithm (WMA) makes at most $2.4(\log_2 n + m)$ mistakes. This bound is weak when even the best expert is error-prone. Suppose, for example, the best expert makes a mistake 20% of the time; that is, in $N = 100$ rounds using $n = 10$ experts, the best expert makes $m = 20$ mistakes.
Then, the weighted majority algorithm only guarantees an upper bound of $2.4(\log_2 10 + 20) \approx 56$ mistakes.

As this is a known limitation of the weighted majority algorithm, various strategies have been explored in order to improve the dependence on $m$. In particular, we can do better by introducing randomization. Drawing inspiration from the multiplicative weights update method (MWUM), we will probabilistically make predictions based on how the experts have performed in the past. Similarly to the WMA, every time an expert makes a wrong prediction, we will decrease their weight. Mirroring the MWUM, we will then use the weights to make a probability distribution over the actions and draw our action from this distribution (instead of deterministically picking the majority vote as the WMA does).

== Randomized weighted majority algorithm (RWMA) ==

The randomized weighted majority algorithm is an attempt to improve the dependence of the mistake bound of the WMA on $m$. Instead of predicting based on majority vote, the weights are used as probabilities for choosing the experts in each round and are updated over time (hence the name randomized weighted majority). Precisely, if $w_i$ is the weight of expert $i$, let $W = \sum_i w_i$. We will follow expert $i$ with probability $\frac{w_i}{W}$. This results in the following algorithm:

    initialize all experts to weight 1
    for each round:
        add all experts' weights together to obtain the total weight $W$
        choose expert $i$ randomly with probability $\frac{w_i}{W}$
        predict as the chosen expert predicts
        multiply the weights of all experts who predicted wrongly by $\beta$

The goal is to bound the worst-case expected number of mistakes, assuming that the adversary has to select one of the answers as correct before we make our coin toss. This is a reasonable assumption in, for instance, the stock market example provided above: the variance of a stock price should not depend on the opinions of experts that influence private buy or sell decisions, so we can treat the price change as if it was decided before the experts gave their recommendations for the day.

The randomized algorithm is better in the worst case than the deterministic algorithm (weighted majority algorithm): in the latter, the worst case was when the weights were split 50/50. But in the randomized version, since the weights are used as probabilities, there would still be a 50/50 chance of getting it right. In addition, generalizing to multiplying the weights of the incorrect experts by $\beta < 1$ instead of strictly $\frac{1}{2}$ allows us to trade off between dependence on $m$ and $\log_2 n$. This trade-off will be quantified in the analysis section.

== Analysis ==

Let $W_t$ denote the total weight of all experts at round $t$. Also let $F_t$ denote the fraction of weight placed on experts which predict the wrong answer at round $t$. Finally, let $N$ be the total number of rounds in the process. By definition, $F_t$ is the probability that the algorithm makes a mistake on round $t$.
It follows from the linearity of expectation that if $M$ denotes the total number of mistakes made during the entire process, $E[M] = \sum_{t=1}^{N} F_t$.

After round $t$, the total weight is decreased by $(1-\beta) F_t W_t$, since all weights corresponding to a wrong answer are multiplied by $\beta < 1$. It then follows that $W_{t+1} = W_t (1 - (1-\beta) F_t)$. By telescoping, since $W_1 = n$, it follows that the total weight after the process concludes is

$$W = n \prod_{t=1}^{N} (1 - (1-\beta) F_t).$$

On the other hand, suppose that $m$ is the number of mistakes made by the best-performing expert. At the end, this expert has weight $\beta^m$. It follows, then, that the total weight is at least this much; in other words, $W \geq \beta^m$. This inequality and the above result imply

$$n \prod_{t=1}^{N} (1 - (1-\beta) F_t) \geq \beta^m.$$

Taking the natural logarithm of both sides yields

$$\ln(n) + \sum_{t=1}^{N} \ln(1 - (1-\beta) F_t) \geq m \ln(\beta).$$

Now, the Taylor series of the natural logarithm is

$$\ln(1 - x) = -x - \frac{x^2}{2} - \frac{x^3}{3} - \cdots$$

In particular, it follows that $\ln(1 - (1-\beta) F_t) < -(1-\beta) F_t$. Thus,

$$\ln(n) - (1-\beta) \sum_{t=1}^{N} F_t \geq m \ln(\beta).$$

Recalling that $E[M] = \sum_{t=1}^{N} F_t$ and rearranging, it follows that

$$E[M] \leq \frac{\ln(1/\beta)}{1-\beta}\, m + \frac{1}{1-\beta} \ln(n).$$

Now, as $\beta \to 1$ from below, the first constant tends to $1$; however, the second constant tends to $+\infty$. To quantify this tradeoff, define $\varepsilon = 1 - \beta$ to be the penalty associated with getting a prediction wrong.
Then, again applying the Taylor series of the natural logarithm, the first constant, $\ln(1/\beta)/(1-\beta)$, expands as

$$\frac{-\ln(1-\varepsilon)}{\varepsilon} = 1 + \frac{\varepsilon}{2} + \frac{\varepsilon^2}{3} + \cdots$$

It then follows that the mistake bound, for small $\varepsilon$, can be written in the form $\left(1 + \frac{\varepsilon}{2} + O(\varepsilon^2)\right) m + \varepsilon^{-1} \ln(n)$.

In plain terms, the less we penalize experts for their mistakes, the more that additional experts will lead to initial mistakes, but the closer we get to capturing the predictive accuracy of the best expert as time goes on. In particular, given a sufficiently low value of $\varepsilon$ and enough rounds, the randomized weighted majority algorithm can get arbitrarily close to the correct prediction rate of the best expert.

Moreover, as long as $m$ is sufficiently large compared to $\ln(n)$ (so that their ratio is sufficiently small), we can assign $\varepsilon$ on the order of $\sqrt{\ln(n)/m}$ and obtain an upper bound on the number of mistakes of the form $m + O(\sqrt{m \ln(n)})$. This implies that the "regret bound" on the algorithm (that is, how much worse it performs than the best expert) is sublinear, at $O(\sqrt{m \ln(n)})$.

== Revisiting the motivation ==

Recall that the motivation for the randomized weighted majority algorithm was given by an example where the best expert makes a mistake 20% of the time. Precisely, in $N = 100$ rounds, with $n = 10$ experts, where the best expert makes $m = 20$ mistakes, the deterministic weighted majority algorithm only guarantees an upper bound of $2.4(\log_2 10 + 20) \approx 56$. By the analysis above, it follows that minimizing the number of worst-case expected mistakes is equivalent to minimizing the fun
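
The RWMA pseudocode and the resulting mistake bound can be sketched in Python as follows. This is an illustrative sketch under assumptions: the function names, the input layout (`expert_predictions[t][i]` is expert `i`'s prediction in round `t`) and the fixed random seed are hypothetical choices, and the bound evaluated is the standard form $E[M] \leq (m \ln(1/\beta) + \ln n)/(1-\beta)$, which is consistent with the two constants discussed in the analysis.

```python
import math
import random

def rwma(expert_predictions, outcomes, beta=0.5, seed=0):
    """Run RWMA over all rounds; returns (mistakes made, final expert weights)."""
    rng = random.Random(seed)
    n = len(expert_predictions[0])
    weights = [1.0] * n
    mistakes = 0
    for preds, truth in zip(expert_predictions, outcomes):
        # Follow expert i with probability w_i / W.
        chosen = rng.choices(range(n), weights=weights, k=1)[0]
        if preds[chosen] != truth:
            mistakes += 1
        # Multiply the weight of every expert that was wrong this round by beta.
        weights = [w * beta if p != truth else w for w, p in zip(weights, preds)]
    return mistakes, weights

def expected_mistake_bound(m, n, beta):
    """Upper bound on E[M]: (m ln(1/beta) + ln n) / (1 - beta)."""
    return (m * math.log(1 / beta) + math.log(n)) / (1 - beta)

# Motivating example: n = 10 experts, best expert makes m = 20 mistakes.
bound_half = expected_mistake_bound(20, 10, 0.5)   # roughly 32.3
wma_bound = 2.4 * (math.log2(10) + 20)             # roughly 56
```

With $\beta = \frac{1}{2}$ the expected-mistake bound is already well below the deterministic guarantee of about 56, and other values of $\beta$ can be tried the same way.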

  • Sharpness aware minimization

    Sharpness aware minimization

    Sharpness Aware Minimization (SAM) is an optimization algorithm used in machine learning that aims to improve model generalization. The method seeks to find model parameters that are located in regions of the loss landscape with uniformly low loss values, rather than parameters that only achieve a minimal loss value at a single point. This approach is described as finding "flat" minima instead of "sharp" ones. The rationale is that models trained this way are less sensitive to variations between training and test data, which can lead to better performance on unseen data. The algorithm was introduced in a 2020 paper by a team of researchers including Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur.

== Underlying Principle ==

SAM modifies the standard training objective by minimizing a "sharpness-aware" loss. This is formulated as a minimax problem where the inner objective seeks to find the highest loss value in the immediate neighborhood of the current model weights, and the outer objective minimizes this value:

$$\min_{w} \max_{\|\epsilon\|_p \leq \rho} L_{\text{train}}(w + \epsilon) + \lambda \|w\|_2^2$$

In this formulation:

    • $w$ represents the model's parameters (weights).
    • $L_{\text{train}}$ is the loss calculated on the training data.
    • $\epsilon$ is a perturbation applied to the weights.
    • $\rho$ is a hyperparameter that defines the radius of the neighborhood (an $L_p$ ball) to search for the highest loss.
    • An optional L2 regularization term, scaled by $\lambda$, can be included.

A direct solution to the inner maximization problem is computationally expensive. SAM approximates it by taking a single gradient ascent step to find the perturbation $\epsilon$.
This is calculated as:

$$\epsilon(w) = \rho \frac{\nabla L_{\text{train}}(w)}{\|\nabla L_{\text{train}}(w)\|_2}$$

The optimization process for each training step involves two stages. First, an "ascent step" computes a perturbed set of weights, $w_{\text{adv}} = w + \epsilon(w)$, by moving towards the direction of the highest local loss. Second, a "descent step" updates the original weights $w$ using the gradient calculated at these perturbed weights, $\nabla L_{\text{train}}(w_{\text{adv}})$. This update is typically performed using a standard optimizer like SGD or Adam.

== Application and Performance ==

SAM has been applied in various machine learning contexts, primarily in computer vision. Research has shown it can improve generalization performance in models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) on image datasets including ImageNet, CIFAR-10, and CIFAR-100. The algorithm has also been found to be effective in training models with noisy labels, where it performs comparably to methods designed specifically for this problem. Some studies indicate that SAM and its variants can improve out-of-distribution (OOD) generalization, which is a model's ability to perform well on data from distributions not seen during training. Other areas where it has been applied include gradual domain adaptation and mitigating overfitting in scenarios with repeated exposure to training examples.

== Limitations ==

A primary limitation of SAM is its computational cost. By requiring two gradient computations (one for the ascent and one for the descent) per optimization step, it approximately doubles the training time compared to standard optimizers. The theoretical convergence properties of SAM are still under investigation.
Some research suggests that with a constant step size, SAM may not converge to a stationary point. The accuracy of the single gradient step approximation for finding the worst-case perturbation may also decrease during the training process. The effectiveness of SAM can also be domain-dependent. While it has shown benefits for computer vision tasks, its impact on other areas, such as GPT-style language models where each training example is seen only once, has been reported as limited in some studies. Furthermore, while SAM seeks flat minima, some research suggests that not all flat minima necessarily lead to good generalization. The algorithm also introduces the neighborhood size $\rho$ as a new hyperparameter, which requires tuning.

== Research, Variants, and Enhancements ==

Active research on SAM focuses on reducing its computational overhead and improving its performance. Several variants have been proposed to make the algorithm more efficient. These include methods that attempt to parallelize the two gradient computations, apply the perturbation to only a subset of parameters, or reduce the number of computation steps required. Other approaches use historical gradient information or apply SAM steps intermittently to lower the computational burden.

To improve performance and robustness, variants have been developed that adapt the neighborhood size based on model parameter scales (Adaptive SAM or ASAM) or incorporate information about the curvature of the loss landscape (Curvature Regularized SAM or CR-SAM). Other research explores refining the perturbation step by focusing on specific components of the gradient or combining SAM with techniques like random smoothing. Theoretical work continues to analyze the algorithm's behavior, including its implicit bias towards flatter minima and the development of broader frameworks for sharpness-aware optimization that use different measures of sharpness.
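
The two-stage update described under Underlying Principle (an ascent step to the perturbed weights, then a descent step applied to the original weights) can be sketched in plain Python. This is a minimal illustration under assumptions: it uses plain gradient descent as the base optimizer, omits the optional L2 term, and runs on a hypothetical two-parameter quadratic loss; `sam_step`, `toy_loss` and `toy_grad` are illustrative names, not part of any library.

```python
import math

def sam_step(w, grad_fn, rho=0.05, lr=0.05):
    """One SAM update using plain gradient descent as the base optimizer."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # zero gradient: no ascent direction is defined
    # Ascent step: w_adv = w + rho * g / ||g||_2
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: update the ORIGINAL weights with the gradient taken at w_adv.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def toy_loss(w):
    """Hypothetical quadratic loss with one flat and one sharp direction."""
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def toy_grad(w):
    return [w[0], 10.0 * w[1]]

w = [1.0, 1.0]
initial_loss = toy_loss(w)
for _ in range(200):
    w = sam_step(w, toy_grad)
final_loss = toy_loss(w)
```

Each call to `sam_step` evaluates the gradient twice, which is the source of the roughly doubled cost noted under Limitations; with a constant step size the iterates in this toy run hover near the minimum rather than settling exactly on it.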

  • Granular computing

    Granular computing

    Granular computing is an emerging computing paradigm of information processing that concerns the processing of complex information entities called "information granules", which arise in the process of data abstraction and derivation of knowledge from information or data. Generally speaking, information granules are collections of entities that usually originate at the numeric level and are arranged together due to their similarity, functional or physical adjacency, indistinguishability, coherency, or the like. At present, granular computing is more a theoretical perspective than a coherent set of methods or principles. As a theoretical perspective, it encourages an approach to data that recognizes and exploits the knowledge present in data at various levels of resolution or scales. In this sense, it encompasses all methods which provide flexibility and adaptability in the resolution at which knowledge or information is extracted and represented.

== Types of granulation ==

As mentioned above, granular computing is not an algorithm or process; there is no particular method that is called "granular computing". It is rather an approach to looking at data that recognizes how different and interesting regularities in the data can appear at different levels of granularity, much as different features become salient in satellite images of greater or lesser resolution. On a low-resolution satellite image, for example, one might notice interesting cloud patterns representing cyclones or other large-scale weather phenomena, while in a higher-resolution image, one misses these large-scale atmospheric phenomena but instead notices smaller-scale phenomena, such as the interesting pattern that is the streets of Manhattan. The same is generally true of all data: At different resolutions or granularities, different features and relationships emerge.
The aim of granular computing is to try to take advantage of this fact in designing more effective machine-learning and reasoning systems. There are several types of granularity that are often encountered in data mining and machine learning, and we review them below:

=== Value granulation (discretization/quantization) ===

One type of granulation is the quantization of variables. It is very common that in data mining or machine-learning applications the resolution of variables needs to be decreased in order to extract meaningful regularities. An example of this would be a variable such as "outside temperature" (temp), which in a given application might be recorded to several decimal places of precision (depending on the sensing apparatus). However, for purposes of extracting relationships between "outside temperature" and, say, "number of health-club applications" (club), it will generally be advantageous to quantize "outside temperature" into a smaller number of intervals.

==== Motivations ====

There are several interrelated reasons for granulating variables in this fashion:

Based on prior domain knowledge, there is no expectation that minute variations in temperature (e.g., the difference between 80 °F and 80.7 °F, or 26.7 °C and 27.1 °C) could have an influence on behaviors driving the number of health-club applications. For this reason, any "regularity" which our learning algorithms might detect at this level of resolution would have to be spurious, an artifact of overfitting. By coarsening the temperature variable into intervals whose differences we do anticipate (based on prior domain knowledge) might influence the number of health-club applications, we eliminate the possibility of detecting these spurious patterns. Thus, in this case, reducing resolution is a method of controlling overfitting.

By reducing the number of intervals in the temperature variable (i.e., increasing its grain size), we increase the amount of sample data indexed by each interval designation.
Thus, by coarsening the variable, we increase sample sizes and achieve better statistical estimation. In this sense, increasing granularity provides an antidote to the so-called curse of dimensionality, which relates to the exponential decrease in statistical power with increase in number of dimensions or variable cardinality.

Independent of prior domain knowledge, it is often the case that meaningful regularities (i.e., which can be detected by a given learning methodology, representational language, etc.) may exist at one level of resolution and not at another. For example, a simple learner or pattern recognition system may seek to extract regularities satisfying a conditional probability threshold such as $p(Y = y_j \mid X = x_i) \geq \alpha$. In the special case where $\alpha = 1$, this recognition system is essentially detecting logical implication of the form $X = x_i \rightarrow Y = y_j$ or, in words, "if $X = x_i$, then $Y = y_j$". The system's ability to recognize such implications (or, in general, conditional probabilities exceeding threshold) is partially contingent on the resolution with which the system analyzes the variables.

As an example of this last point, consider the feature space shown to the right. The variables may each be regarded at two different resolutions. Variable $X$ may be regarded at a high (quaternary) resolution wherein it takes on the four values $\{x_1, x_2, x_3, x_4\}$ or at a lower (binary) resolution wherein it takes on the two values $\{X_1, X_2\}$. Similarly, variable $Y$ may be regarded at a high (quaternary) resolution or at a lower (binary) resolution, where it takes on the values $\{y_1, y_2, y_3, y_4\}$ or $\{Y_1, Y_2\}$, respectively. At the high resolution, there are no detectable implications of the form $X = x_i \rightarrow Y = y_j$, since every $x_i$ is associated with more than one $y_j$, and thus, for all $x_i$, $p(Y = y_j \mid X = x_i) < 1$. However, at the low (binary) variable resolution, two bilateral implications become detectable: $X = X_1 \leftrightarrow Y = Y_1$ and $X = X_2 \leftrightarrow Y = Y_2$, since every $X_1$ occurs iff $Y_1$ and $X_2$ occurs iff $Y_2$. Thus, a pattern recognition system scanning for implications of this kind would find them at the binary variable resolution, but would fail to find them at the higher quaternary variable resolution.

==== Issues and methods ====

It is not feasible to exhaustively test all possible discretization resolutions on all variables in order to see which combination of resolutions yields interesting or significant results. Instead, the feature space must be preprocessed (often by an entropy analysis of some kind) so that some guidance can be given as to how the discretization process should proceed. Moreover, one cannot generally achieve good results by naively analyzing and discretizing each variable independently, since this may obliterate the very interactions that we had hoped to discover.
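The dependence of detectable implications on resolution, described in the quaternary/binary example above, can be reproduced with a small sketch. The observations in `data` are hypothetical (the original figure is not available here) and were chosen so that every quaternary X value co-occurs with two Y values. The helper names `implications` and `coarsen` are likewise illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical (x index, y index) observations; NOT the data from the original figure.
data = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3), (3, 4), (4, 3), (4, 4)]

def implications(pairs, alpha=1.0):
    """Return {x: y} for every x value with p(Y = y | X = x) >= alpha."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, y in pairs:
        counts[x][y] += 1
    found = {}
    for x, ys in counts.items():
        total = sum(ys.values())
        for y, n in ys.items():
            if n / total >= alpha:
                found[x] = y
    return found

def coarsen(v):
    """Map the quaternary values {1, 2} to 1 and {3, 4} to 2 (binary resolution)."""
    return 1 if v <= 2 else 2

fine = implications(data)
coarse = implications([(coarsen(x), coarsen(y)) for x, y in data])
print(fine)    # {}
print(coarse)  # {1: 1, 2: 2}
```

With the threshold α = 1, no implication is found at the quaternary resolution, while both implications appear once each variable is coarsened to two values.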
A sample of papers that address the problem of variable discretization in general, and multiple-variable discretization in particular, is as follows: Chiu, Wong & Cheung (1991), Bay (2001), Liu et al. (2002), Wang & Liu (1998), Zighed, Rabaséda & Rakotomalala (1998), Catlett (1991), Dougherty, Kohavi & Sahami (1995), Monti & Cooper (1999), Fayyad & Irani (1993), Chiu, Cheung & Wong (1990), Nguyen & Nguyen (1998), Grzymala-Busse & Stefanowski (2001), Ting (1994), Ludl & Widmer (2000), Pfahringer (1995), An & Cercone (1999), Chiu & Cheung (1989), Chmielewski & Grzymala-Busse (1996), Lee & Shin (1994), Liu & Wellman (2002), Liu & Wellman (2004).

=== Variable granulation (clustering/aggregation/transformation) ===

Variable granulation is a term that could describe a variety of techniques, most of which are aimed at reducing dimensionality, redundancy, and storage requirements. We briefly describe some of the ideas here, and present pointers to the literature.

==== Variable transformation ====

A number of classical methods, such as principal component analysis, multidimensional scaling, factor analysis, and structural equation modeling, and their relatives, fall under the genus of "variable transformation." Also in this category are more modern areas of study such as dimensionality reduction, projection pursuit, and independent component analysis. The common goal of these methods in general is to find a representation of the data in terms of new variables, which are a linear or nonlinear transformation of the original variables, and in which important stati

    Read more →
  • Almeida–Pineda recurrent backpropagation

    Almeida–Pineda recurrent backpropagation

    Almeida–Pineda recurrent backpropagation is an extension of the backpropagation algorithm that is applicable to recurrent neural networks. It is a type of supervised learning. It was described somewhat cryptically in Richard Feynman's senior thesis, and rediscovered independently in the context of artificial neural networks by both Fernando Pineda and Luis B. Almeida. A recurrent neural network for this algorithm consists of some input units, some output units and possibly some hidden units. For a given set of (input, target) states, the network is trained to settle into a stable activation state with the output units in the target state, based on a given input state clamped on the input units.
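    The idea of a network settling into a stable activation state can be sketched as a fixed-point iteration. The sketch covers only this relaxation phase, not the weight update. The tanh activation, the modelling of the clamped input as a constant external drive, the function name `settle`, and the small weight matrix `W` are all assumptions made for illustration.

```python
import math

def settle(weights, clamped_input, tol=1e-10, max_iters=1000):
    """Iterate x <- tanh(W x + input) until the activations stop changing."""
    n = len(weights)
    x = [0.0] * n
    for _ in range(max_iters):
        new_x = [
            math.tanh(sum(weights[i][j] * x[j] for j in range(n)) + clamped_input[i])
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    raise RuntimeError("network did not settle")

# Hypothetical recurrent weights, small enough that the iteration is a contraction.
W = [[0.0, 0.3, -0.2],
     [0.1, 0.0, 0.4],
     [-0.3, 0.2, 0.0]]
external = [0.5, 0.0, 0.0]   # constant drive standing in for the clamped input
state = settle(W, external)
print(state)
```

    Training would then compare the settled output units with the target state; that step is not shown here.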

    Read more →
  • Premature convergence

    Premature convergence

    Premature convergence is an unwanted effect in evolutionary algorithms (EA), a metaheuristic that mimics the basic principles of biological evolution as a computer algorithm for solving an optimization problem. The effect means that the population of an EA has converged too early, so that the result is suboptimal. In this context, the parental solutions, through the aid of genetic operators, are not able to generate offspring that are superior to, or outperform, their parents. Premature convergence is a common problem found in evolutionary algorithms, as it leads to a loss, or convergence, of a large number of alleles, subsequently making it very difficult to search for a specific gene in which the alleles were present. An allele is considered lost if all individuals in a population share the same value for a particular gene. An allele is, as defined by De Jong, considered to be a converged allele when 95% of a population share the same value for a certain gene.

== Strategies for preventing premature convergence ==

Strategies to regain genetic variation include: a mating strategy called incest prevention; uniform crossover; mimicking sexual selection; favored replacement of similar individuals (preselection or crowding); segmentation of individuals of similar fitness (fitness sharing); increasing population size; and niching and speciation. The genetic variation can also be regained by mutation, though this process is highly random. A general strategy to reduce the risk of premature convergence is to use structured populations instead of the commonly used panmictic ones.

== Identification of the occurrence of premature convergence ==

It is hard to determine when premature convergence has occurred, and it is equally hard to predict its presence in the future. One measure is to use the difference between the average and maximum fitness values, as used by Patnaik & Srinivas, to then vary the crossover and mutation probabilities.
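The definitions of lost and converged alleles given above can be turned into a simple diagnostic. The sketch below assumes that individuals are equal-length lists of gene values and reads "95% of a population" as "at least 95%". The function names and the example population are hypothetical.

```python
from collections import Counter

def converged_genes(population, threshold=0.95):
    """Indices of genes for which at least `threshold` of individuals share one value."""
    size = len(population)
    result = []
    for gene in range(len(population[0])):
        top_count = Counter(ind[gene] for ind in population).most_common(1)[0][1]
        if top_count / size >= threshold:
            result.append(gene)
    return result

def lost_genes(population):
    """Indices of genes for which every individual carries the same value."""
    return converged_genes(population, threshold=1.0)

# Hypothetical population of 20 individuals with 3 genes each:
# gene 0 is identical everywhere, gene 1 is shared by 19 of 20, gene 2 is split evenly.
population = [[1, 1, i % 2] for i in range(19)] + [[1, 0, 1]]
print(converged_genes(population))  # [0, 1]
print(lost_genes(population))       # [0]
```

Tracking how these lists grow over the generations gives one rough indication of how far a population has converged.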
Population diversity is another measure which has been extensively used in studies to measure premature convergence. However, although it has been widely accepted that a decrease in population diversity directly leads to premature convergence, few studies have analyzed population diversity itself. In other words, an argument about preventing premature convergence that relies on the term population diversity lacks robustness unless the study specifies its definition of population diversity. There are models to counter the effect and risk of premature convergence that do not compromise core GA parameters like population size, mutation rate, and other core mechanisms. These models were inspired by biological ecology, where genetic interactions are limited by external mechanisms such as spatial topologies or speciation. These ecological models, such as the Eco-GA, adopt diffusion-based strategies to improve the robustness of GA runs and increase the likelihood of reaching near-global optima.

== Causes for premature convergence ==

There are a number of presumed or hypothesized causes for the occurrence of premature convergence.

=== Self-adaptive mutations ===

Rechenberg introduced the idea of self-adaptation of mutation distributions in evolution strategies. According to Rechenberg, the control parameters for these mutation distributions evolve internally through self-adaptation, rather than being predetermined. He called it the 1/5-success rule of the (1 + 1)-ES evolution strategy: the step size control parameter is increased by some factor if the relative frequency of positive mutations over a given period of time is larger than 1/5, and decreased if it is smaller than 1/5. Self-adaptive mutations may very well be one of the causes for premature convergence. Self-adaptive mutation can improve the accuracy with which optima are located and accelerate the search for them.
This has been widely recognized, though the underlying mechanism has been poorly studied, as it is often unclear whether the optimum found is local or global. Self-adaptive methods can converge to the global optimum, provided that the selection method uses elitism and that the self-adaptation rule does not interfere with the mutation distribution, which has the property of ensuring a positive minimum probability of hitting a random subset. This holds for non-convex objective functions whose lower level sets are bounded and have non-zero measure. A study by Rudolph suggests that self-adaptation mechanisms among elitist evolution strategies that resemble the 1/5-success rule could very well get caught in a local optimum with positive probability.

=== Panmictic populations ===

Most EAs use unstructured or panmictic populations where basically every individual in the population is eligible for mate selection based on fitness. Thus, the genetic information of an only slightly better individual can spread in a population within a few generations, provided that no other, better offspring is produced during this time. Especially in comparatively small populations, this can quickly lead to a loss of genotypic diversity and thus to premature convergence. A well-known countermeasure is to switch to alternative population models which introduce substructures into the population that preserve genotypic diversity over a longer period of time and thus counteract the tendency towards premature convergence. This has been shown for various EAs such as genetic algorithms, the evolution strategy, other EAs or memetic algorithms.
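The 1/5-success rule described under "Self-adaptive mutations" can be sketched for a (1 + 1)-ES as follows. The adaptation factor of 1.5, the observation window of 20 generations, the Gaussian mutation, the sphere test function, and all names are assumptions chosen for illustration; the text only specifies the direction of the step-size change.

```python
import random

def one_plus_one_es(f, x, sigma=1.0, factor=1.5, window=20, generations=2000, seed=0):
    """Minimize f with a (1+1)-ES whose step size follows a 1/5-success rule."""
    rng = random.Random(seed)
    fx = f(x)
    successes = 0
    for g in range(1, generations + 1):
        child = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        fc = f(child)
        if fc < fx:              # positive mutation: the child replaces the parent
            x, fx = child, fc
            successes += 1
        if g % window == 0:      # adapt the step size once per observation window
            rate = successes / window
            if rate > 0.2:
                sigma *= factor  # too many successes: take larger steps
            elif rate < 0.2:
                sigma /= factor  # too few successes: take smaller steps
            successes = 0
    return x, fx, sigma

def sphere(v):
    """Unimodal test function with its minimum at the origin."""
    return sum(vi * vi for vi in v)

best, best_f, final_sigma = one_plus_one_es(sphere, [5.0, -3.0, 2.0])
print(best_f, final_sigma)
```

On a unimodal function such as this one the shrinking step size is harmless. On a multimodal function the same mechanism can shrink the step size around a local optimum, which is the risk of premature convergence discussed above.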

    Read more →
  • Elastic map

    Elastic map

    Elastic maps provide a tool for nonlinear dimensionality reduction. By their construction, they are a system of elastic springs embedded in the data space. This system approximates a low-dimensional manifold. The elastic coefficients of this system allow the switch from completely unstructured k-means clustering (zero elasticity) to estimators located close to linear PCA manifolds (for high bending and low stretching moduli). With some intermediate values of the elasticity coefficients, this system effectively approximates non-linear principal manifolds. This approach is based on a mechanical analogy between principal manifolds, which pass through "the middle" of the data distribution, and elastic membranes and plates. The method was developed by A.N. Gorban, A.Y. Zinovyev and A.A. Pitenko in 1996–1998.

== Energy of elastic map ==

Let $\mathcal{S}$ be a data set in a finite-dimensional Euclidean space. An elastic map is represented by a set of nodes $\mathbf{w}_j$ in the same space. Each datapoint $s \in \mathcal{S}$ has a host node, namely the closest node $\mathbf{w}_j$ (if there are several closest nodes then one takes the node with the smallest number). The data set $\mathcal{S}$ is divided into classes $K_j = \{ s \mid \mathbf{w}_j \text{ is a host of } s \}$. The approximation energy $D$ is the distortion $D = \frac{1}{2} \sum_{j=1}^{k} \sum_{s \in K_j} \| s - \mathbf{w}_j \|^2$, which is the energy of the springs with unit elasticity which connect each data point with its host node. It is possible to apply weighting factors to the terms of this sum, for example to reflect the standard deviation of the probability density function of any subset of data points $\{s_i\}$. On the set of nodes an additional structure is defined.
Some pairs of nodes, $(\mathbf{w}_i, \mathbf{w}_j)$, are connected by elastic edges. Call this set of pairs $E$. Some triplets of nodes, $(\mathbf{w}_i, \mathbf{w}_j, \mathbf{w}_k)$, form bending ribs. Call this set of triplets $G$. The stretching energy is $U_E = \frac{1}{2} \lambda \sum_{(\mathbf{w}_i, \mathbf{w}_j) \in E} \| \mathbf{w}_i - \mathbf{w}_j \|^2$ and the bending energy is $U_G = \frac{1}{2} \mu \sum_{(\mathbf{w}_i, \mathbf{w}_j, \mathbf{w}_k) \in G} \| \mathbf{w}_i - 2\mathbf{w}_j + \mathbf{w}_k \|^2$, where $\lambda$ and $\mu$ are the stretching and bending moduli respectively. The stretching energy is sometimes referred to as the membrane term, while the bending energy is referred to as the thin plate term. For example, on the 2D rectangular grid the elastic edges are just vertical and horizontal edges (pairs of closest vertices) and the bending ribs are the vertical or horizontal triplets of consecutive (closest) vertices. The total energy of the elastic map is thus $U = D + U_E + U_G$. The position of the nodes $\{\mathbf{w}_j\}$ is determined by the mechanical equilibrium of the elastic map, i.e. its location is such that it minimizes the total energy $U$.

== Expectation-maximization algorithm ==

For a given splitting of the dataset $\mathcal{S}$ into classes $K_j$, minimization of the quadratic functional $U$ is a linear problem with a sparse matrix of coefficients.
Therefore, similar to principal component analysis or k-means, a splitting method is used: for given $\{\mathbf{w}_j\}$ find $\{K_j\}$; for given $\{K_j\}$ minimize $U$ and find $\{\mathbf{w}_j\}$; if no change, terminate. This expectation-maximization algorithm guarantees a local minimum of $U$. Various additional methods have been proposed for improving the approximation. For example, the softening strategy is used. This strategy starts with rigid grids (small length, small bending and large elasticity moduli $\lambda$ and $\mu$) and finishes with soft grids (small $\lambda$ and $\mu$). The training goes in several epochs, each epoch with its own grid rigidity. Another adaptive strategy is the growing net: one starts from a small number of nodes and gradually adds new nodes. Each epoch goes with its own number of nodes.

== Applications ==

The most important applications of the method and its free software are in bioinformatics for exploratory data analysis and visualisation of multidimensional data, for data visualisation in economics, social and political sciences, as an auxiliary tool for data mapping in geographic informational systems and for visualisation of data of various nature. The method is applied in quantitative biology for reconstructing the curved surface of a tree leaf from a stack of light microscopy images. This reconstruction is used for quantifying the geodesic distances between trichomes and their patterning, which is a marker of the capability of a plant to resist pathogens. Recently, the method has been adapted as a support tool in the decision process underlying the selection, optimization, and management of financial portfolios.
The method of elastic maps has been systematically tested and compared with several machine learning methods on the applied problem of identifying the flow regime of a gas-liquid flow in a pipe. There are various regimes: single-phase water or air flow, bubbly flow, bubbly-slug flow, slug flow, slug-churn flow, churn flow, churn-annular flow, and annular flow. The simplest and most common method used to identify the flow regime is visual observation. This approach is, however, subjective and unsuitable for relatively high gas and liquid flow rates. Machine learning methods have therefore been proposed by many authors. The methods are applied to differential pressure data collected during a calibration process. The method of elastic maps provided a 2D map in which the area of each regime is represented. The comparison with some other machine learning methods is presented in Table 1 for various pipe diameters and pressures. Here, ANN stands for backpropagation artificial neural networks, SVM for the support vector machine, and SOM for self-organizing maps. A hybrid technology was developed for engineering applications. In this technology, elastic maps are used in combination with principal component analysis (PCA), independent component analysis (ICA) and backpropagation ANN. The textbook provides a systematic comparison of elastic maps and self-organizing maps (SOMs) in applications to economic and financial decision-making.
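The energy terms defined in the "Energy of elastic map" section (the approximation energy D, the stretching energy U_E, the bending energy U_G, and their sum U) can be evaluated directly from their definitions. The sketch below assumes that nodes and data points are plain coordinate lists and that edges and ribs are given as tuples of node indices. The function names and the three-node example are hypothetical, and the minimization step of the algorithm is not shown.

```python
def sq_norm(v):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(c * c for c in v)

def elastic_energy(data, nodes, edges, ribs, lam, mu):
    """Return (D, U_E, U_G, U) for the given node positions."""
    # Approximation energy: every data point is attached to its closest (host) node.
    D = 0.0
    for s in data:
        D += 0.5 * min(sq_norm([a - b for a, b in zip(s, w)]) for w in nodes)
    # Stretching energy over elastic edges (i, j).
    U_E = 0.5 * lam * sum(
        sq_norm([a - b for a, b in zip(nodes[i], nodes[j])]) for i, j in edges
    )
    # Bending energy over ribs (i, j, k), with j the middle node.
    U_G = 0.5 * mu * sum(
        sq_norm([a - 2 * b + c for a, b, c in zip(nodes[i], nodes[j], nodes[k])])
        for i, j, k in ribs
    )
    return D, U_E, U_G, D + U_E + U_G

# Hypothetical example: three collinear nodes forming one chain with a single rib.
nodes = [[0, 0], [1, 0], [2, 0]]
edges = [(0, 1), (1, 2)]
ribs = [(0, 1, 2)]
data = [[0, 1], [2, -1], [1, 0]]
print(elastic_energy(data, nodes, edges, ribs, lam=0.5, mu=0.1))  # (1.0, 0.5, 0.0, 1.5)
```

Because the three nodes are collinear and equally spaced, the bending term vanishes; moving the middle node off the line makes U_G positive.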

    Read more →
  • Blackmagic Design

    Blackmagic Design

    Blackmagic Design Pty Ltd is an Australian company that develops digital cinema technology and manufactures professional video production hardware and software. Headquartered in South Melbourne, it is known for producing high-end digital movie cameras and a range of broadcast and post-production equipment. The company also develops software applications, including the DaVinci Resolve application for non-linear video editing, color correction, color grading, visual effects, and audio post-production.

== History ==

Blackmagic Design Pty Ltd was founded on 7 September 2001 by Grant Petty. Its first product, DeckLink, introduced in 2002, was a video capture card for macOS that supported uncompressed 10-bit video, marking a shift toward professional-grade yet affordable video workflows. Subsequent versions, including the DeckLink 2, Pro SDI, HD Plus, and Multibridge, added capabilities such as color correction, Windows support, and compatibility with major editing software like Adobe Premiere Pro, to broaden the product's appeal. At the 2012 NAB Show, Blackmagic announced its first Cinema Camera, a digital movie camera. Blackmagic made several acquisitions over the next decade. In 2009, it acquired da Vinci Systems, known for its color-grading tools. In 2010, it acquired Echolab's ATEM switcher line; in 2014, it added eyeon Software (developer of the Blackmagic Fusion compositing software) and London's Cintel (film scanning and restoration); and in 2016, it acquired Fairlight, an audio technology company known for its CMI synthesizers as well as mixing consoles.

== Products ==

List of all products developed by the company.

Editing, Color Correction and Audio Post Production
DaVinci Resolve (free version) and DaVinci Resolve Studio (paid version), computer software for non-linear video editing, color correction, color grading, visual effects, and audio post-production.
Audio/Video Controller Consoles: Editor Keyboard, Speed Editor, DaVinci Resolve Replay Editor, Micro Panel, Mini Panel, DaVinci Resolve Micro Color Panel, Advanced Panel, Fairlight Console Channel Fader, Fairlight Console Channel Control, Fairlight Console LCD Monitor, Fairlight Console Audio Editor, Fairlight Desktop Audio Editor, Fairlight Desktop Console, Fairlight Audio Interface
Cintel Film Scanner (Generations 1-3)

Live Production
Home Streaming: ATEM Mini, ATEM Mini Pro/ISO, ATEM Mini Extreme, ATEM Mini Extreme ISO (the ATEM Mini series has both HDMI and SDI variants)
Production Switchers: ATEM 1, 2 & 4 M/E Constellation HD, ATEM 1, 2 & 4 M/E Constellation 4K, ATEM Constellation 8K, ATEM 1, 2 & 4 M/E Production Studio 4K, ATEM Television Studio HD8 & HD8 ISO
Switcher & Camera Controllers: ATEM Camera Control Panel, ATEM 1 M/E Advanced Panel, ATEM 2 M/E Advanced Panel, ATEM 4 M/E Advanced Panel
Chroma Keyers: Ultimatte 12 HD Mini, Ultimatte 12 HD, Ultimatte 12 4K, Ultimatte 12 8K
Recording and Storage: HyperDeck Studio HD Mini, HyperDeck Studio HD Plus, HyperDeck Studio 4K Pro, HyperDeck Extreme 8K HDR, HyperDeck Extreme 4K HDR, HyperDeck Extreme Control, HyperDeck Shuttle HD, Duplicator 4K, MultiDock 10G, Video Assist 7" 12G HDR, Video Assist 5" 12G HDR

Capture and Playback
UltraStudio: 3G, HD Mini, 4K Mini, 4K Extreme 3
DeckLink (PCIe cards): Mini Recorder, Mini Monitor, Mini Monitor 4K, Mini Recorder 4K, Duo 2 Mini, Duo 2, Quad 2, SDI 4K, Studio 4K, 4K Extreme 12G, 8K Pro, Quad HDMI Recorder

Network Storage
Cloud Store
Cloud Pod

Broadcast Converters
Micro Converter: BiDirectional SDI/HDMI 3G wPSU, HDMI to SDI 3G wPSU, SDI to HDMI 3G wPSU, BiDirectional SDI/HDMI 3G, HDMI to SDI 3G, SDI to HDMI 3G
Mini Converters: Audio to SDI, Optical Fiber 12G, SDI Multiplex 4K, Quad SDI to HDMI 4K, SDI Distribution 4K, SDI to Analog 4K, Audio to SDI 4K, SDI to Audio 4K, HDMI to SDI 6G, SDI to HDMI 6G
Teranex Mini: SDI Distribution 12G, SDI to HDMI 12G, Audio to SDI 12G, SDI to Analog 12G, SDI to HDMI 8K HDR, SDI to DisplayPort 8K HDR
2110 IP Converters

Routing and Distribution
Videohub

    Read more →
  • Artificial development

    Artificial development

    Artificial development, also known as artificial embryogeny or machine intelligence or computational development, is an area of computer science and engineering concerned with computational models motivated by genotype–phenotype mappings in biological systems. Artificial development is often considered a sub-field of evolutionary computation, although the principles of artificial development have also been used within stand-alone computational models. Within evolutionary computation, the need for artificial development techniques was motivated by the perceived lack of scalability and evolvability of direct solution encodings (Tufte, 2008). Artificial development entails indirect solution encoding. Rather than describing a solution directly, an indirect encoding describes (either explicitly or implicitly) the process by which a solution is constructed. Often, but not always, these indirect encodings are based upon biological principles of development such as morphogen gradients, cell division and cellular differentiation (e.g. Doursat 2008), gene regulatory networks (e.g. Guo et al., 2009), degeneracy (Whitacre et al., 2010), grammatical evolution (de Salabert et al., 2006), or analogous computational processes such as re-writing, iteration, and time. The influences of interaction with the environment, spatiality and physical constraints on differentiated multi-cellular development have been investigated more recently (e.g. Knabe et al. 2008). Artificial development approaches have been applied to a number of computational and design problems, including electronic circuit design (Miller and Banzhaf 2003), robotic controllers (e.g. Taylor 2004), and the design of physical structures (e.g. Hornby 2004).

    Read more →
  • Tensor product network

    Tensor product network

    A tensor product network, in artificial neural networks, is a network that exploits the properties of tensors to model associative concepts such as variable assignment. Orthonormal vectors are chosen to model the ideas (such as variable names and target assignments), and the tensor product of these vectors constructs a network whose mathematical properties allow the user to easily extract the association from it.
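    The binding and extraction described above can be sketched with plain lists. The sketch assumes a rank-2 network that stores variable/value pairs as a sum of outer products and recovers a value by contracting the stored tensor with the variable's vector. The vectors, the names `x`, `y`, `three`, and `seven`, and the helper functions are hypothetical.

```python
def outer(u, v):
    """Tensor (outer) product of two vectors as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def unbind(memory, variable):
    """Contract the memory with a variable vector to recover its value vector."""
    cols = len(memory[0])
    return [
        sum(variable[i] * memory[i][j] for i in range(len(variable)))
        for j in range(cols)
    ]

# Hypothetical orthonormal codes for two variable names and two values.
x, y = [1.0, 0.0], [0.0, 1.0]
three, seven = [1.0, 0.0], [0.0, 1.0]

memory = add(outer(x, three), outer(y, seven))  # stores x = three and y = seven
print(unbind(memory, x))  # [1.0, 0.0]
print(unbind(memory, y))  # [0.0, 1.0]
```

    Because the variable vectors are orthonormal, contracting with `x` cancels the term stored for `y`, so each association is recovered without interference.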

    Read more →