The Quantum Thief is the debut science fiction novel by Finnish writer Hannu Rajaniemi and the first novel in a trilogy featuring the character of Jean le Flambeur; the sequels are The Fractal Prince (2012) and The Causal Angel (2014). The novel was published in Britain by Gollancz in 2010, and by Tor in 2011 in the US. It is a heist story, set in a futuristic Solar System, that features a protagonist modeled on Arsène Lupin, the gentleman thief of Maurice Leblanc. The novel was nominated for the 2011 Locus Award for Best First Novel, and was second runner-up for the 2011 Campbell Memorial Award. == Setting == Several centuries after the technological singularity largely destroyed Earth, various posthuman factions compete for dominance in the Solar System. Though sentient superintelligent AGI has never been successfully developed, civilization has been greatly transformed by the proliferation of Hansonian brain emulations (termed "gogols" in reference to Nikolai Gogol, and in particular his novel Dead Souls). An alliance of powerful gogol copies rule the inner system from computronium megastructures housing trillions of virtual minds, laboring to resurrect the dead in religious devotion to the philosophy of Nikolai Fedorov. This alliance, the Sobornost, has been in conflict with a community of quantum entangled minds who adhere to the "no-cloning" principle of quantum information theory, and so do not see the Sobornost's ultimate goal as resurrection, but death. Most of this community, the Zoku, was devastated when Jupiter was destroyed with a weaponized gravitational singularity. Among the last remnants of near-baseline humanity exist on the mobile cities of Mars, where advanced cryptography and an obsessive privacy culture ensure that the Sobornost cannot upload their citizens' minds. The most notable of these cities is the Oubliette, where time is used as a currency. When a citizen's balance reaches zero their mind is transferred to a robotic body to serve the needs of the city for a set period, before being returned to their original body with a restored balance of time. == Plot summary == Countless gogols of the legendary gentleman thief Jean Le Flambeur are trapped in a virtual Sobornost prison in orbit around Neptune, playing an iterated prisoner's dilemma until his mind learns to cooperate. A warrior from the Oort Cloud, which has been settled by Finnish colonists, successfully retrieves one of the Le Flambeur gogols and uploads it into a real-space body. Acting on behalf of a competing Sobornost authority, this Oortian, Mieli, ferries the thief to the Martian city known as The Oubliette, where he has stored his memories for later recovery. The two intend to recover his memories so that he may return to an operating capacity sufficient to serve his Sobornost benefactor in a theft and repay his liberation. On the Oubliette, the young detective Isidore Beautrelet helps vigilantes catch Sobornost agents illicitly uploading human minds. These vigilantes are revealed to be in the service of a local colony of Zoku. Beautrelet is employed to investigate the arrival of Le Flambeur, and in the process becomes aware that the Oubliette's cryptographic security was always compromised. The memories of its citizens are fabrications, and the "King of Mars" long believed ousted in a revolution, still reigns behind the scenes. This King, who is another copy of Jean Le Flambeur, is defeated in the ensuing conflict. Le Flambeur fails to recover all of his memories, which he had locked with a quantum entangled revolver that required him to kill several of his old friends to open his stored memory. He and Mieli escape a liberated Mars having recovered only a mysterious "Schrödinger’s Box" from the Memory Palace. == Themes == Themes central to The Quantum Thief are the unreliability and malleability of memory and the effects of extreme longevity on an individual's perspective and personality. Prisons, surveillance and control in society are also major themes. In the book, the people living in the Oubliette society on Mars have two types of memory; in addition to a traditional, personal memory, there is the exomemory, which can be accessed by other people, from anywhere in the city. Memories about personal experiences can be stored in the exomemory and partitioned, with different levels of access granted to different people. These memories can be used, among other things, as an expedient form of communication. The Oubliette society has an economy where time is used as currency. When an individual's time is expended, their consciousness is uploaded into a "Quiet". The Quiet are mute machine servants who maintain and protect the city. Although the quiet seem to have little interest in the world outside their occupations, they do seem to retain some traces of their former personalities and memories. The conspiracy central to the plot involves the hidden rulers, called the "cryptarchs", manipulating and abusing the exomemory and through the citizens' transformations to quiet and back, the traditional memory as well. In the book, the Oubliette society is compared to a panopticon; a prison, where every action of the dwellers can be scrutinized. == History and influences == The first chapter of The Quantum Thief was presented by Rajaniemi's literary agent, John Jarrold, to Gollancz as the basis for the three-book deal that was eventually secured. Rajaniemi has stated that he had "come up with an outline that had every single idea I could cram into it, because I wanted to be worthy of what had happened." The outline eventually expanded into three parts, and the first part became The Quantum Thief. The novel's plot was inspired by one of Rajaniemi's favorite characters in fiction, Maurice Leblanc's gentleman thief Arsène Lupin, who operates on both sides of the law. What intrigued Rajaniemi were the cycles of redemption and relapse Lupin goes through as he tries to go straight, always falling short. Besides LeBlanc, Rajaniemi mentioned Roger Zelazny as a strong influence. Ian McDonald was the other science fiction author he mentioned as influential, plus Frances A.Yates's book The Art of Memory, for memory palaces. In an interview, Rajaniemi said he wasn't trying to write the novel as hard science fiction: "For me, the more important consequence of having a scientific background is a degree of speculative rigour: trying hard to work out the consequences of the assumptions one begins with." == Reception == The novel has received generally positive reviews. Gary K. Wolfe writes in his Locus review that Rajaniemi has "spectacularly delivered on the promise that this is likely the most important debut SF novel we'll see this year". James Lovegrove, reviewing the book in his Financial Times column, notes that "many an anglophone author would kill to turn out prose half as good as this, especially on their maiden effort." Eric Brown, reviewing for The Guardian, finds the novel to be "a brilliant debut", while alluding to the "apocryphal" (and incorrect) myth that "this novel sold on the strength of its first line." Sam Bandah, at SciFiNow, praises the novel for "its engaging narrative and characters backed by often almost intimidatingly good sci-fi concepts." Criticism for the novel has generally centred on Rajaniemi's sparse "show, don't tell" writing style. Brown notes that "the author makes no concessions to the lazy reader with info-dumps or convenient explanations." Niall Alexander, of the Speculative Scotsman, states that "had there been some sort of index, [he] would have gladly (and repeatedly) referred to it during the mind-boggling first third of The Quantum Thief", while proclaiming the novel to be "the sci-fi debut of 2010." == Awards == Nominee for the 2011 Locus Award for Best First Novel. Third place for the 2011 John W. Campbell Memorial Award for Best Science Fiction Novel
Clubdjpro
ClubDJPro (often referred to as ClubDJ) is a DJ console and video mixing tool developed by Cube Software Solutions Inc. software. It was released in June 2005. == User interface == ClubDJPro has a GUI that was designed to allow aesthetic revisions via Skins. The skin engine that ClubDJPro uses allows for the ability to expand the software to take up the entire screen. As of 4.4.3.3 there are 3 user changeable skins included in the program which are changeable in the preferences tab. They are called 'AquaLung', 'Eleanor', and 'Grabber'. == Editions == ClubDJPro is available in two different editions, with separate features depending upon their target consumer group. DJ Edition - Can play audio files only. VJ Edition - Contains all of the features of the DJ Edition, in addition to support for video, karaoke, and visualizations. == Supported MIDI Controllers == Supported since version 2.0: Hercules Console Hercules Console MK2 Hercules Control MP3 PCDJ DAC-2 Controller == History == The initial "final release" of ClubDJPro was released on June 24, 2005. On June 26, 2009, the 4th iteration of the ClubDJPro software was released. The development of the software and website appears to have halted. As of March 2018 the website continues to show a new version "Coming Spring 2016".
Information gain (decision tree)
In the context of decision trees in information theory and machine learning, information gain refers to the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one. (In broader contexts, information gain can also be used as a synonym for either Kullback–Leibler divergence or mutual information, but the focus of this article is on the more narrow meaning below.) Explicitly, the information gain of a random variable X {\displaystyle X} obtained from an observation of a random variable A {\displaystyle A} taking value a {\displaystyle a} is defined as: I G ( X , a ) = D KL ( P X ∣ a ∥ P X ) {\displaystyle {\mathit {IG}}(X,a)=D_{\text{KL}}{\bigl (}P_{X\mid a}\parallel P_{X}{\bigr )}} In other words, it is the Kullback–Leibler divergence of P X ( x ) {\displaystyle P_{X}(x)} (the prior distribution for X {\displaystyle X} ) from P X ∣ a ( x ) {\displaystyle P_{X\mid a}(x)} (the posterior distribution for X {\displaystyle X} given A = a {\displaystyle A=a} ). The expected value of the information gain is the mutual information I ( X ; A ) {\displaystyle I(X;A)} : E A [ I G ( X , A ) ] = I ( X ; A ) {\displaystyle \operatorname {E} _{A}[{\mathit {IG}}(X,A)]=I(X;A)} i.e. the reduction in the entropy of X {\displaystyle X} achieved by learning the state of the random variable A {\displaystyle A} . In machine learning, this concept can be used to define a preferred sequence of attributes to investigate to most rapidly narrow down the state of X. Such a sequence (which depends on the outcome of the investigation of previous attributes at each stage) is called a decision tree, and when applied in the area of machine learning is known as decision tree learning. Usually an attribute with high mutual information should be preferred to other attributes. == General definition == In general terms, the expected information gain is the reduction in information entropy Η from a prior state to a state that takes some information as given: I G ( T , a ) = H ( T ) − H ( T | a ) , {\displaystyle IG(T,a)=\mathrm {H} {(T)}-\mathrm {H} {(T|a)},} where H ( T | a ) {\displaystyle \mathrm {H} {(T|a)}} is the conditional entropy of T {\displaystyle T} given the value of attribute a {\displaystyle a} . This is intuitively plausible when interpreting entropy Η as a measure of uncertainty of a random variable T {\displaystyle T} : by learning (or assuming) a {\displaystyle a} about T {\displaystyle T} , our uncertainty about T {\displaystyle T} is reduced (i.e. I G ( T , a ) {\displaystyle IG(T,a)} is positive), unless of course T {\displaystyle T} is independent of a {\displaystyle a} , in which case H ( T | a ) = H ( T ) {\displaystyle \mathrm {H} (T|a)=\mathrm {H} (T)} , meaning I G ( T , a ) = 0 {\displaystyle IG(T,a)=0} . == Formal definition == Let T denote a set of training examples, each of the form ( x , y ) = ( x 1 , x 2 , x 3 , . . . , x k , y ) {\displaystyle ({\textbf {x}},y)=(x_{1},x_{2},x_{3},...,x_{k},y)} where x a ∈ v a l s ( a ) {\displaystyle x_{a}\in \mathrm {vals} (a)} is the value of the a th {\displaystyle a^{\text{th}}} attribute or feature of example x {\displaystyle {\textbf {x}}} and y is the corresponding class label. The information gain for an attribute a is defined in terms of Shannon entropy H ( − ) {\displaystyle \mathrm {H} (-)} as follows. For a value v taken by attribute a, let S a ( v ) = { x ∈ T | x a = v } {\displaystyle S_{a}{(v)}=\{{\textbf {x}}\in T|x_{a}=v\}} be defined as the set of training inputs of T for which attribute a is equal to v. Then the information gain of T for attribute a is the difference between the a priori Shannon entropy H ( T ) {\displaystyle \mathrm {H} (T)} of the training set and the conditional entropy H ( T | a ) {\displaystyle \mathrm {H} {(T|a)}} . H ( T | a ) = ∑ v ∈ v a l s ( a ) | S a ( v ) | | T | ⋅ H ( S a ( v ) ) . {\displaystyle \mathrm {H} (T|a)=\sum _{v\in \mathrm {vals} (a)}{{\frac {|S_{a}{(v)}|}{|T|}}\cdot \mathrm {H} \left(S_{a}{\left(v\right)}\right)}.} I G ( T , a ) = H ( T ) − H ( T | a ) {\displaystyle IG(T,a)=\mathrm {H} (T)-\mathrm {H} (T|a)} The mutual information is equal to the total entropy for an attribute if for each of the attribute values a unique classification can be made for the result attribute. In this case, the relative entropies subtracted from the total entropy are 0. In particular, the values v ∈ v a l s ( a ) {\displaystyle v\in vals(a)} defines a partition of the training set data T into mutually exclusive and all-inclusive subsets, inducing a categorical probability distribution P a ( v ) {\textstyle P_{a}{(v)}} on the values v ∈ v a l s ( a ) {\textstyle v\in vals(a)} of attribute a. The distribution is given P a ( v ) := | S a ( v ) | | T | {\textstyle P_{a}{(v)}:={\frac {|S_{a}{(v)}|}{|T|}}} . In this representation, the information gain of T given a can be defined as the difference between the unconditional Shannon entropy of T and the expected entropy of T conditioned on a, where the expectation value is taken with respect to the induced distribution on the values of a. I G ( T , a ) = H ( T ) − ∑ v ∈ v a l s ( a ) P a ( v ) H ( S a ( v ) ) = H ( T ) − E P a [ H ( S a ( v ) ) ] = H ( T ) − H ( T | a ) . {\displaystyle {\begin{alignedat}{2}IG(T,a)&=\mathrm {H} (T)-\sum _{v\in \mathrm {vals} (a)}{P_{a}{(v)}\mathrm {H} \left(S_{a}{(v)}\right)}\\&=\mathrm {H} (T)-\mathbb {E} _{P_{a}}{\left[\mathrm {H} {(S_{a}{(v)})}\right]}\\&=\mathrm {H} (T)-\mathrm {H} {(T|a)}.\end{alignedat}}} == Example == In engineering applications, information is analogous to signal, and entropy is analogous to noise. It determines how a decision tree chooses to split data. The leftmost figure below is very impure and has high entropy corresponding to higher disorder and lower information value. As we go to the right, the entropy decreases, and the information value increases. Now, it is clear that information gain is the measure of how much information a feature provides about a class. Let's visualize information gain in a decision tree as shown in the right: The node t is the parent node, and the sub-nodes tL and tR are child nodes. In this case, the parent node t has a collection of cancer and non-cancer samples denoted as C and NC respectively. We can use information gain to determine how good the splitting of nodes is in a decision tree. In terms of entropy, information gain is defined as: To understand this idea, let's start by an example in which we create a simple dataset and want to see if gene mutations could be related to patients with cancer. Given four different gene mutations, as well as seven samples, the training set for a decision can be created as follows: In this dataset, a 1 means the sample has the mutation (True), while a 0 means the sample does not (False). A sample with C denotes that it has been confirmed to be cancerous, while NC means it is non-cancerous. Using this data, a decision tree can be created with information gain used to determine the candidate splits for each node. For the next step, the entropy at parent node t of the above simple decision tree is computed as:H(t) = −[pC,t log2(pC,t) + pNC,t log2(pNC,t)] where, probability of selecting a class ‘C’ sample at node t, pC,t = n(t, C) / n(t), probability of selecting a class ‘NC’ sample at node t, pNC,t = n(t, NC) / n(t), n(t), n(t, C), and n(t, NC) are the number of total samples, ‘C’ samples and ‘NC’ samples at node t respectively.Using this with the example training set, the process for finding information gain beginning with H ( t ) {\displaystyle \mathrm {H} {(t)}} for Mutation 1 is as follows: pC, t = 4/7 pNC, t = 3/7 H ( t ) {\displaystyle \mathrm {H} {(t)}} = −(4/7 × log2(4/7) + 3/7 × log2(3/7)) = 0.985 Note: H ( t ) {\displaystyle \mathrm {H} {(t)}} will be the same for all mutations at the root. The relatively high value of entropy H ( t ) = 0.985 {\displaystyle \mathrm {H} {(t)}=0.985} (1 is the optimal value) suggests that the root node is highly impure and the constituents of the input at the root node would look like the leftmost figure in the above Entropy Diagram. However, such a set of data is good for learning the attributes of the mutations used to split the node. At a certain node, when the homogeneity of the constituents of the input occurs (as shown in the rightmost figure in the above Entropy Diagram), the dataset would no longer be good for learning. Moving on, the entropy at left and right child nodes of the above decision tree is computed using the formulae:H(tL) = −[pC,L log2(pC,L) + pNC,L log2(pNC,L)]H(tR) = −[pC,R log2(pC,R) + pNC,R log2(pNC,R)]where, probability of selecting a class ‘C’ sample at the left child node, pC,L = n(tL, C) / n(tL), probability of selecting a class ‘NC’ sample at the left child node, pNC,L = n(tL, NC) / n(tL), probability of selecting a class ‘C’ sample at the right child node, pC,R = n(tR, C) / n(tR), prob
Sum of absolute transformed differences
The sum of absolute transformed differences (SATD) is a block matching criterion widely used in fractional motion estimation for video compression. It works by taking a frequency transform, usually a Hadamard transform, of the differences between the pixels in the original block and the corresponding pixels in the block being used for comparison. The transform itself is often of a small block rather than the entire macroblock. For example, in x264, a series of 4×4 blocks are transformed rather than doing the more processor-intensive 16×16 transform. == Comparison to other metrics == SATD is slower than the sum of absolute differences (SAD), both due to its increased complexity and the fact that SAD-specific MMX and SSE2 instructions exist, while there are no such instructions for SATD. However, SATD can still be optimized considerably with SIMD instructions on most modern CPUs. The benefit of SATD is that it more accurately models the number of bits required to transmit the residual error signal. As such, it is often used in video compressors, either as a way to drive and estimate rate explicitly, such as in the Theora encoder (since 1.1 alpha2), as an optional metric used in wide motion searches, such as in the Microsoft VC-1 encoder, or as a metric used in sub-pixel refinement, such as in x264.
Spiking neural network
Spiking neural networks (SNNs) are artificial neural networks (ANN) that mimic natural neural networks. These models leverage timing of discrete spikes as the main information carrier. In addition to neuronal and synaptic state, SNNs incorporate the concept of time into their operating model. The idea is that neurons in the SNN do not transmit information at each propagation cycle (as it happens with typical multi-layer perceptron networks), but rather transmit information only when a membrane potential—an intrinsic quality of the neuron related to its membrane electrical charge—reaches a specific value, called the threshold. When the membrane potential reaches the threshold, the neuron fires, and generates a signal that travels to other neurons which, in turn, increase or decrease their potentials in response to this signal. A neuron model that fires at the moment of threshold crossing is also called a spiking neuron model. While spike rates can be considered the analogue of the variable output of a traditional ANN, neurobiology research indicated that high speed processing cannot be performed solely through a rate-based scheme. For example humans can perform an image recognition task requiring no more than 10ms of processing time per neuron through the successive layers (going from the retina to the temporal lobe). This time window is too short for rate-based encoding. The precise spike timings in a small set of spiking neurons also has a higher information coding capacity compared with a rate-based approach. The most prominent spiking neuron model is the leaky integrate-and-fire model. In that model, the momentary activation level (modeled as a differential equation) is normally considered to be the neuron's state, with incoming spikes pushing this value higher or lower, until the state eventually either decays or—if the firing threshold is reached—the neuron fires. After firing, the state variable is reset to a lower value. Various decoding methods exist for interpreting the outgoing spike train as a real-value number, relying on either the frequency of spikes (rate-code), the time-to-first-spike after stimulation, or the interval between spikes. == History == Many multi-layer artificial neural networks are fully connected, receiving input from every neuron in the previous layer and signalling every neuron in the subsequent layer. Although these networks have achieved breakthroughs, they do not match biological networks and do not mimic neurons. The biology-inspired Hodgkin–Huxley model of a spiking neuron was proposed in 1952. This model described how action potentials are initiated and propagated. Communication between neurons, which requires the exchange of chemical neurotransmitters in the synaptic gap, is described in models such as the integrate-and-fire model, FitzHugh–Nagumo model (1961–1962), and Hindmarsh–Rose model (1984). The leaky integrate-and-fire model (or a derivative) is commonly used as it is easier to compute than Hodgkin–Huxley. While the notion of an artificial spiking neural network became popular only in the twenty-first century, studies between 1980 and 1995 supported the concept. The first models of this type of ANN appeared to simulate non-algorithmic intelligent information processing systems. However, the notion of the spiking neural network as a mathematical model was first worked on in the early 1970s. As of 2019 SNNs lagged behind ANNs in accuracy, but the gap is decreasing, and has vanished on some tasks. == Underpinnings == Information in the brain is represented as action potentials (neuron spikes), which may group into spike trains or coordinated waves. A fundamental question of neuroscience is to determine whether neurons communicate by a rate or temporal code. Temporal coding implies that a single spiking neuron can replace hundreds of hidden units on a conventional neural net. SNNs define a neuron's current state as its potential (possibly modeled as a differential equation). An input pulse causes the potential to rise and then gradually decline. Encoding schemes can interpret these pulse sequences as a number, considering pulse frequency and pulse interval. Using the precise time of pulse occurrence, a neural network can consider more information and offer better computing properties. SNNs compute in the continuous domain. Such neurons test for activation only when their potentials reach a certain value. When a neuron is activated, it produces a signal that is passed to connected neurons, accordingly raising or lowering their potentials. The SNN approach produces a continuous output instead of the binary output of traditional ANNs. Pulse trains are not easily interpretable, hence the need for encoding schemes. However, a pulse train representation may be more suited for processing spatiotemporal data (or real-world sensory data classification). SNNs connect neurons only to nearby neurons so that they process input blocks separately (similar to CNN using filters). They consider time by encoding information as pulse trains so as not to lose information. This avoids the complexity of a recurrent neural network (RNN). Impulse neurons are more powerful computational units than traditional artificial neurons. SNNs are theoretically more powerful than so called "second-generation networks" defined as ANNs "based on computational units that apply activation function with a continuous set of possible output values to a weighted sum (or polynomial) of the inputs"; however, SNN training issues and hardware requirements limit their use. Although unsupervised biologically inspired learning methods are available such as Hebbian learning and STDP, no effective supervised training method is suitable for SNNs that can provide better performance than second-generation networks. Spike-based activation of SNNs is not differentiable, thus gradient descent-based backpropagation (BP) is not available. SNNs have much larger computational costs for simulating realistic neural models than traditional ANNs. Pulse-coupled neural networks (PCNN) are often confused with SNNs. A PCNN can be seen as a kind of SNN. Researchers are actively working on various topics. The first concerns differentiability. The expressions for both the forward- and backward-learning methods contain the derivative of the neural activation function which is not differentiable because a neuron's output is either 1 when it spikes, and 0 otherwise. This all-or-nothing behavior disrupts gradients and makes these neurons unsuitable for gradient-based optimization. Approaches to resolving it include: resorting to entirely biologically inspired local learning rules for the hidden units translating conventionally trained "rate-based" NNs to SNNs smoothing the network model to be continuously differentiable defining an SG (Surrogate Gradient) as a continuous relaxation of the real gradients The second concerns the optimization algorithm. Standard BP can be expensive in terms of computation, memory, and communication and may be poorly suited to the hardware that implements it (e.g., a computer, brain, or neuromorphic device). Incorporating additional neuron dynamics such as Spike Frequency Adaptation (SFA) is a notable advance, enhancing efficiency and computational power. These neurons sit between biological complexity and computational complexity. Originating from biological insights, SFA offers significant computational benefits by reducing power usage, especially in cases of repetitive or intense stimuli. This adaptation improves signal/noise clarity and introduces an elementary short-term memory at the neuron level, which in turn, improves accuracy and efficiency. This was mostly achieved using compartmental neuron models. The simpler versions are of neuron models with adaptive thresholds, are an indirect way of achieving SFA. It equips SNNs with improved learning capabilities, even with constrained synaptic plasticity, and elevates computational efficiency. This feature lessens the demand on network layers by decreasing the need for spike processing, thus lowering computational load and memory access time—essential aspects of neural computation. Moreover, SNNs utilizing neurons capable of SFA achieve levels of accuracy that rival those of conventional ANNs, while also requiring fewer neurons for comparable tasks. This efficiency streamlines the computational workflow and conserves space and energy, while maintaining technical integrity. High-performance deep spiking neural networks can operate with 0.3 spikes per neuron. == Applications == SNNs can in principle be applied to the same applications as traditional ANNs. In addition, SNNs can model the central nervous system of biological organisms, such as an insect seeking food without prior knowledge of the environment. Due to their relative realism, they can be used to study biological neural circuits. Starting with a hypothesis about the topology of a biological neuronal circuit and its functi
Solomonoff's theory of inductive inference
Solomonoff's theory of inductive inference proves that, under its common sense assumptions (axioms), the best possible scientific model is the shortest algorithm that generates the empirical data under consideration. In addition to the choice of data, other assumptions are that, to avoid the post-hoc fallacy, the programming language must be chosen prior to the data and that the environment being observed is generated by an unknown algorithm. This is also called a theory of induction. Due to its basis in the dynamical (state-space model) character of Algorithmic Information Theory, it encompasses statistical as well as dynamical information criteria for model selection. It was introduced by Ray Solomonoff, based on probability theory and theoretical computer science. In essence, Solomonoff's induction derives the posterior probability of any computable theory, given a sequence of observed data. This posterior probability is derived from Bayes' rule and some universal prior, that is, a prior that assigns a positive probability to any computable theory. Solomonoff proved that this induction is incomputable (or more precisely, lower semi-computable), but noted that "this incomputability is of a very benign kind", and that it "in no way inhibits its use for practical prediction" (as it can be approximated from below more accurately with more computational resources). It is only "incomputable" in the benign sense that no scientific consensus is able to prove that the best current scientific theory is the best of all possible theories. However, Solomonoff's theory does provide an objective criterion for deciding among the current scientific theories explaining a given set of observations. Solomonoff's induction naturally formalizes Occam's razor by assigning larger prior credences to theories that require a shorter algorithmic description. == Origin == === Philosophical === The theory is based in philosophical foundations, and was founded by Ray Solomonoff around 1960. It is a mathematically formalized combination of Occam's razor and the Principle of Multiple Explanations. All computable theories which perfectly describe previous observations are used to calculate the probability of the next observation, with more weight put on the shorter computable theories. Marcus Hutter's universal artificial intelligence builds upon this to calculate the expected value of an action. === Principle === Solomonoff's induction has been argued to be the computational formalization of pure Bayesianism. To understand, recall that Bayesianism derives the posterior probability P [ T | D ] {\displaystyle \mathbb {P} [T|D]} of a theory T {\displaystyle T} given data D {\displaystyle D} by applying Bayes rule, which yields P [ T | D ] = P [ D | T ] P [ T ] P [ D | T ] P [ T ] + ∑ A ≠ T P [ D | A ] P [ A ] {\displaystyle \mathbb {P} [T|D]={\frac {\mathbb {P} [D|T]\mathbb {P} [T]}{\mathbb {P} [D|T]\mathbb {P} [T]+\sum _{A\neq T}\mathbb {P} [D|A]\mathbb {P} [A]}}} where theories A {\displaystyle A} are alternatives to theory T {\displaystyle T} . For this equation to make sense, the quantities P [ D | T ] {\displaystyle \mathbb {P} [D|T]} and P [ D | A ] {\displaystyle \mathbb {P} [D|A]} must be well-defined for all theories T {\displaystyle T} and A {\displaystyle A} . In other words, any theory must define a probability distribution over observable data D {\displaystyle D} . Solomonoff's induction essentially boils down to demanding that all such probability distributions be computable. Interestingly, the set of computable probability distributions is a subset of the set of all programs, which is countable. Similarly, the sets of observable data considered by Solomonoff were finite. Without loss of generality, we can thus consider that any observable data is a finite bit string. As a result, Solomonoff's induction can be defined by only invoking discrete probability distributions. Solomonoff's induction then allows to make probabilistic predictions of future data F {\displaystyle F} , by simply obeying the laws of probability. Namely, we have P [ F | D ] = E T [ P [ F | T , D ] ] = ∑ T P [ F | T , D ] P [ T | D ] {\displaystyle \mathbb {P} [F|D]=\mathbb {E} _{T}[\mathbb {P} [F|T,D]]=\sum _{T}\mathbb {P} [F|T,D]\mathbb {P} [T|D]} . This quantity can be interpreted as the average predictions P [ F | T , D ] {\displaystyle \mathbb {P} [F|T,D]} of all theories T {\displaystyle T} given past data D {\displaystyle D} , weighted by their posterior credences P [ T | D ] {\displaystyle \mathbb {P} [T|D]} . === Mathematical === The proof of the "razor" is based on the known mathematical properties of a probability distribution over a countable set. These properties are relevant because the infinite set of all programs is a denumerable set. The sum S of the probabilities of all programs must be exactly equal to one (as per the definition of probability) thus the probabilities must roughly decrease as we enumerate the infinite set of all programs, otherwise S will be strictly greater than one. To be more precise, for every ϵ {\displaystyle \epsilon } > 0, there is some length l such that the probability of all programs longer than l is at most ϵ {\displaystyle \epsilon } . This does not, however, preclude very long programs from having very high probability. Fundamental ingredients of the theory are the concepts of algorithmic probability and Kolmogorov complexity. The universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion. == Mathematical guarantees == === Solomonoff's completeness === The remarkable property of Solomonoff's induction is its completeness. In essence, the completeness theorem guarantees that the expected cumulative errors made by the predictions based on Solomonoff's induction are upper-bounded by the Kolmogorov complexity of the (stochastic) data generating process. The errors can be measured using the Kullback–Leibler divergence or the square of the difference between the induction's prediction and the probability assigned by the (stochastic) data generating process. === Solomonoff's uncomputability === Unfortunately, Solomonoff also proved that Solomonoff's induction is uncomputable. In fact, he showed that computability and completeness are mutually exclusive: any complete theory must be uncomputable. The proof of this is derived from a game between the induction and the environment. Essentially, any computable induction can be tricked by a computable environment, by choosing the computable environment that negates the computable induction's prediction. This fact can be regarded as an instance of the no free lunch theorem. == Modern applications == === Artificial intelligence === Though Solomonoff's inductive inference is not computable, several AIXI-derived algorithms approximate it in order to make it run on a modern computer. The more computing power they are given, the closer their predictions are to the predictions of inductive inference (their mathematical limit is Solomonoff's inductive inference). Another direction of inductive inference is based on E. Mark Gold's model of learning in the limit from 1967 and has developed since then more and more models of learning. The general scenario is the following: Given a class S of computable functions, is there a learner (that is, recursive functional) which for any input of the form (f(0),f(1),...,f(n)) outputs a hypothesis (an index e with respect to a previously agreed on acceptable numbering of all computable functions; the indexed function may be required consistent with the given values of f). A learner M learns a function f if almost all its hypotheses are the same index e, which generates the function f; M learns S if M learns every f in S. Basic results are that all recursively enumerable classes of functions are learnable while the class REC of all computable functions is not learnable. Many related models have been considered and also the learning of classes of recursively enumerable sets from positive data is a topic studied from Gold's pioneering paper in 1967 onwards. A far reaching extension of the Gold’s approach is developed by Schmidhuber's theory of generalized Kolmogorov complexities, which are kinds of super-recursive algorithms.
Crossover (evolutionary algorithm)
Crossover in evolutionary algorithms and evolutionary computation, also called recombination, is a genetic operator used to combine the genetic information of two parents to generate new offspring. It is one way to stochastically generate new solutions from an existing population, and is analogous to the crossover that happens during sexual reproduction in biology. New solutions can also be generated by cloning an existing solution, which is analogous to asexual reproduction. Newly generated solutions may be mutated before being added to the population. The aim of recombination is to transfer good characteristics from two different parents to one child. Different algorithms in evolutionary computation may use different data structures to store genetic information, and each genetic representation can be recombined with different crossover operators. Typical data structures that can be recombined with crossover are bit arrays, vectors of real numbers, or trees. The list of operators presented below is by no means complete and serves mainly as an exemplary illustration of this dyadic genetic operator type. More operators and more details can be found in the literature. == Crossover for binary arrays == Traditional genetic algorithms store genetic information in a chromosome represented by a bit array. Crossover methods for bit arrays are popular and an illustrative example of genetic recombination. === One-point crossover === A point on both parents' chromosomes is picked randomly, and designated a 'crossover point'. Bits to the right of that point are swapped between the two parent chromosomes. This results in two offspring, each carrying some genetic information from both parents. === Two-point and k-point crossover === In two-point crossover, two crossover points are picked randomly from the parent chromosomes. The bits in between the two points are swapped between the parent organisms. Two-point crossover is equivalent to performing two single-point crossovers with different crossover points. This strategy can be generalized to k-point crossover for any positive integer k, picking k crossover points. === Uniform crossover === In uniform crossover, typically, each bit is chosen from either parent with equal probability. Other mixing ratios are sometimes used, resulting in offspring which inherit more genetic information from one parent than the other. In a uniform crossover, we don’t divide the chromosome into segments, rather we treat each gene separately. In this, we essentially flip a coin for each chromosome to decide whether or not it will be included in the off-spring. == Crossover for integer or real-valued genomes == For the crossover operators presented above and for most other crossover operators for bit strings, it holds that they can also be applied accordingly to integer or real-valued genomes whose genes each consist of an integer or real-valued number. Instead of individual bits, integer or real-valued numbers are then simply copied into the child genome. The offspring lie on the remaining corners of the hyperbody spanned by the two parents P 1 = ( 1.5 , 6 , 8 ) {\displaystyle P_{1}=(1.5,6,8)} and P 2 = ( 7 , 2 , 1 ) {\displaystyle P_{2}=(7,2,1)} , as exemplified in the accompanying image for the three-dimensional case. === Discrete recombination === If the rules of the uniform crossover for bit strings are applied during the generation of the offspring, this is also called discrete recombination. === Intermediate recombination === In this recombination operator, the allele values of the child genome a i {\displaystyle a_{i}} are generated by mixing the alleles of the two parent genomes a i , P 1 {\displaystyle a_{i,P_{1}}} and a i , P 2 {\displaystyle a_{i,P_{2}}} : α i = α i , P 1 ⋅ β i + α i , P 2 ⋅ ( 1 − β i ) w i t h β i ∈ [ − d , 1 + d ] {\displaystyle \alpha _{i}=\alpha _{i,P_{1}}\cdot \beta _{i}+\alpha _{i,P_{2}}\cdot \left(1-\beta _{i}\right)\quad {\mathsf {with}}\quad \beta _{i}\in \left[-d,1+d\right]} randomly equally distributed per gene i {\displaystyle i} The choice of the interval [ − d , 1 + d ] {\displaystyle [-d,1+d]} causes that besides the interior of the hyperbody spanned by the allele values of the parent genes additionally a certain environment for the range of values of the offspring is in question. A value of 0.25 {\displaystyle 0.25} is recommended for d {\displaystyle d} to counteract the tendency to reduce the allele values that otherwise exists at d = 0 {\displaystyle d=0} . The adjacent figure shows for the two-dimensional case the range of possible new alleles of the two exemplary parents P 1 = ( 3 , 6 ) {\displaystyle P_{1}=(3,6)} and P 2 = ( 9 , 2 ) {\displaystyle P_{2}=(9,2)} in intermediate recombination. The offspring of discrete recombination C 1 {\displaystyle C_{1}} and C 2 {\displaystyle C_{2}} are also plotted. Intermediate recombination satisfies the arithmetic calculation of the allele values of the child genome required by virtual alphabet theory. Discrete and intermediate recombination are used as a standard in the evolution strategy. == Crossover for permutations == For combinatorial tasks, permutations are usually used that are specifically designed for genomes that are themselves permutations of a set. The underlying set is usually a subset of N {\displaystyle \mathbb {N} } or N 0 {\displaystyle \mathbb {N} _{0}} . If 1- or n-point or uniform crossover for integer genomes is used for such genomes, a child genome may contain some values twice and others may be missing. This can be remedied by genetic repair, e.g. by replacing the redundant genes in positional fidelity for missing ones from the other child genome. In order to avoid the generation of invalid offspring, special crossover operators for permutations have been developed which fulfill the basic requirements of such operators for permutations, namely that all elements of the initial permutation are also present in the new one and only the order is changed. It can be distinguished between combinatorial tasks, where all sequences are admissible, and those where there are constraints in the form of inadmissible partial sequences. A well-known representative of the first task type is the traveling salesman problem (TSP), where the goal is to visit a set of cities exactly once on the shortest tour. An example of the constrained task type is the scheduling of multiple workflows. Workflows involve sequence constraints on some of the individual work steps. For example, a thread cannot be cut until the corresponding hole has been drilled in a workpiece. Such problems are also called order-based permutations. In the following, two crossover operators are presented as examples, the partially mapped crossover (PMX) motivated by the TSP and the order crossover (OX1) designed for order-based permutations. A second offspring can be produced in each case by exchanging the parent chromosomes. === Partially mapped crossover (PMX) === The PMX operator was designed as a recombination operator for TSP like Problems. The explanation of the procedure is illustrated by an example: === Order crossover (OX1) === The order crossover goes back to Davis in its original form and is presented here in a slightly generalized version with more than two crossover points. It transfers information about the relative order from the second parent to the offspring. First, the number and position of the crossover points are determined randomly. The resulting gene sequences are then processed as described below: Among other things, order crossover is well suited for scheduling multiple workflows, when used in conjunction with 1- and n-point crossover. === Further crossover operators for permutations === Over time, a large number of crossover operators for permutations have been proposed, so the following list is only a small selection. For more information, the reader is referred to the literature. cycle crossover (CX) order-based crossover (OX2) position-based crossover (POS) edge recombination voting recombination (VR) alternating-positions crossover (AP) maximal preservative crossover (MPX) merge crossover (MX) sequential constructive crossover operator (SCX) The usual approach to solving TSP-like problems by genetic or, more generally, evolutionary algorithms, presented earlier, is either to repair illegal descendants or to adjust the operators appropriately so that illegal offspring do not arise in the first place. Alternatively, Riazi suggests the use of a double chromosome representation, which avoids illegal offspring.