AI Content Is Getting Out Of Hand

  • Ericom Connect

    Ericom Connect

    Ericom Connect is a remote access and application publishing solution produced by Ericom Software that provides secure, centrally managed access to physical or hosted desktops and applications running on Microsoft Windows and Linux systems.

    == Product overview ==

    Ericom Connect is desktop virtualization and application virtualization software that allows users to run applications remotely, without installing them on the local computer or device. The software is noted for its scalability, ease of deployment, and compatibility with any type of infrastructure, cloud or physical.

    Ericom Connect uses AccessPad (native client for desktops), AccessToGo (native client for mobile), or AccessNow, one of the first HTML5 RDP solutions to support clientless access to Windows desktops and applications from any device with an HTML5-compatible browser, including Macintosh computers, mobile devices, and Google Chromebooks.

    Other notable features include performance monitoring, built-in real-time analytics and BI, support for two-factor authentication (using RSA SecurID), multi-tenancy and multi-datacenter support via a single unified web interface, and a “Launch Simulation” feature that allows users to visualize and simulate actual step-by-step user processes directly from within the administration console. Configurations, logs, and similar data are distributed across multiple servers, which aids scalability and avoids the single point of failure that can arise when all configuration information is stored on one server.

    == History ==

    Ericom Connect was introduced in 2015 as the successor to Ericom PowerTerm Web Connect. PowerTerm Web Connect used an architecture similar to what was then current with Citrix and VMware, relying on a centralized SQL server, a connection broker, image management for different hypervisors, and a variety of clients. Ericom Connect uses a new grid architecture that provides more scalability, reliability, and flexibility than its predecessor.

  • Jpred

    Jpred

    JPred v.4 is the latest version of the JPred Protein Secondary Structure Prediction Server, which provides predictions by the Jnet algorithm, one of the most accurate methods for secondary structure prediction, and which has existed in different versions since 1998. In addition to protein secondary structure, JPred also predicts solvent accessibility and coiled-coil regions. The JPred service runs up to 134,000 jobs per month and has carried out over 2 million predictions in total for users in 179 countries.

    == JPred 2 ==

    The static HTML pages of JPred 2 are still available for reference.

    == JPred 3 ==

    JPred v3 followed on from previous versions of JPred developed and maintained by James Cuff and Jonathan Barber (see JPred References). This release added new functionality and fixed many bugs. The highlights are:

      • New, friendlier user interface
      • Retrained and optimised version of Jnet (v2), with a mean secondary structure prediction accuracy of >81%
      • Batch submission of jobs
      • Better error checking of input sequences/alignments
      • Predictions now (optionally) returned via e-mail
      • Users may provide their own query names for each submission
      • JPred now makes a prediction even when there are no PSI-BLAST hits to the query
      • PS/PDF output now incorporates all the predictions

    == JPred 4 ==

    The current version of JPred (v4) incorporates the following improvements and updates:

      • Jnet (v2.3.1) retrained on the latest UniRef90 and SCOPe/ASTRAL, with a mean secondary structure prediction accuracy of >82%
      • Web server upgraded to the latest technologies (Bootstrap framework, JavaScript) and web pages updated, improving the design and usability through responsive technologies
      • RESTful API and mass-submission and results-retrieval scripts added, resulting in peak throughput above 20,000 predictions per day
      • Prediction job monitoring tools added
      • Results reporting upgraded, both on the website and through the optional e-mail summary reports: improved batch submission, a results summary preview through Jalview visualization in SVG, and full multiple sequence alignments added to the reports
      • Help pages improved, incorporating tooltips and one-page step-by-step tutorials

    Sequence residues are assigned to one of the secondary structure elements, such as alpha-helix, beta-sheet and coiled-coil. Jnet uses two neural networks for its prediction. The first network is fed with a window of 17 residues over each amino acid in the alignment plus a conservation number. It uses a hidden layer of nine nodes and has three output nodes, one for each secondary structure element. The second network is fed with a window of 19 residues (the result of the first network) plus the conservation number. It also has a hidden layer of nine nodes and three output nodes.
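
    The two-stage layout described above can be sketched in a few lines of Python. This is a shape-level illustration only, with random, untrained weights. The per-residue feature encoding (`profile`), the single conservation value per residue, the zero padding at the sequence ends, the logistic activation, and every function name are assumptions made for the sketch, not details taken from Jnet.

```python
import math
import random


def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))


def make_layer(n_in, n_out, rng):
    # One weight row per output node; the last entry of each row is a bias.
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def apply_layer(layer, x):
    # zip stops at len(x), so row[-1] (the bias) is added separately.
    return [logistic(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in layer]


def windows(rows, width):
    """Flattened, zero-padded window of `width` positions centred on each row."""
    half = width // 2
    pad = [0.0] * len(rows[0])
    padded = [pad] * half + list(rows) + [pad] * half
    return [[v for r in padded[i:i + width] for v in r] for i in range(len(rows))]


def two_stage_predict(profile, conservation, seed=0):
    """Return three scores per residue, one per secondary structure element."""
    rng = random.Random(seed)
    n_feat = len(profile[0])  # assumed size of the per-residue encoding
    # Network 1: 17-residue window plus conservation -> 9 hidden -> 3 outputs.
    hidden1, out1 = make_layer(17 * n_feat + 1, 9, rng), make_layer(9, 3, rng)
    # Network 2: 19-position window over network 1's outputs plus conservation.
    hidden2, out2 = make_layer(19 * 3 + 1, 9, rng), make_layer(9, 3, rng)

    stage1 = [apply_layer(out1, apply_layer(hidden1, w + [c]))
              for w, c in zip(windows(profile, 17), conservation)]
    return [apply_layer(out2, apply_layer(hidden2, w + [c]))
            for w, c in zip(windows(stage1, 19), conservation)]
```

    Calling `two_stage_predict` on a profile of N residues yields N triples of scores. With trained weights, the largest score in each triple would indicate the predicted element for that residue.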

  • Synaptic weight

    Synaptic weight

    In neuroscience and computer science, synaptic weight refers to the strength or amplitude of a connection between two nodes, corresponding in biology to the amount of influence the firing of one neuron has on another. The term is typically used in artificial and biological neural network research.

    == Computation ==

    In a computational neural network, a vector or set of inputs $\mathbf{x}$ and outputs $\mathbf{y}$, or pre- and post-synaptic neurons respectively, are interconnected with synaptic weights represented by the matrix $w$, where for a linear neuron

    $$y_j = \sum_i w_{ji} x_i \quad \text{or} \quad \mathbf{y} = w\mathbf{x},$$

    where the rows of the synaptic matrix represent the vector of synaptic weights for the output indexed by $j$.

    The synaptic weight is changed by using a learning rule, the most basic of which is Hebb's rule, usually stated in biological terms as "neurons that fire together, wire together". Computationally, this means that if a large signal from one of the input neurons results in a large signal from one of the output neurons, then the synaptic weight between those two neurons will increase. The rule is unstable, however, and is typically modified using such variations as Oja's rule, radial basis functions or the backpropagation algorithm.

    == Biology ==

    For biological networks, the effect of synaptic weights is not as simple as for linear neurons or Hebbian learning. However, biophysical models such as BCM theory have seen some success in mathematically describing these networks.

    In the mammalian central nervous system, signal transmission is carried out by interconnected networks of nerve cells, or neurons. For the basic pyramidal neuron, the input signal is carried by the axon, which releases neurotransmitter chemicals into the synapse. These are picked up by the dendrites of the next neuron, which can then generate an action potential, analogous to the output signal in the computational case.

    The synaptic weight in this process is determined by several variable factors:

      • How well the input signal propagates through the axon (see myelination)
      • The amount of neurotransmitter released into the synapse and the amount that can be absorbed in the following cell (determined by the number of AMPA and NMDA receptors on the cell membrane and the amount of intracellular calcium and other ions)
      • The number of such connections made by the axon to the dendrites
      • How well the signal propagates and integrates in the postsynaptic cell

    The changes in synaptic weight that occur are known as synaptic plasticity, and the process behind long-term changes (long-term potentiation and depression) is still poorly understood. Hebb's learning rule was originally applied to biological systems, but has had to undergo many modifications as a number of theoretical and experimental problems came to light.
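
    The computational picture above (a linear neuron, Hebb's rule, and its instability) can be illustrated with a small Python sketch. The update formulas are the textbook forms of Hebb's and Oja's rules. The learning rate, the starting weights, the fixed input vector and the function names are arbitrary choices made for illustration.

```python
import math


def linear_neuron(w, x):
    """y_j = sum_i w[j][i] * x[i]; row j of w holds the weights of output j."""
    return [sum(wji * xi for wji, xi in zip(row, x)) for row in w]


def hebb_step(w, x, lr):
    """Plain Hebbian update: each weight grows with the product of input and output."""
    y = linear_neuron(w, x)
    return [[wji + lr * yj * xi for wji, xi in zip(row, x)] for row, yj in zip(w, y)]


def oja_step(w, x, lr):
    """Oja's rule: the Hebbian term plus a decay that keeps each weight vector bounded."""
    y = linear_neuron(w, x)
    return [[wji + lr * yj * (xi - yj * wji) for wji, xi in zip(row, x)]
            for row, yj in zip(w, y)]


def norm(row):
    return math.sqrt(sum(v * v for v in row))


x = [1.0, 0.5]                     # the same input presented repeatedly
w_hebb = w_oja = [[0.1, 0.1]]      # one output neuron, two inputs
for _ in range(200):
    w_hebb = hebb_step(w_hebb, x, lr=0.05)
    w_oja = oja_step(w_oja, x, lr=0.05)

print(norm(w_hebb[0]))  # keeps growing as more steps are taken
print(norm(w_oja[0]))   # settles close to 1
```

    Under the plain Hebbian update the weight vector grows without bound, which is the instability mentioned above. Under Oja's rule it converges toward a unit-length vector aligned with the input.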

  • Sigmoid function

    Sigmoid function

    A sigmoid function is any mathematical function whose graph has a characteristic S-shaped or sigmoid curve. A common example of a sigmoid function is the logistic function. Other sigmoid functions are given in the Examples section. In some fields, most notably in the context of artificial neural networks, the term "sigmoid function" is used as a synonym for "logistic function". Special cases of sigmoid functions include the Gompertz curve (used in modeling systems that saturate at large values of x) and the ogee curve (used in the spillway of some dams).

    Sigmoid functions have the set of all real numbers as their domain. The return (response) value is commonly monotonically increasing but may be decreasing. Sigmoid functions most often show a return value (y axis) in the range 0 to 1. Another commonly used range is from −1 to 1. There is also the Heaviside step function, which instantaneously transitions between 0 and 1.

    A wide variety of sigmoid functions, including the logistic and hyperbolic tangent functions, have been used as the activation function of artificial neurons. Sigmoid curves are also common in statistics as cumulative distribution functions (which go from 0 to 1), such as the integrals of the logistic density, the normal density, and Student's t probability density functions. The logistic sigmoid function is invertible, and its inverse is the logit function.

    == Theory ==

    In mathematics, a unitary sigmoid function is a bounded sigmoid-type function normalized to the unit range, typically with lower and upper asymptotes at 0 and 1. The theory proposed by Grebenc distinguishes three kinds of unitary sigmoid functions according to their asymptotic behavior and the presence or absence of oscillation near the asymptotes.
    A general form of a unitary sigmoid function is

    $$y = A\,S(f(x)) + B,$$

    where $S$ is an increasing sigmoid function, $f(x)$ is a transformation of the independent variable, and $A$ and $B$ are constants controlling scaling and translation.

    === Classification ===

    ==== 1st kind ====

    A unitary sigmoid function of the first kind is a bounded increasing function that approaches its lower and upper asymptotes monotonically, without oscillation. This class includes many of the standard sigmoid functions used in statistics, biomathematics, and engineering, such as the logistic function and related generalizations.

    ==== 2nd kind ====

    A unitary sigmoid function of the second kind is a bounded increasing function that oscillates near the upper asymptote while preserving an overall sigmoid transition.

    ==== 3rd kind ====

    A unitary sigmoid function of the third kind is a bounded increasing function that oscillates near both the lower and upper asymptotes. These functions retain the global shape of a sigmoid curve but exhibit oscillatory behavior in the vicinity of both limiting states.

    === Taxonomy ===

    The tables below show the taxonomy of unitary sigmoid functions of all three kinds.

    Table 1. Taxonomy matrix with examples of sigmoid functions of the 1st kind
    Table 2. Taxonomy matrix with examples of sigmoid functions of the 2nd kind on the unbounded interval
    Table 3. Taxonomy matrix with examples of sigmoid functions of the 3rd kind

    === Construction methods ===

    The same theory presents a list of 30 methods for constructing sigmoid functions. These include algebraic transformations, integration and convolution methods, constructions from bell-shaped functions, solutions of ordinary and partial differential equations, recursive schemes, stochastic differential equations, feedback systems, and chaotic systems.
      M0: Construction method for sigmoid functions not evident or intuitive
      M1: Inverse of singularity functions
      M2: Sigmoid functions of embedded positive functions
      M3: Raising a sigmoid function to a power
      M4: Exponentiating a sigmoid function
      M5: Symmetric sigmoid functions derived from asymmetric ones
      M6: Sigmoid functions of the reciprocal independent variable
      M7: Embedding a sigmoid function into another function
      M8: Sum of sigmoid functions
      M9: Multiplication of sigmoid functions
      M10: Integral of the product of an increasing and a decreasing function
      M11: Derivation from lambda (bell-shaped) functions
      M12: Integration of a lambda (bell-shaped) function
      M13: Integration of the sum of lambda (bell-shaped) functions
      M14: Integration of the product of two lambda (bell-shaped) functions
      M15: Integration of the difference of two shifted sigmoid functions
      M16: Integration of the product of two shifted sigmoid functions
      M17: Convolution of sigmoid functions
      M18: Integration of the product of a lambda and a sigmoid function
      M19: Solutions of ordinary differential equations
      M20: Solutions of partial differential equations (PDE)
      M21: Solutions of functional differential equations (FDE)
      M22: Sum of a sigmoid function and some derivatives
      M23: Combination of a sigmoid function, its derivative and its integral
      M24: Filtering sigmoid functions
      M25: Special cases of Gauss hypergeometric functions
      M26: Feedback closed-loop systems
      M27: Recursive functions
      M28: Recursive time-delayed feed-forward loops
      M29: Solutions of stochastic differential equations
      M30: Chaotic sigmoid functions

    Consult the reference for more details.

    == Definition ==

    A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a positive derivative at each point.

    == Properties ==

    In general, a sigmoid function is monotonic and has a first derivative that is bell-shaped.
    Conversely, the integral of any continuous, non-negative, bell-shaped function (with one local maximum and no local minimum, unless degenerate) will be sigmoidal. Thus the cumulative distribution functions for many common probability distributions are sigmoidal. One such example is the error function, which is related to the cumulative distribution function of a normal distribution; another is the arctan function, which is related to the cumulative distribution function of a Cauchy distribution.

    A sigmoid function is constrained by a pair of horizontal asymptotes as $x \rightarrow \pm\infty$. A sigmoid function is convex for values less than a particular point, and it is concave for values greater than that point: in many of the examples here, that point is 0.

    == Examples ==

      • Logistic function
        $$f(x) = \frac{1}{1+e^{-x}}$$
      • Hyperbolic tangent (shifted and scaled version of the logistic function, above)
        $$f(x) = \tanh x = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$$
      • Arctangent function
        $$f(x) = \arctan x$$
      • Gudermannian function
        $$f(x) = \operatorname{gd}(x) = \int_{0}^{x} \frac{dt}{\cosh t} = 2\arctan\left(\tanh\left(\frac{x}{2}\right)\right)$$
      • Error function
        $$f(x) = \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^{2}}\,dt$$
      • Generalised logistic function
        $$f(x) = \left(1+e^{-x}\right)^{-\alpha}, \quad \alpha > 0$$
      • Smoothstep function
        $$f(x) = \begin{cases} \left(\int_{0}^{1} \left(1-u^{2}\right)^{N} du\right)^{-1} \int_{0}^{x} \left(1-u^{2}\right)^{N} du, & |x| \leq 1 \\ \operatorname{sgn}(x), & |x| \geq 1 \end{cases} \quad N \in \mathbb{Z},\ N \geq 1$$
      • Some algebraic functions, for example
        $$f(x) = \frac{x}{\sqrt{1+x^{2}}}$$
        and in a more general form
        $$f(x) = \frac{x}{\left(1+|x|^{k}\right)^{1/k}}$$
      • Up to shifts and scaling, many sigmoids are special cases of
        $$f(x) = \varphi(\varphi(x,\beta),\alpha),$$
        where
        $$\varphi(x,\lambda) = \begin{cases} (1-\lambda x)^{1/\lambda}, & \lambda \neq 0 \\ e^{-x}, & \lambda = 0 \end{cases}$$
        is the inverse of the negative Box–Cox transformation, and $\alpha < 1$ and $\beta < 1$ are shape parameters.
      • Smooth transition function normalized to (−1,1):
        $$f(x) = \begin{cases} \dfrac{2}{1+e^{-2m\frac{x}{1-x^{2}}}} - 1, & |x| < 1 \\ \operatorname{sgn}(x), & |x| \geq 1 \end{cases} = \begin{cases} \tanh\left(m\dfrac{x}{1-x^{2}}\right), & |x| < 1 \\ \operatorname{sgn}(x), & |x| \geq 1 \end{cases}$$
        using the hyperbolic tangent mentioned above. Here, $m$ is a free parameter encoding the slope at $x=0$, which must be great
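
    Several of the functions listed under Examples can be evaluated with Python's standard `math` module. The sketch below prints each function far to the left, at the centre, and far to the right, to show the two horizontal asymptotes. It then checks numerically the two relationships stated in the text: tanh as a shifted and scaled logistic function, and logit as the inverse of the logistic function. The helper names and the sample points are arbitrary choices for illustration.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of the logistic function, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def algebraic(x):
    return x / math.sqrt(1.0 + x * x)


SIGMOIDS = {
    "logistic": logistic,
    "tanh": math.tanh,
    "arctan": math.atan,
    "erf": math.erf,
    "x / sqrt(1 + x^2)": algebraic,
}

for name, f in SIGMOIDS.items():
    # Values far to the left, at the centre, and far to the right.
    print(f"{name:>18}: {f(-20.0):+.4f} {f(0.0):+.4f} {f(20.0):+.4f}")

x = 0.7
# tanh is a shifted and scaled logistic: tanh(x) = 2 * logistic(2x) - 1.
tanh_gap = abs(math.tanh(x) - (2.0 * logistic(2.0 * x) - 1.0))
# logit undoes logistic.
logit_gap = abs(logit(logistic(x)) - x)
print(tanh_gap < 1e-12, logit_gap < 1e-12)
```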

  • Confidential computing

    Confidential computing

    Confidential computing is a security and privacy-enhancing computational technique focused on protecting data in use. Confidential computing can be used in conjunction with storage and network encryption, which protect data at rest and data in transit respectively. It is designed to address software, protocol, cryptographic, and basic physical and supply-chain attacks, although some critics have demonstrated architectural and side-channel attacks effective against the technology.

    The technology protects data in use by performing computations in a hardware-based trusted execution environment (TEE). Confidential data is released to the TEE only once it is assessed to be trustworthy. Different types of confidential computing define the level of data isolation used, whether virtual machine, application, or function, and the technology can be deployed in on-premise data centers, edge locations, or the public cloud. It is often compared with other privacy-enhancing computational techniques such as fully homomorphic encryption, secure multi-party computation, and Trusted Computing.

    Confidential computing is promoted by the Confidential Computing Consortium (CCC) industry group, whose membership includes major providers of the technology.

    == Properties ==

    Trusted execution environments (TEEs) "prevent unauthorized access or modification of applications and data while they are in use, thereby increasing the security level of organizations that manage sensitive and regulated data". Trusted execution environments can be instantiated on a computer's processing components such as a central processing unit (CPU) or a graphics processing unit (GPU). In their various implementations, TEEs can provide different levels of isolation including virtual machine, individual application, or compute functions.
    Typically, data in use in a computer's compute components and memory exists in a decrypted state and can be vulnerable to examination or tampering by unauthorized software or administrators. According to the CCC, confidential computing protects data in use through a minimum of three properties:

      • Data confidentiality: "Unauthorized entities cannot view data while it is in use within the TEE".
      • Data integrity: "Unauthorized entities cannot add, remove, or alter data while it is in use within the TEE".
      • Code integrity: "Unauthorized entities cannot add, remove, or alter code executing in the TEE".

    In addition to trusted execution environments, remote cryptographic attestation is an essential part of confidential computing. The attestation process assesses the trustworthiness of a system and helps ensure that confidential data is released to a TEE only after it presents verifiable evidence that it is genuine and operating with an acceptable security posture. It allows the verifying party to assess the trustworthiness of a confidential computing environment through an "authentic, accurate, and timely report about the software and data state" of that environment. "Hardware-based attestation schemes rely on a trusted hardware component and associated firmware to execute attestation routines in a secure environment". Without attestation, a compromised system could deceive others into trusting it, claim it is running certain software in a TEE, and potentially compromise the confidentiality or integrity of the data being processed or the integrity of the trusted code.

    == Technical approaches ==

    Technical approaches to confidential computing may vary in which software, infrastructure and administrator elements are allowed to access confidential data. The "trust boundary," which circumscribes a trusted computing base (TCB), defines which elements have the potential to access confidential data, whether they are acting benignly or maliciously.
    Confidential computing implementations enforce the defined trust boundary at a specific level of data isolation. The three main types of confidential computing are:

      • Virtual machine isolation
      • Application isolation, also known as process isolation
      • Function isolation, also known as library isolation

    Virtual machine isolation removes the elements controlled by the computer infrastructure or cloud provider, but allows potential data access by elements inside a virtual machine running on the infrastructure. Application or process isolation permits data access only by authorized software applications or processes. Function or library isolation is designed to permit data access only by authorized subroutines or modules within a larger application, blocking access by any other system element, including unauthorized code in the larger application.

    == Threat model ==

    As confidential computing is concerned with the protection of data in use, only certain threat models can be addressed by this technique. Other types of attacks are better addressed by other privacy-enhancing technologies.

    === In scope ===

    The following threat vectors are generally considered in scope for confidential computing:

      • Software attacks: including attacks on the host’s software and firmware. This may include the operating system, hypervisor, BIOS, other software and workloads.
      • Protocol attacks: including "attacks on protocols associated with attestation as well as workload and data transport". This includes vulnerabilities in the "provisioning or placement of the workload" or data that could cause a compromise.
      • Cryptographic attacks: including "vulnerabilities found in ciphers and algorithms due to a number of factors, including mathematical breakthroughs, availability of computing power and new computing approaches such as quantum computing". The CCC notes several caveats in this threat vector, including the relative difficulty of upgrading cryptographic algorithms in hardware and recommendations that software and firmware be kept up to date. A multi-faceted, defense-in-depth strategy is recommended as a best practice.
      • Basic physical attacks: including cold boot attacks, bus and cache snooping, and plugging attack devices into an existing port, such as a PCI Express slot or USB port.
      • Basic upstream supply-chain attacks: including attacks that would compromise TEEs through changes such as added debugging ports.

    The degree and mechanism of protection against these threats vary with specific confidential computing implementations.

    === Out of scope ===

    Threats generally defined as out of scope for confidential computing include:

      • Sophisticated physical attacks: including physical attacks that "require long-term and/or invasive access to hardware" such as chip scraping techniques and electron microscope probes.
      • Upstream hardware supply-chain attacks: including attacks on the CPU manufacturing process and on the CPU supply chain during key injection or generation at manufacture. Attacks on components of a host system that are not directly providing the capabilities of the trusted execution environment are also generally out of scope.
      • Availability attacks: confidential computing is designed to protect the confidentiality and integrity of protected data and code. It does not address availability attacks such as Denial of Service or Distributed Denial of Service attacks.

    == Use cases ==

    Confidential computing can be deployed in the public cloud, on-premise data centers, or distributed "edge" locations, including network nodes, branch offices, industrial systems and others.
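
    Whatever the deployment location, the attestation step described under Properties gates the release of confidential data. The toy Python sketch below shows only that control flow: the verifier releases a secret if the report is authentic, fresh, and carries the expected code measurement. Everything in it is an assumption made for illustration. Real hardware-based schemes use asymmetric signatures rooted in the processor and vendor-specific report formats, whereas this sketch stands in a shared HMAC key (`HARDWARE_KEY`) and hypothetical function names.

```python
import hashlib
import hmac
import secrets

# Stand-in for a hardware-rooted key; purely hypothetical.
HARDWARE_KEY = b"demo-key-standing-in-for-a-hardware-root"


def measure(code: bytes) -> str:
    """Measurement of the code loaded into the (simulated) TEE."""
    return hashlib.sha256(code).hexdigest()


def make_report(code: bytes, nonce: bytes) -> dict:
    """Report the simulated TEE returns: measurement and nonce, authenticated."""
    measurement = measure(code)
    tag = hmac.new(HARDWARE_KEY, measurement.encode() + nonce, hashlib.sha256).hexdigest()
    return {"measurement": measurement, "nonce": nonce, "tag": tag}


def release_secret(report: dict, expected_measurement: str, nonce: bytes, secret: bytes):
    """Return the secret only if the report is authentic, fresh and as expected."""
    expected_tag = hmac.new(
        HARDWARE_KEY, report["measurement"].encode() + nonce, hashlib.sha256
    ).hexdigest()
    authentic = hmac.compare_digest(report["tag"], expected_tag)
    fresh = hmac.compare_digest(report["nonce"], nonce)
    trusted = hmac.compare_digest(report["measurement"], expected_measurement)
    return secret if (authentic and fresh and trusted) else None


approved_code = b"def handler(data): return sum(data)"
nonce = secrets.token_bytes(16)  # chosen by the verifier for this session

ok = release_secret(make_report(approved_code, nonce), measure(approved_code), nonce, b"data-key")
bad = release_secret(make_report(b"tampered code", nonce), measure(approved_code), nonce, b"data-key")
print(ok is not None, bad is None)
```

    The nonce ties each report to one verification session, so a report captured earlier cannot be replayed to obtain the secret.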
    === Data privacy and security ===

    Confidential computing protects the confidentiality and integrity of data and code from the infrastructure provider, unauthorized or malicious software and system administrators, and other cloud tenants, which may be a concern for organizations seeking control over sensitive or regulated data. The additional security capabilities offered by confidential computing can help accelerate the transition of more sensitive workloads to the cloud or edge locations.

    === Multi-party analytics ===

    Confidential computing can enable multiple parties to engage in joint analysis using confidential or regulated data inside a TEE while preserving privacy and regulatory compliance. In this case, all parties benefit from the shared analysis, but no party's sensitive data or confidential code is exposed to the other parties or system host. Examples include multiple healthcare organizations contributing data to medical research, or multiple banks collaborating to identify financial fraud or money laundering.

    Oxford University researchers proposed an alternative paradigm called "Confidential Remote Computing" (CRC), which supports confidential operations in Trusted Execution Environments across endpoint computers, treating multiple stakeholders as mutually distrustful data, algorithm and hardware providers.

    === Confidential generative AI ===

    Confidential computing technologies can be applied to various stages of a generative AI deployment to help increase data or model privacy, security, and regulatory compliance. TEEs and remote attestation can protect the integrity of data during AI model training, keep

  • Markov blanket

    Markov blanket

    In statistics and machine learning, a Markov blanket of a random variable is a set of variables that renders the variable conditionally independent of all other variables in the system. This concept is central in probabilistic graphical models and feature selection. If a Markov blanket is minimal, meaning that no variable in it can be removed without losing this conditional independence, it is called a Markov boundary. Identifying a Markov blanket or boundary allows for efficient inference and helps isolate relevant variables for prediction or causal reasoning. The terms Markov blanket and Markov boundary were coined by Judea Pearl in 1988. A Markov blanket may be derived from the structure of a probabilistic graphical model such as a Bayesian network or Markov random field.

    == Definition ==

    A Markov blanket of a random variable $Y$ in a random variable set $\mathcal{S} = \{X_1, \ldots, X_n\}$ is any subset $\mathcal{S}_1$ of $\mathcal{S}$, conditioned on which the other variables are independent of $Y$:

    $$Y \perp\!\!\!\perp \mathcal{S} \smallsetminus \mathcal{S}_1 \mid \mathcal{S}_1.$$

    This means that $\mathcal{S}_1$ contains at least all the information one needs to infer $Y$, and the variables in $\mathcal{S} \smallsetminus \mathcal{S}_1$ are redundant.

    In general, a given Markov blanket is not unique. Any set in $\mathcal{S}$ that contains a Markov blanket is also a Markov blanket itself. In particular, $\mathcal{S}$ is a Markov blanket of $Y$ in $\mathcal{S}$.

    === Example ===

    In a Bayesian network, the Markov blanket of a node consists of its parents, its children, and its children's other parents (i.e., co-parents). Knowing the values of these nodes makes the target node conditionally independent of the rest of the network. In a Markov random field, the Markov blanket of a node is simply its immediate neighbors.

    == Markov condition ==

    The concept of a Markov blanket is rooted in the Markov condition, which states that in a probabilistic graphical model, each variable is conditionally independent of its non-descendants given its parents. This condition implies the existence of a minimal separating set (the Markov blanket) that shields a variable from the rest of the network.

    For instance, when a person holds an object stationary against gravity, the object’s acceleration is fully determined by its direct causes, namely the upward force from the hand and the downward gravitational pull. Other variables such as air pressure or temperature are causally irrelevant.

    == Markov boundary ==

    A Markov boundary of $Y$ in $\mathcal{S}$ is a subset $\mathcal{S}_2$ of $\mathcal{S}$ such that $\mathcal{S}_2$ itself is a Markov blanket of $Y$, but no proper subset of $\mathcal{S}_2$ is a Markov blanket of $Y$. In other words, a Markov boundary is a minimal Markov blanket.

    The Markov boundary of a node $A$ in a Bayesian network is the set of nodes composed of $A$'s parents, $A$'s children, and $A$'s children's other parents. In a Markov random field, the Markov boundary for a node is the set of its neighboring nodes. In a dependency network, the Markov boundary for a node is the set of its parents.

    === Uniqueness of Markov boundary ===

    The Markov boundary always exists. Under some mild conditions, the Markov boundary is unique. However, for most practical and theoretical scenarios multiple Markov boundaries may provide alternative solutions. When there are multiple Markov boundaries, quantities measuring causal effect could fail.

    == In cognitive science ==

    In the study of consciousness, brain function, and complex adaptive systems, Markov blankets are proposed as a mathematical mechanism that delimits the extent of cognitive entities, whether physical or causal.
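
    For a Bayesian network, the rule stated above (parents, children, and the children's other parents) translates directly into a few lines of Python. The sketch assumes the network is given as a dictionary mapping each node to the list of its parents. The function name and the small example network are illustrative choices.

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a Bayesian network given as {child: [parents]}."""
    own_parents = set(parents.get(node, ()))
    children = {child for child, ps in parents.items() if node in ps}
    co_parents = {p for child in children for p in parents[child]}
    return (own_parents | children | co_parents) - {node}


# Example network: Cloudy -> Sprinkler, Cloudy -> Rain,
# Sprinkler -> WetGrass, Rain -> WetGrass.
network = {
    "Cloudy": [],
    "Sprinkler": ["Cloudy"],
    "Rain": ["Cloudy"],
    "WetGrass": ["Sprinkler", "Rain"],
}

print(sorted(markov_blanket(network, "Sprinkler")))  # ['Cloudy', 'Rain', 'WetGrass']
print(sorted(markov_blanket(network, "Cloudy")))     # ['Rain', 'Sprinkler']
```

    For Sprinkler, Rain enters the blanket only as a co-parent of WetGrass. There is no edge between Sprinkler and Rain themselves.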

  • One-class classification

    One-class classification

    In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, is an approach to the training of binary classifiers in which only examples of one of the two classes are used. Examples include the monitoring of helicopter gearboxes, motor failure prediction, or assessing the operational status of a nuclear plant as 'normal': In such scenarios, there are few, if any, examples of the catastrophic system states – rare outliers – that comprise the second class. Alternatively, the class that is being focused on may cover a small, coherent subset of the data and the training may rely on an information bottleneck approach. In practice, counter-examples from the second class may be used in later rounds of training to further refine the algorithm. == Overview == The term one-class classification (OCC) was coined by Moya & Hush (1996) and many applications can be found in scientific literature, for example outlier detection, anomaly detection, novelty detection. A feature of OCC is that it uses only sample points from the assigned class, so that a representative sampling is not strictly required for non-target classes. == Introduction == SVM based one-class classification (OCC) relies on identifying the smallest hypersphere (with radius r, and center c) consisting of all the data points. This method is called Support Vector Data Description (SVDD). Formally, the problem can be defined in the following constrained optimization form, min r , c r 2 subject to, | | Φ ( x i ) − c | | 2 ≤ r 2 ∀ i = 1 , 2 , . . . , n {\displaystyle \min _{r,c}r^{2}{\text{ subject to, }}||\Phi (x_{i})-c||^{2}\leq r^{2}\;\;\forall i=1,2,...,n} However, the above formulation is highly restrictive, and is sensitive to the presence of outliers. 
Therefore, a more flexible formulation that allows for the presence of outliers is used: $\min_{r,c,\zeta} r^{2}+\frac{1}{\nu n}\sum_{i=1}^{n}\zeta_{i}$ subject to $\|\Phi(x_{i})-c\|^{2}\leq r^{2}+\zeta_{i}$ for all $i=1,2,\ldots,n$. From the Karush–Kuhn–Tucker conditions for optimality, we get $c=\sum_{i=1}^{n}\alpha_{i}\Phi(x_{i})$, where the $\alpha_{i}$'s are the solution to the following optimization problem: $\max_{\alpha}\sum_{i=1}^{n}\alpha_{i}\kappa(x_{i},x_{i})-\sum_{i,j=1}^{n}\alpha_{i}\alpha_{j}\kappa(x_{i},x_{j})$ subject to $\sum_{i=1}^{n}\alpha_{i}=1$ and $0\leq\alpha_{i}\leq\frac{1}{\nu n}$ for all $i=1,2,\ldots,n$. The introduction of a kernel function provides additional flexibility to the one-class SVM (OSVM) algorithm. === PU (Positive Unlabeled) learning === A similar problem is PU learning, in which a binary classifier is constructed by semi-supervised learning from only positive and unlabeled sample points. In PU learning, two sets of examples are assumed to be available for training: the positive set $P$ and a mixed set $U$, which is assumed to contain both positive and negative samples, but without these being labeled as such. This contrasts with other forms of semi-supervised learning, where it is assumed that a labeled set containing examples of both classes is available in addition to unlabeled samples. A variety of techniques exist to adapt supervised classifiers to the PU learning setting, including variants of the EM algorithm.
PU learning has been successfully applied to text, time series, bioinformatics tasks, and remote sensing data. == Approaches == Several approaches have been proposed to solve one-class classification (OCC). The approaches can be divided into three main categories: density estimation, boundary methods, and reconstruction methods. === Density estimation methods === Density estimation methods rely on estimating the density of the data points and setting a threshold on it. These methods rely on assuming a distribution, such as a Gaussian or a Poisson distribution. Discordancy tests can then be used to test new objects. These methods are robust to scale variance. The Gaussian model is one of the simplest methods to create one-class classifiers. Owing to the central limit theorem (CLT), these methods work best when a large number of samples is present and the samples are perturbed by small independent error values. The probability distribution for a $d$-dimensional object is given by $p_{\mathcal{N}}(z;\mu;\Sigma)=\frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(z-\mu)^{T}\Sigma^{-1}(z-\mu)\right\}$, where $\mu$ is the mean and $\Sigma$ is the covariance matrix. Computing the inverse of the covariance matrix ($\Sigma^{-1}$) is the costliest operation, and in cases where the data is not scaled properly or has singular directions, the pseudo-inverse $\Sigma^{+}$ is used to approximate the inverse; it is calculated as $\Sigma^{T}(\Sigma\Sigma^{T})^{-1}$. === Boundary methods === Boundary methods focus on setting boundaries around a small set of points, called target points. These methods attempt to optimize the volume. Boundary methods rely on distances, and hence are not robust to scale variance.
The k-centers method, NN-d, and SVDD are some of the key examples. In the k-centers algorithm, $k$ small balls of equal radius are placed so as to minimize the maximum of all minimum distances between training objects and the centers. Formally, the following error is minimized: $\varepsilon_{k\text{-center}}=\max_{i}(\min_{k}\|x_{i}-\mu_{k}\|^{2})$. The algorithm uses a forward search method with random initialization, where the radius is determined by the maximum distance of the objects that any given ball should capture. After the centers are determined, for any given test object $z$ the distance can be calculated as $d_{k\text{-center}}(z)=\min_{k}\|z-\mu_{k}\|^{2}$. === Reconstruction methods === Reconstruction methods use prior knowledge and a generating process to build a generating model that best fits the data. New objects can be described in terms of a state of the generating model. Some examples of reconstruction methods for OCC are k-means clustering, learning vector quantization, and self-organizing maps. == Applications == === Document classification === The basic Support Vector Machine (SVM) paradigm is trained using both positive and negative examples; however, studies have shown there are many valid reasons for using only positive examples. When the SVM algorithm is modified to only use positive examples, the process is considered one-class classification. One situation where this type of classification might prove useful to the SVM paradigm is in trying to identify a user's web sites of interest based only on the user's browsing history. === Biomedical studies === One-class classification can be particularly useful in biomedical studies, where data from other classes can often be difficult or impossible to obtain.
In studying biomedical data it can be difficult and/or expensive to obtain the set of labeled data from the second class that would be necessary to perform a two-class classification. A study from The Scientific World Journal found that the typicality approach is the most useful in analysing biomedical data because it can be applied to any type of dataset (continuous, discrete, or nominal). The typicality approach is based on the clustering of data by examining data and placing it into new or existing clusters. To apply typicality to one-class classification for biomedical studies, each new observation, $y_{0}$, is compared to the target class, $C$, and identified as an outlier or a member of the target class. === Unsupervised Concept Drift Detection === One-class classification has similarities with unsupervised concept drift detection, where both aim to identify whether the unseen data share similar characteristics to the initial data. A concept here refers to the fixed probability distribution from which data is drawn. In unsupervised concept drift detection, the goal is to detect if the data distribution changes without utilizing class labels. In one-class classification, the flow of data is not important. Unseen data is classified as typical or outlier depending on its characteristics, whether it is from the initi
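
The Gaussian density approach described under "Density estimation methods" can be illustrated with a minimal sketch. It assumes NumPy; the function names and the 95% training quantile used as the acceptance threshold are illustrative choices, not part of the original description. Thresholding the squared Mahalanobis distance is equivalent to thresholding the Gaussian density, because the density decreases monotonically with that distance.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean and inverse covariance from target-class samples X of shape (n, d)."""
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    # pinv gives the Moore-Penrose pseudo-inverse, which also covers singular covariances
    sigma_inv = np.linalg.pinv(sigma)
    return mu, sigma_inv

def mahalanobis_sq(Z, mu, sigma_inv):
    """Squared Mahalanobis distance (z - mu)^T Sigma^{-1} (z - mu) for each row of Z."""
    diff = Z - mu
    return np.einsum("ij,jk,ik->i", diff, sigma_inv, diff)

def fit_threshold(X, mu, sigma_inv, quantile=0.95):
    """Assumed rule: accept the given fraction of the training (target) samples."""
    return np.quantile(mahalanobis_sq(X, mu, sigma_inv), quantile)

def predict(Z, mu, sigma_inv, threshold):
    """True for objects accepted as members of the target class, False for outliers."""
    return mahalanobis_sq(Z, mu, sigma_inv) <= threshold
```

Only target-class samples are needed for fitting; the quantile controls how many of them are rejected as outliers.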

  • Tanagra (machine learning)


    Tanagra is a free suite of machine learning software for research and academic purposes developed by Ricco Rakotomalala at the Lumière University Lyon 2, France. Tanagra supports several standard data mining tasks such as visualization, descriptive statistics, instance selection, feature selection, feature construction, regression, factor analysis, clustering, classification and association rule learning. Tanagra is an academic project. It is widely used in French-speaking universities. Tanagra is frequently used in real studies and in software comparison papers. == History == The development of Tanagra was started in June 2003. The first version was distributed in December 2003. Tanagra is the successor of Sipina, another free data mining tool which is intended only for supervised learning tasks (classification), especially the interactive and visual construction of decision trees. Sipina is still available online and is maintained. Tanagra is an "open source project" as every researcher can access the source code and add their own algorithms, as long as they agree and conform to the software distribution license. The main purpose of the Tanagra project is to give researchers and students user-friendly data mining software that conforms to the present norms of software development in this domain (especially in the design of its GUI and the way to use it) and allows the analysis of either real or synthetic data. From 2006, Ricco Rakotomalala made an important documentation effort. A large number of tutorials are published on a dedicated website. They describe the statistical and machine learning methods and their implementation with Tanagra on real case studies. The use of other free data mining tools on the same problems is also widely described. The comparison of the tools enables readers to understand the possible differences in the presentation of results. == Description == Tanagra works similarly to current data mining tools.
The user can visually design a data mining process in a diagram. Each node is a statistical or machine learning technique, and the connection between two nodes represents the data transfer. But unlike the majority of tools which are based on the workflow paradigm, Tanagra is very simplified. The treatments are represented in a tree diagram. The results are displayed in an HTML format. This makes it easy to export the outputs in order to visualize the results in a browser. It is also possible to copy the result tables to a spreadsheet. Tanagra makes a good compromise between statistical approaches (e.g. parametric and nonparametric statistical tests), multivariate analysis methods (e.g. factor analysis, correspondence analysis, cluster analysis, regression) and machine learning techniques (e.g. neural network, support vector machine, decision trees, random forest).

  • Pocket (service)


    Pocket, formerly known as Read It Later, was a social bookmarking service for storing, sharing and discovering web bookmarks, first released in 2007. Mozilla, the developer of Pocket, announced in May 2025 that it was discontinuing the service and would shut it down in July of that year. == History == Pocket was introduced in August 2007 as a Mozilla Firefox browser extension named Read It Later by Nathan (Nate) Weiner. Once his product was used by millions of people, he moved his office to Silicon Valley and four other people joined the Read It Later team. Weiner's intention was for the application to be like a TiVo directory for web content and to give users access to that content on any device. Read It Later obtained venture capital investments of US$2.5 million in 2011 and $5.0 million in 2012. The 2011 funding came from Foundation Capital, Baseline Ventures, Google Ventures, Founder Collective and unnamed angel investors. The company rejected an acquisition offer by Evernote over concerns that Evernote intended to shut down the Read It Later service and amalgamate its functionality into Evernote's main service. Initially, the Read It Later app was available in a free version and a paid version that included additional features. After the rebranding to Pocket, all paid features were made available in a free and advertisement-free app. In May 2014, a paid subscription service called Pocket Premium was introduced, adding server-side storage of articles and more powerful search tools. In June 2015, Pocket was included in Firefox, via a toolbar button and a link to a user's Pocket list in the bookmarks menu. The integration was controversial, as users voiced concerns about the direct integration of a proprietary service into an open source application and about the fact that it could not be completely disabled without editing advanced settings, unlike other third-party extensions.
A Mozilla spokesperson stated that the feature was meant to leverage the service's popularity among Firefox users and clarified that all code related to the integration was open source. The spokesperson added that "[Mozilla had] gotten lots of positive feedback about the integration from users". On February 27, 2017, Pocket announced that it had been acquired by Mozilla Corporation, the commercial arm of Firefox's non-profit development group. Mozilla staff stated that Pocket would continue to operate as an independent subsidiary but that it would be leveraged as part of an ongoing "Context Graph" project. There were plans to open-source the server-side code of Pocket, though only parts of the project had been open-sourced as of 2024. On May 22, 2025, Mozilla announced that it would shut down Pocket on July 8, 2025. Exports of user data would be available until October 8, 2025, when accounts would be deleted. The email newsletter Pocket Hits was rebranded as Ten Tabs on June 12 as part of the closure, with it being changed to release only on weekdays. == Functions == The application allows the user to save an article or web page to remote servers for later reading. The article is sent to the user's Pocket list (synced to all of their devices) for offline reading. Pocket makes the article more readable by removing clutter and enabling the user to add tags and adjust text settings. == User base == The application had 17 million users and 1 billion saves, as of September 2015. Pocket was listed among Time magazine's 50 Best Android Applications for 2013. == Reception == Kent German of CNET said that "Read It Later is oh so incredibly useful for saving all the articles and news stories I find while commuting or waiting in line." Erez Zukerman of PC World said that supporting the developer is enough reason to buy what he deemed a "handy app". 
Bill Barol of Forbes said that although Read It Later works less well than Instapaper, "it makes my beloved Instapaper look and feel a little stodgy." In 2015, Pocket was awarded a Material Design Award for Adaptive Layout by Google for their Android application.

  • Latent class model


    In statistics, a latent class model (LCM) is a model for clustering multivariate discrete data. It assumes that the data arise from a mixture of discrete distributions, within each of which the variables are independent. It is called a latent class model because the class to which each data point belongs is unobserved (or latent). Latent class analysis (LCA) is a subset of structural equation modeling used to find groups or subtypes of cases in multivariate categorical data. These groups or subtypes of cases are called "latent classes". When faced with the following situation, a researcher might opt to use LCA to better understand the data: Symptoms a, b, c, and d have been recorded in a variety of patients diagnosed with diseases X, Y, and Z. Disease X is associated with symptoms a, b, and c; disease Y is linked to symptoms b, c, and d; and disease Z is connected to symptoms a, c, and d. In this context, the LCA would attempt to detect the presence of latent classes (i.e., the disease entities), thus creating patterns of association in the symptoms. As in factor analysis, LCA can also be used to classify cases according to their maximum likelihood class membership probability. The key criterion for resolving the LCA is identifying latent classes in which the observed symptom associations are effectively rendered null. This is because within each class, the diseases responsible for the symptoms create a structure of dependencies. As a result, the symptoms become conditionally independent, meaning that, given the class a case belongs to, the symptoms are no longer related to one another. == Model == Within each latent class, the observed variables are statistically independent—an essential aspect of latent class modeling. Usually, the observed variables are statistically dependent. By introducing the latent variable, independence is restored in the sense that within classes, variables are independent (local independence). 
Therefore, the association between the observed variables is explained by the classes of the latent variable (McCutcheon, 1987). In one form, the LCM is written as $p_{i_{1},i_{2},\ldots,i_{N}}\approx\sum_{t}^{T}p_{t}\,\prod_{n}^{N}p_{i_{n},t}^{n}$, where $T$ is the number of latent classes and $p_{t}$ are the so-called recruitment or unconditional probabilities that should sum to one. The $p_{i_{n},t}^{n}$ are the marginal or conditional probabilities. For a two-way latent class model, the form is $p_{ij}\approx\sum_{t}^{T}p_{t}\,p_{it}\,p_{jt}$. This two-way model is related to probabilistic latent semantic analysis and non-negative matrix factorization. The probability model used in LCA is closely related to the Naive Bayes classifier. The main difference is that in LCA, the class membership of an individual is a latent variable, whereas in Naive Bayes classifiers, the class membership is an observed label. == Related methods == There are a number of methods with distinct names and uses that share a common relationship. Cluster analysis is, like LCA, used to discover taxon-like groups of cases in data. Multivariate mixture estimation (MME) is applicable to continuous data and assumes that such data arise from a mixture of distributions, such as a set of heights arising from a mixture of men and women. If a multivariate mixture estimation is constrained so that measures must be uncorrelated within each distribution, it is termed latent profile analysis. Modified to handle discrete data, this constrained analysis is known as LCA. Discrete latent trait models further constrain the classes to form from segments of a single dimension, allocating members to classes based on that dimension. An example would be assigning cases to social classes based on ability or merit.
In a practical instance, the variables could be multiple choice items of a political questionnaire. In this case, the data consists of an N-way contingency table with answers to the items for a number of respondents. In this example, the latent variable refers to political opinion, and the latent classes to political groups. Given group membership, the conditional probabilities specify the chance that certain answers are chosen. == Application == LCA may be used in many fields, such as: collaborative filtering, Behavior Genetics and Evaluation of diagnostic tests.
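
As a rough illustration of the two-way model in the Model section, the sketch below builds the joint table from given class probabilities and computes the class membership probabilities of a single cell. It assumes NumPy and that the parameters are already known; the function names and array layout are hypothetical, and estimating the parameters from data (for example with an EM algorithm) is not shown.

```python
import numpy as np

def two_way_lcm(p_t, p_it, p_jt):
    """Joint table p_ij = sum_t p_t * p_it * p_jt.

    p_t  : (T,)   class probabilities, summing to one
    p_it : (I, T) P(row category i | class t), each column sums to one
    p_jt : (J, T) P(column category j | class t), each column sums to one
    """
    return np.einsum("t,it,jt->ij", p_t, p_it, p_jt)

def class_posterior(p_t, p_it, p_jt, i, j):
    """P(class t | observed cell (i, j)), using local independence within each class."""
    joint = p_t * p_it[i] * p_jt[j]
    return joint / joint.sum()
```

Assigning each case to the class with the largest posterior corresponds to the classification by maximum class membership probability mentioned above.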

  • Types of artificial neural networks


    Types of neural networks (NN) include a family of techniques. The simplest types have static components, including number of units, number of layers, unit weights and topology. Dynamic NNs evolve via learning. Some types allow/require learning to be "supervised" by the operator, while others operate independently. Some types operate purely in hardware, while others are purely software and run on general purpose computers. The main types are: Transformers: these use attention to analyze every token in the input stream against every other token in the stream. That technique has enabled neural networks to reach the general public via chatbots, code generators and many other forms. Convolutional neural networks (CNN): a FNN that uses kernels and regularization to evade problems in prior generations of NNs. They are typically used to analyze visual and other two-dimensional data. Generative adversarial networks set networks (of varying structure) against each other, each trying to push the other(s) to produce better results such as winning a game or to deceive the opponent about the authenticity of an input. == Feedforward == In feedforward neural networks the information moves from the input to output directly in every layer. There can be hidden layers with or without cycles/loops to sequence inputs. Feedforward networks can be constructed with various types of units, such as binary McCulloch–Pitts neurons, the simplest of which is the perceptron. Continuous neurons, frequently with sigmoidal activation, are used in the context of backpropagation. == Group method of data handling == The Group Method of Data Handling (GMDH) features fully automatic structural and parametric model optimization. The node activation functions are Kolmogorov–Gabor polynomials that permit additions and multiplications. It uses a deep multilayer perceptron with eight layers. It is a supervised learning network that grows layer by layer, where each layer is trained by regression analysis. 
Useless items are detected using a validation set, and pruned through regularization. The size and depth of the resulting network depends on the task. == Autoencoder == An autoencoder, autoassociator or Diabolo network is similar to the multilayer perceptron (MLP) – with an input layer, an output layer and one or more hidden layers connecting them. However, the output layer has the same number of units as the input layer. Its purpose is to reconstruct its own inputs (instead of emitting a target value). Therefore, autoencoders are unsupervised learning models. An autoencoder is used for unsupervised learning of efficient codings, typically for the purpose of dimensionality reduction and for learning generative models of data. == Probabilistic == A probabilistic neural network (PNN) is a four-layer feedforward neural network. The layers are Input, hidden pattern, hidden summation, and output. In the PNN algorithm, the parent probability distribution function (PDF) of each class is approximated by a Parzen window and a non-parametric function. Then, using PDF of each class, the class probability of a new input is estimated and Bayes’ rule is employed to allocate it to the class with the highest posterior probability. It was derived from the Bayesian network and a statistical algorithm called Kernel Fisher discriminant analysis. It is used for classification and pattern recognition. == Time delay == A time delay neural network (TDNN) is a feedforward architecture for sequential data that recognizes features independent of sequence position. In order to achieve time-shift invariance, delays are added to the input so that multiple data points (points in time) are analyzed together. It usually forms part of a larger pattern recognition system. It has been implemented using a perceptron network whose connection weights were trained with back propagation (supervised learning). 
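
The time-delay idea can be sketched as a single layer that applies the same weights to every window of consecutive frames. This is a minimal NumPy sketch: the function name, the tanh activation and the array shapes are assumptions made for illustration, not a description of any particular TDNN implementation.

```python
import numpy as np

def tdnn_layer(x, W, b):
    """Apply one time-delay layer to a sequence.

    x : (T, F) sequence of T frames with F features each
    W : (D, F, H) weights shared across time; D is the number of frames analyzed together
    b : (H,) bias
    Returns (T - D + 1, H) activations, one row per window position.
    """
    D = W.shape[0]
    T = x.shape[0]
    # Stack every window of D consecutive frames: shape (T - D + 1, D, F)
    windows = np.stack([x[t:t + D] for t in range(T - D + 1)])
    return np.tanh(np.einsum("tdf,dfh->th", windows, W) + b)
```

Because the weights are shared across window positions, shifting the input in time shifts the activations by the same amount; pooling over the time axis (an added assumption here, for example a maximum) then yields features that do not depend on where the pattern occurred.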
== Convolutional == A convolutional neural network (CNN, or ConvNet or shift invariant or space invariant) is a class of deep network, composed of one or more convolutional layers with fully connected layers (matching those in typical ANNs) on top. It uses tied weights and pooling layers, in particular max-pooling. It is often structured via Fukushima's convolutional architecture. CNNs are variations of multilayer perceptrons that use minimal preprocessing. This architecture allows CNNs to take advantage of the 2D structure of input data. Their unit connectivity pattern is inspired by the organization of the visual cortex. Units respond to stimuli in a restricted region of space known as the receptive field. Receptive fields partially overlap, over-covering the entire visual field. Unit response can be approximated mathematically by a convolution operation. CNNs are suitable for processing visual and other two-dimensional data. They have shown superior results in both image and speech applications. They can be trained with standard backpropagation. CNNs are easier to train than other regular, deep, feed-forward neural networks and have many fewer parameters to estimate. Capsule Neural Networks (CapsNet) add structures called capsules to a CNN and reuse output from several capsules to form more stable (with respect to various perturbations) representations. Examples of applications in computer vision include DeepDream and robot navigation. CNNs have wide applications in image and video recognition, recommender systems and natural language processing. == Deep stacking network == A deep stacking network (DSN) (deep convex network) is based on a hierarchy of blocks of simplified neural network modules. It was introduced in 2011 by Deng and Yu. It formulates the learning as a convex optimization problem with a closed-form solution, emphasizing the mechanism's similarity to stacked generalization.
Each DSN block is a simple module that is easy to train by itself in a supervised fashion without backpropagation across all blocks. Each block consists of a simplified multi-layer perceptron (MLP) with a single hidden layer. The hidden layer $h$ has logistic sigmoidal units, and the output layer has linear units. Connections between these layers are represented by weight matrix $U$; input-to-hidden-layer connections have weight matrix $W$. Target vectors $t$ form the columns of matrix $T$, and the input data vectors $x$ form the columns of matrix $X$. The matrix of hidden units is $\boldsymbol{H}=\sigma(\boldsymbol{W}^{T}\boldsymbol{X})$. Modules are trained in order, so lower-layer weights $W$ are known at each stage. The function $\sigma$ performs the element-wise logistic sigmoid operation. Each block estimates the same final label class $y$, and its estimate is concatenated with original input $X$ to form the expanded input for the next block. Thus, the input to the first block contains the original data only, while downstream blocks' input adds the output of preceding blocks. Then learning the upper-layer weight matrix $U$ given other weights in the network can be formulated as a convex optimization problem, $\min_{U^{T}}f=\|\boldsymbol{U}^{T}\boldsymbol{H}-\boldsymbol{T}\|_{F}^{2}$, which has a closed-form solution. Unlike other deep architectures, such as DBNs, the goal is not to discover the transformed feature representation. The structure of the hierarchy of this kind of architecture makes parallel learning straightforward, as a batch-mode optimization problem. In purely discriminative tasks, DSNs outperform conventional DBNs. === Tensor deep stacking networks === This architecture is a DSN extension.
It offers two important improvements: it uses higher-order information from covariance statistics, and it transforms the non-convex problem of a lower-layer to a convex sub-problem of an upper-layer. TDSNs use covariance statistics in a bilinear mapping from each of two distinct sets of hidden units in the same layer to predictions, via a third-order tensor. While parallelization and scalability are not considered seriously in conventional DNNs, all learning for DSNs and TDSNs is done in batch mode, to allow parallelization. Parallelization allows scaling the design to larger (deeper) architectures and data sets. The basic architecture is suitable for diverse tasks such as classification and regression. == Physics-informed == Such a neural network is designed for the numerical solution of mathematical equations, such as differential, integral, delay, fractional and others. As input parameters, a PINN accepts variables (spatial, temporal, and others) and passes them through the network block. At the output, it produces an approximate solution and substitutes it into the mathematical model, considering the initial and boundary conditions. If the solution does not satisfy the required accuracy, backpropagation is used to correct the solution. Besides PINN, other architectures have been developed to produce surrogate models for scientific comput
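
The closed-form step described in the deep stacking network section can be sketched as an ordinary least-squares problem. This is a minimal NumPy sketch under stated assumptions: the lower-layer weights W are simply passed in as given (the text only says they are known at each stage), and the function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def fit_dsn_block(X, T, W):
    """Solve min_U ||U^T H - T||_F^2 for one block.

    X : (d, n) input vectors as columns
    T : (c, n) target vectors as columns
    W : (d, h) fixed lower-layer weights (assumed given)
    Returns U of shape (h, c) and the block's predictions U^T H of shape (c, n).
    """
    H = sigmoid(W.T @ X)  # hidden units, shape (h, n)
    # U^T H = T is equivalent to H^T U = T^T, an ordinary least-squares problem
    U, *_ = np.linalg.lstsq(H.T, T.T, rcond=None)
    return U, U.T @ H

def stack_input(X, Y):
    """Expanded input for the next block: original input with the block's estimate appended."""
    return np.vstack([X, Y])
```

Each subsequent block would be fitted the same way on the expanded input, with its own lower-layer weights.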

  • Multiple correspondence analysis


    In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA can be viewed as an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables. == As an extension of correspondence analysis == MCA is performed by applying the CA algorithm to either an indicator matrix (also called complete disjunctive table – CDT) or a Burt table formed from these variables. An indicator matrix is an individuals × variables matrix, where the rows represent individuals and the columns are dummy variables representing categories of the variables. Analyzing the indicator matrix allows the direct representation of individuals as points in geometric space. The Burt table is the symmetric matrix of all two-way cross-tabulations between the categorical variables, and has an analogy to the covariance matrix of continuous variables. Analyzing the Burt table is a more natural generalization of simple correspondence analysis, and individuals or the means of groups of individuals can be added as supplementary points to the graphical display. In the indicator matrix approach, associations between variables are uncovered by calculating the chi-square distance between different categories of the variables and between the individuals (or respondents). These associations are then represented graphically as "maps", which eases the interpretation of the structures in the data. Oppositions between rows and columns are then maximized, in order to uncover the underlying dimensions best able to describe the central oppositions in the data. 
As in factor analysis or principal component analysis, the first axis is the most important dimension, the second axis the second most important, and so on, in terms of the amount of variance accounted for. The number of axes to be retained for analysis is determined by calculating modified eigenvalues. == Details == Since MCA is adapted to draw statistical conclusions from categorical variables (such as multiple-choice questions), the first thing one needs to do is to transform quantitative data (such as age, size, weight, time of day, etc.) into categories (using, for instance, statistical quantiles). When the dataset is completely represented as categorical variables, one is able to build the corresponding so-called complete disjunctive table. We denote this table $X$. If $I$ persons answered a survey with $J$ multiple-choice questions with 4 answers each, $X$ will have $I$ rows and $4J$ columns. More theoretically, assume $X$ is the complete disjunctive table of $I$ observations of $K$ categorical variables. Assume also that the $k$-th variable has $J_{k}$ different levels (categories) and set $J=\sum_{k=1}^{K}J_{k}$. The table $X$ is then an $I\times J$ matrix with all coefficients being $0$ or $1$. Set the sum of all entries of $X$ to be $N$ and introduce $Z=X/N$. In an MCA, there are also two special vectors: first $r$, which contains the sums along the rows of $Z$, and $c$, which contains the sums along the columns of $Z$.
Denote by $D_{r}=\text{diag}(r)$ and $D_{c}=\text{diag}(c)$ the diagonal matrices with $r$ and $c$ respectively on the diagonal. With this notation, computing an MCA consists essentially of the singular value decomposition of the matrix $M=D_{r}^{-1/2}(Z-rc^{T})D_{c}^{-1/2}$. The decomposition of $M$ gives $P$, $\Delta$ and $Q$ such that $M=P\Delta Q^{T}$, where $P$ and $Q$ are two unitary matrices and $\Delta$ is the generalized diagonal matrix of the singular values (with the same shape as $Z$). The squares of the positive singular values in $\Delta$ are the nonzero eigenvalues of $M^{T}M$. The interest of MCA comes from the way observations (rows) and variables (columns) in $Z$ can be decomposed. This decomposition is called a factor decomposition. The coordinates of the observations in the factor space are given by $F=D_{r}^{-1/2}P\Delta$. The $i$-th row of $F$ represents the $i$-th observation in the factor space. Similarly, the coordinates of the variables (in the same factor space as the observations) are given by $G=D_{c}^{-1/2}Q\Delta$. == Recent works and extensions == In recent years, several students of Jean-Paul Benzécri have refined MCA and incorporated it into a more general framework of data analysis known as geometric data analysis. This involves the development of direct connections between simple correspondence analysis, principal component analysis and MCA with a form of cluster analysis known as Euclidean classification. Two extensions have great practical use.
It is possible to include, as active elements in the MCA, several quantitative variables. This extension is called factor analysis of mixed data (see below). Very often, in questionnaires, the questions are structured in several issues. In the statistical analysis it is necessary to take this structure into account. This is the aim of multiple factor analysis, which balances the different issues (i.e. the different groups of variables) within a global analysis and provides, beyond the classical results of factorial analysis (mainly graphics of individuals and of categories), several results (indicators and graphics) specific to the group structure. == Application fields == In the social sciences, MCA is arguably best known for its application by Pierre Bourdieu, notably in his books La Distinction, Homo Academicus and The State Nobility. Bourdieu argued that there was an internal link between his vision of the social as spatial and relational, captured by the notion of field, and the geometric properties of MCA. Sociologists following Bourdieu's work most often opt for the analysis of the indicator matrix, rather than the Burt table, largely because of the central importance accorded to the analysis of the 'cloud of individuals'. == Multiple correspondence analysis and principal component analysis == MCA can also be viewed as a PCA applied to the complete disjunctive table (CDT). To do this, the CDT must be transformed as follows. Let $y_{ik}$ denote the general term of the CDT: $y_{ik}$ is equal to 1 if individual $i$ possesses the category $k$ and 0 if not. Let $p_k$ denote the proportion of individuals possessing the category $k$.
The transformed CDT (TCDT) has as general term: $$x_{ik} = y_{ik}/p_k - 1$$ The unstandardized PCA applied to the TCDT, with column $k$ having the weight $p_k$, leads to the results of MCA. This equivalence is fully explained in a book by Jérôme Pagès. It plays an important theoretical role because it opens the way to the simultaneous treatment of quantitative and qualitative variables. Two methods simultaneously analyze these two types of variables: factor analysis of mixed data and, when the active variables are partitioned into several groups, multiple factor analysis. This equivalence does not mean that MCA is a particular case of PCA, just as it is not a particular case of CA. It only means that these methods are closely linked to one another, as they belong to the same family: the factorial methods. == Software == Numerous data analysis software packages include MCA, such as STATA and SPSS. The R package FactoMineR also features MCA. This software is related to a book describing the basic methods for performing MCA. There is also a Python package for [1] which works with numpy array matrices; the package has not been implemented yet for Spark dataframes.
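
The computation described in the Details section (build $Z$, center and rescale it into $M$, take its singular value decomposition, then derive $F$ and $G$) can be sketched with NumPy. This is only an illustrative sketch, not the method of any package named above: the function name `mca` and the toy table `X_demo` are made up for the example, every row and column of the table is assumed to have a nonzero sum, and the reduced SVD is used, so the singular values come back as a vector rather than as a matrix with the same shape as $Z$.

```python
import numpy as np


def mca(X):
    """Sketch of MCA on a complete disjunctive (one-hot) table X.

    Returns row coordinates F, column coordinates G and the squared
    singular values of M. Assumes no row or column of X sums to zero.
    """
    X = np.asarray(X, dtype=float)
    N = X.sum()                      # sum of all entries of X
    Z = X / N
    r = Z.sum(axis=1)                # sums along the rows of Z
    c = Z.sum(axis=0)                # sums along the columns of Z

    # M = D_r^{-1/2} (Z - r c^T) D_c^{-1/2}, written element-wise:
    # M[i, j] = (Z[i, j] - r[i] * c[j]) / sqrt(r[i] * c[j])
    rc = np.outer(r, c)
    M = (Z - rc) / np.sqrt(rc)

    # Reduced SVD: M = P @ diag(delta) @ Qt
    P, delta, Qt = np.linalg.svd(M, full_matrices=False)

    F = (P * delta) / np.sqrt(r)[:, None]      # D_r^{-1/2} P Delta
    G = (Qt.T * delta) / np.sqrt(c)[:, None]   # D_c^{-1/2} Q Delta
    return F, G, delta ** 2


# Hypothetical toy survey: 4 individuals, 2 questions with 2 answers each.
X_demo = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
])
F, G, squared_singular_values = mca(X_demo)
```

Row $i$ of `F` holds the coordinates of observation $i$, and row $j$ of `G` holds the coordinates of category $j$, both in the same factor space. Columns that correspond to a zero singular value carry no information and would normally be dropped.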

    Read more →
  • 2024–present global memory supply shortage

    2024–present global memory supply shortage

    A global computer memory supply shortage started in 2024 due to supply constraints and rapid price escalation in the semiconductor memory market, particularly affecting DRAM and NAND flash memory. This shortage is sometimes labelled by tech media outlets as "RAMmageddon" or the "RAMpocalypse". Unlike the 2020–2023 global chip shortage, which stemmed primarily from COVID-19 pandemic-related supply chain disruptions, this shortage is driven by a structural reallocation of manufacturing capacity toward high-margin products for artificial intelligence infrastructure, creating scarcity of computer memory in consumer and enterprise PC markets. According to a 2026 analysis by Kearney's PERLab, the shortage is expected to last at least until 2030, with CEOs agreeing with the timelines. == Background == Following a severe market downturn in 2022–2023, major memory manufacturers (Samsung Electronics, SK Hynix, and Micron Technology) implemented strategic production cuts to stabilize pricing. By mid-2024, the rapid expansion of generative AI services triggered unprecedented demand for specialized memory products, particularly High Bandwidth Memory (HBM) used in AI accelerators and data center GPUs. Specialized components of semiconductor technology are also experiencing supply constraints due to high demand from AI applications. For example, glass cloth, a high-performance glass fiber substrate used for power-efficient high-speed data transfer and a crucial component of semiconductor manufacturing, is experiencing a supply crisis. Nitto Boseki, a Japanese firm with an overwhelming monopoly on its production, has been unable to meet increased demand, leaving chipmakers such as Qualcomm, Apple, Nvidia and AMD to compete to secure supply. There are also reports of smaller electronics companies struggling to find suppliers for components such as NAND flash.
Memory suppliers are adapting to increased demand and market unpredictability by requiring prepayment or shorter payment time frames, which makes it more difficult for smaller firms to acquire the capital they need to survive. By 2026, due to steadily increasing demand on resources, CPUs are also experiencing shortages caused by low fabrication capacity, prioritisation of server CPUs, and increased demand, with CPU prices also forecast to increase by as much as 15%. The demand for memory has also increased strain on other electronic components such as hard disk drives, with reports that Western Digital's hard disk supply for 2026 was booked for enterprise applications before February 2026. A 2024 McKinsey analysis projected that global demand for AI-ready data center capacity would grow at approximately 33% annually through 2030, with AI workloads consuming roughly 70% of total data center capacity by the decade's end. In addition, according to Kearney's State of Semiconductor 2025 Report, executives were already expecting a shortage in the <8nm wafer size, with memory chips being mentioned as an acute source of concern. Multiple companies mentioned being prepared for it through long-term agreements with RAM suppliers or by amassing additional inventory. On 24 March 2026, Google announced TurboQuant, a memory compression technology focused on large language models (LLMs) and vector search engines, which it claimed achieves 6x lower memory consumption in tested local LLMs and 8x performance enhancement in tests running on H100 accelerators. The technology is also a drop-in enhancement for existing inference pipelines. Amid speculation about memory demand trends, memory manufacturers SanDisk, Micron, Western Digital and Seagate, among other companies involved in memory manufacture, experienced stock price declines. Prices of memory kits also fell in the following months, although they remained inflated.
== Causes == === HBM production displacement === HBM manufacturing requires significantly more wafer capacity per bit than standard DRAM modules. Industry sources reported that as manufacturers allocated increasing wafer capacity to HBM production to meet contracts with AI infrastructure providers, the supply of conventional DDR4 and DDR5 modules for consumer PCs and smartphones contracted sharply. By September 2025, Samsung Electronics had reportedly expanded its 1c DRAM capacity to target 60,000 wafers per month specifically for HBM4 production, further diverting resources from consumer memory lines. === Geopolitical and trade barriers === The supply chain was further constrained by escalating trade tensions between the United States and China. Throughout 2025, fears of U.S. regulatory backlash and new tariff structures led major manufacturers like Samsung and SK Hynix to halt sales of older semiconductor manufacturing equipment to Chinese entities, effectively capping production capacity in the region. Additionally, proposed tariff policies by the U.S. administration in late 2025 prompted supply chain realignments, with Apple reportedly accelerating plans to source all U.S.-bound iPhones from India to avoid potential levies. === NAND flash capacity constraints === In the NAND flash segment, manufacturers prioritized higher-margin enterprise SSDs for data center applications while phasing out older process nodes more rapidly than anticipated. In November 2025, contract prices for NAND wafers increased by more than 60% month-over-month for certain product categories, with 512GB TLC experiencing the steepest rise as legacy manufacturing capacity was retired. == Impact on industry and consumers == === Manufacturer responses === Major PC manufacturers responded to component cost increases with significant price adjustments and supply chain strategies. 
Dell Technologies Chief Operating Officer Jeff Clarke stated during a November 2025 analyst call that the company had "never witnessed costs escalating at the current pace," describing tighter availability across DRAM, hard drives, and NAND flash memory. Analysts at Morgan Stanley downgraded Dell Technologies stock from "Overweight" to "Underweight" in late 2025, citing the company's heavy exposure to rising server memory costs. The firm warned that skyrocketing memory prices could significantly erode margins for server and PC OEMs. Conversely, Apple Inc. was reportedly less affected than its competitors, having secured long-term supply agreements for DRAM through the first quarter of 2026. Lenovo Chief Financial Officer Winston Cheng described the cost surge as "unprecedented" and disclosed that the company's memory inventories were approximately 50% above normal levels in anticipation of further price increases. === Consumer electronics sector === The shortage particularly affected smartphone manufacturers and other consumer electronics producers. DRAM prices reportedly rose by 172% throughout 2025, leading manufacturers like Samsung to halt new orders for DDR5 modules to reassess pricing structures and Micron to exit its 'Crucial' brand of consumer products. In Tokyo's Akihabara electronics district, retailers began limiting purchases of memory products to prevent hoarding, with prices for popular DDR5 memory modules more than doubling in some cases. Despite the broad trend of rising hardware costs, some companies engaged in aggressive pricing strategies to maintain market share; for example, Sony reduced the price of the PlayStation 5 by $100 for Black Friday 2025, potentially absorbing increased component costs to stimulate software ecosystem growth. With memory prices more than doubling in a single quarter, HP revealed in its Q1 2026 earnings call that memory costs account for 35% of PC build materials, up from 15–18% in the previous quarter.
Despite strong Q1 2026 earnings driven by the Windows 11 upgrade cycle and AI PC adoption, HP warned investors of low operating margins and a decline of up to a double-digit percentage for the coming quarter. TrendForce, an IT analytics company, revised its 2026 PC market forecast from 1.7% year-over-year growth to a 2.6% year-over-year decline, against a backdrop of steadily increasing prices and the supply crisis. Research and analytics firms Gartner and IDC expect the worldwide PC market to decline 10–11% and the smartphone market to decline 8–9% in 2026. Gartner also projects that rising memory prices will make low-margin entry-level laptops under 500 USD financially unviable in two years. The RAM shortage and the resulting increase in memory prices have delayed the release of Valve's second Steam Machine. The device was originally set to launch in early 2026. === AI infrastructure competition === Technology companies including Google, Amazon, Microsoft, and Meta Platforms placed open-ended orders with memory suppliers, indicating they would accept as much supply as available regardless of cost, according to Reuters sources. The limited supply of AI chips has been cited as a reason for the slowdown in compute growth. In October 2025, OpenAI formally announced a strategic partnership using letters of intent with Samsung Electronics and SK Hynix

    Read more →
  • Naive Bayes classifier

    Naive Bayes classifier

    In statistics, naive (sometimes simple or idiot's) Bayes classifiers are a family of "probabilistic classifiers" which assume that the features are conditionally independent, given the target class. In other words, a naive Bayes model assumes the information about the class provided by each variable is unrelated to the information from the others, with no information shared between the predictors. The highly unrealistic nature of this assumption, called the naive independence assumption, is what gives the classifier its name. These classifiers are some of the simplest Bayesian network models. Naive Bayes classifiers generally perform worse than more advanced models like logistic regressions, especially at quantifying uncertainty (with naive Bayes models often producing wildly overconfident probabilities). However, they are highly scalable, requiring only one parameter for each feature or predictor in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression (simply by counting observations in each group), rather than the expensive iterative approximation algorithms required by most other models. Despite the use of Bayes' theorem in the classifier's decision rule, naive Bayes is not (necessarily) a Bayesian method, and naive Bayes models can be fit to data using either Bayesian or frequentist methods. == Introduction == Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. 
A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that Bayes classification is outperformed by other approaches, such as boosted trees or random forests. An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification. == Probabilistic model == Abstractly, naive Bayes is a conditional probability model: it assigns probabilities $p(C_k \mid x_1, \ldots, x_n)$ for each of the $K$ possible outcomes or classes $C_k$ given a problem instance to be classified, represented by a vector $\mathbf{x} = (x_1, \ldots, x_n)$ encoding some $n$ features (independent variables). The problem with the above formulation is that if the number of features $n$ is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. The model must therefore be reformulated to make it more tractable.
Using Bayes' theorem, the conditional probability can be decomposed as: $$p(C_k \mid \mathbf{x}) = \frac{p(C_k)\ p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}$$ In plain English, using Bayesian probability terminology, the above equation can be written as $$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}$$ In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on $C$ and the values of the features $x_i$ are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model $$p(C_k, x_1, \ldots, x_n)$$ which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability: $$\begin{aligned} p(C_k, x_1, \ldots, x_n) &= p(x_1, \ldots, x_n, C_k) \\ &= p(x_1 \mid x_2, \ldots, x_n, C_k)\ p(x_2, \ldots, x_n, C_k) \\ &= p(x_1 \mid x_2, \ldots, x_n, C_k)\ p(x_2 \mid x_3, \ldots, x_n, C_k)\ p(x_3, \ldots, x_n, C_k) \\ &= \cdots \\ &= p(x_1 \mid x_2, \ldots, x_n, C_k)\ p(x_2 \mid x_3, \ldots, x_n, C_k) \cdots p(x_{n-1} \mid x_n, C_k)\ p(x_n \mid C_k)\ p(C_k) \end{aligned}$$ Now the "naive" conditional independence assumptions come into play: assume that all features in $\mathbf{x}$ are mutually independent, conditional on the category $C_k$.
Under this assumption, $$p(x_i \mid x_{i+1}, \ldots, x_n, C_k) = p(x_i \mid C_k).$$ Thus, the joint model can be expressed as $$\begin{aligned} p(C_k \mid x_1, \ldots, x_n) &\propto p(C_k, x_1, \ldots, x_n) \\ &= p(C_k)\ p(x_1 \mid C_k)\ p(x_2 \mid C_k)\ p(x_3 \mid C_k) \cdots \\ &= p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k), \end{aligned}$$ where $\propto$ denotes proportionality since the denominator $p(\mathbf{x})$ is omitted. This means that under the above independence assumptions, the conditional distribution over the class variable $C$ is: $$p(C_k \mid x_1, \ldots, x_n) = \frac{1}{Z}\ p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$$ where the evidence $Z = p(\mathbf{x}) = \sum_k p(C_k)\ p(\mathbf{x} \mid C_k)$ is a scaling factor dependent only on $x_1, \ldots, x_n$, that is, a constant if the values of the feature variables are known. Often, it is only necessary to discriminate between classes.
In that case, the scaling factor is irrelevant, and it is sufficient to calculate the log-probability up to an additive constant: $$\ln p(C_k \mid x_1, \ldots, x_n) = \ln p(C_k) + \sum_{i=1}^{n} \ln p(x_i \mid C_k) \underbrace{-\ln Z}_{\text{irrelevant}}$$ The scaling factor is irrelevant, since discrimination subtracts it away: $$\ln \frac{p(C_k \mid x_1, \ldots, x_n)}{p(C_l \mid x_1, \ldots, x_n)} = \left( \ln p(C_k) + \sum_{i=1}^{n} \ln p(x_i \mid C_k) \right) - \left( \ln p(C_l) + \sum_{i=1}^{n} \ln p(x_i \mid C_l) \right)$$ There are two benefits of using log-probability. One is that it allows an interpretation in information theory, where log-probabilities are units of information in nats. Another is that it avoids arithmetic underflow. === Constructing a classifier from the probability model === The discussion so far has derived the independent feature model, that is, the naive Bayes probability model. The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable so as to minimize the probability of misclassification; this is known as the maximum a posteriori or MAP decision rule. The corresponding classifier, a Bayes classifier, is the function that assigns a class label $\hat{y} = C_k$ for some $k$ as follows: $$\hat{y} = \underset{k \in \{1, \ldots, K\}}{\operatorname{argmax}}\ p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k).$$
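
The MAP decision rule and the log-probability form of the model can be illustrated with a small categorical naive Bayes classifier that estimates its parameters by counting. This is a minimal sketch under stated assumptions rather than a reference implementation: the names `fit` and `predict` and the fruit data are invented for the example, and the additive smoothing parameter `alpha` is an extra that the text above does not discuss (setting it to 0 would give plain maximum-likelihood counts, at the risk of taking the logarithm of zero).

```python
import math
from collections import Counter, defaultdict


def fit(rows, labels, alpha=1.0):
    """Estimate ln p(C_k) and ln p(x_i | C_k) by counting observations.

    `alpha` is an additive-smoothing constant (an assumption of this sketch).
    """
    n_features = len(rows[0])
    class_counts = Counter(labels)
    # counts[i][c][v]: how often feature i took value v within class c
    counts = [defaultdict(Counter) for _ in range(n_features)]
    values = [set() for _ in range(n_features)]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            counts[i][c][v] += 1
            values[i].add(v)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}

    def log_likelihood(i, v, c):
        numerator = counts[i][c][v] + alpha
        denominator = class_counts[c] + alpha * len(values[i])
        return math.log(numerator / denominator)

    return log_prior, log_likelihood


def predict(model, row):
    """MAP rule: argmax over k of ln p(C_k) + sum_i ln p(x_i | C_k)."""
    log_prior, log_likelihood = model
    scores = {
        c: lp + sum(log_likelihood(i, v, c) for i, v in enumerate(row))
        for c, lp in log_prior.items()
    }
    return max(scores, key=scores.get)


# Hypothetical training data: (color, shape) -> fruit
rows = [("red", "round"), ("red", "round"), ("green", "round"),
        ("yellow", "long"), ("yellow", "long"), ("green", "long")]
labels = ["apple", "apple", "apple", "banana", "banana", "banana"]

model = fit(rows, labels)
prediction = predict(model, ("red", "round"))
```

The scaling factor $Z$ never appears in the code: because it is the same for every class, dropping it does not change which class attains the maximum. Summing logarithms instead of multiplying probabilities is what avoids the arithmetic underflow mentioned above.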

    Read more →
  • C4.5 algorithm

    C4.5 algorithm

    C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason, C4.5 is often referred to as a statistical classifier. In 2011, authors of the Weka machine learning software described the C4.5 algorithm as "a landmark decision tree program that is probably the machine learning workhorse most widely used in practice to date". It became quite popular after ranking #1 in the pre-eminent paper "Top 10 Algorithms in Data Mining", published by Springer LNCS in 2008. == Algorithm == C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set $S = \{s_1, s_2, \ldots\}$ of already classified samples. Each sample $s_i$ consists of a p-dimensional vector $(x_{1,i}, x_{2,i}, \ldots, x_{p,i})$, where the $x_j$ represent attribute values or features of the sample, as well as the class in which $s_i$ falls. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the partitioned sublists. This algorithm has a few base cases: (1) All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class. (2) None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class. (3) Instance of previously unseen class encountered.
Again, C4.5 creates a decision node higher up the tree using the expected value. === Pseudocode === In pseudocode, the general algorithm for building decision trees is: (1) Check for the above base cases. (2) For each attribute a, find the normalized information gain ratio from splitting on a. (3) Let a_best be the attribute with the highest normalized information gain. (4) Create a decision node that splits on a_best. (5) Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node. == Improvements from ID3 algorithm == C4.5 made a number of improvements to ID3. Some of these are: (1) Handling both continuous and discrete attributes: in order to handle continuous attributes, C4.5 creates a threshold and then splits the list into those whose attribute value is above the threshold and those that are less than or equal to it. (2) Handling training data with missing attribute values: C4.5 allows attribute values to be marked as missing. Missing attribute values are simply not used in gain and entropy calculations. (3) Handling attributes with differing costs. (4) Pruning trees after creation: C4.5 goes back through the tree once it has been created and attempts to remove branches that do not help by replacing them with leaf nodes. == Improvements in C5.0/See5 algorithm == Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows), which he markets commercially. C5.0 offers a number of improvements on C4.5. Some of these are: (1) Speed: C5.0 is significantly faster than C4.5 (several orders of magnitude). (2) Memory usage: C5.0 is more memory efficient than C4.5. (3) Smaller decision trees: C5.0 gets similar results to C4.5 with considerably smaller decision trees. (4) Support for boosting: boosting improves the trees and gives them more accuracy. (5) Weighting: C5.0 allows you to weight different cases and misclassification types. (6) Winnowing: a C5.0 option automatically winnows the attributes to remove those that may be unhelpful.
Source for a single-threaded Linux version of C5.0 is available under the GNU General Public License (GPL).
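
The splitting criterion from the Algorithm section, and the attribute-selection steps of the pseudocode, can be sketched for discrete attributes as follows. This is an illustration only, not Quinlan's code: the function names and the toy data are invented, "normalized information gain" is taken here to mean information gain divided by the entropy of the split itself (the gain ratio), and thresholds for continuous attributes, missing values and pruning are left out.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    n = len(items)
    return -sum((k / n) * math.log2(k / n) for k in Counter(items).values())


def gain_ratio(rows, labels, attr):
    """Information gain from splitting on column `attr`, divided by the
    entropy of the split itself (the normalization assumed in this sketch)."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)

    n = len(labels)
    # Expected entropy of the class labels after the split
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder

    split_info = entropy([row[attr] for row in rows])
    # An attribute with a single value cannot split the data at all
    return gain / split_info if split_info > 0 else 0.0


def best_attribute(rows, labels):
    """Index of the attribute with the highest normalized information gain."""
    return max(range(len(rows[0])), key=lambda a: gain_ratio(rows, labels, a))


# Hypothetical samples: (outlook, humidity) -> play?
rows = [("sunny", "high"), ("sunny", "normal"),
        ("rain", "high"), ("rain", "normal")]
labels = ["no", "yes", "no", "yes"]

a_best = best_attribute(rows, labels)
```

In this toy data the humidity column (index 1) separates the classes perfectly while the outlook column carries no information, so the split is made on humidity. A full tree builder would then check the base cases and recurse on each of the resulting sublists.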

    Read more →