AI Excel Spreadsheet Maker Free

AI Excel Spreadsheet Maker Free — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Tradeshift

    Tradeshift

    Tradeshift is a cloud based business network and platform for purchase-to-pay automation, supply chain payments, marketplaces, virtual cards and supply chain financing. Its 2018 round of funding, led by Goldman Sachs, raised US$250 million at a valuation of $1.1 billion, giving the company unicorn status. Tradeshift is headquartered in San Francisco, California and has offices in London, Copenhagen, Bucharest and Kuala Lumpur. Tradeshift has reprocessed over $1 trillion USD through transactions on its network. == History == Tradeshift was founded in 2010 by Christian Lanng, Mikkel Hippe Brun, and Gert Sylvest. Inspiration for Tradeshift came after they created the world's first large scale peer-to-peer infrastructure for an e-business called NemHandel. The founders also had leading roles (Governing board member, Technical Director) in the European Commission project PEPPOL inside the European Union. In 2010, the Tradeshift platform launched in May in Copenhagen. Tradeshift won the European Startup Awards in the category of "Best Business or Enterprise Startup." In 2011, Tradeshift made its app marketplace available. In 2012, Tradeshift moved their headquarters from Copenhagen to San Francisco. In 2013, Tradeshift opened an R&D center in Suzhou, China. Tradeshift opened an additional office in London. And LATAM e-invoicing capabilities were added through partnership with Invoiceware. In 2014, Tradeshift expanded with offices in Tokyo, Paris, and Munich. The EU Commission officially approved the Universal Business Language (UBL) data format – a format Tradeshift supports – as eligible for referencing in tenders from public administrations. In 2015, Tradeshift won the Circulars "Digital Disruptor" Award at the WEF conference in Davos, Switzerland. Tradeshift also acquired product information management company Merchantry, and launched e-procurement and supplier risk management solutions. In 2016, Tradeshift acquired Hyper Travel and secured a $75 million series-D round funding. In 2017, Tradeshift acquired IBX Business Network and launches Tradeshift Ada. In 2018, Tradeshift secured a $250 million series-E round funding. and launched Blockchain Payments, the latter as part of Tradeshift Pay. In December 2018 Tradeshift acquired Babelway, an online B2B integration platform. The acquisition added three new office locations to Tradeshift (Salt Lake City, Louvain-la-neuve, Belgium, Cairo Egypt). In Q3 2018, Tradeshift reported year-over-year revenue growth of 400%, new bookings growth of 284%, and gross merchandise volume (GMV) growth of 262%. New total contract value also grew by US$47 million. Additionally, it added 27 new customers including Hertz, Shiseido, ECU and multiple Fortune 500 companies. In July 2023, HSBC and Tradeshift announced an agreement to launch a new, jointly owned business focused on the development of embedded finance solutions and financial services apps. As part of the agreement, HSBC made a $35 million investment into Tradeshift and joined its board. The agreement was part of a funding round which is expected to raise a minimum of $70 million from HSBC and other investors. The new joint venture will allow HSBC and Tradeshift to deploy a range of digital solutions across Tradeshift and other platforms. This includes payment and fintech services embedded into trade, e-commerce and marketplace experiences. In September 2023, CEO Lanng was fired for "gross misconduct on multiple grounds," including "allegations of sexual assault and harassment." Tradeshift was alleged to have fired his accuser after she complained to the company's human resources department, its co-founders and members of its board of directors about his abuse. == Financials == The company's valuation as of May 2018 was $1.1 billion. Tradeshift is now considered a unicorn, and, according to Bloomberg, will not need any further funding. Jan 14, 2020, Tradeshift announced that they had raised $240 million in Series F finance. == Acquisitions == In 2015, Tradeshift acquired product information management company Merchantry. Merchantry is a retail product information management (PIM) software for multi-vendor ecommerce retailers. In 2016, Tradeshift acquired Hyper Travel. Hyper Travel is a travel management service that allows customers to access travel agents via its native messaging apps, SMS, and email. In 2017, Tradeshift acquired IBX Group. In 2018, Tradeshift acquired Babelway, an online B2B integration platform.

    Read more →
  • Stochastic block model

    Stochastic block model

    The stochastic block model is a generative model for random graphs. This model tends to produce graphs containing communities, subsets of nodes characterized by being connected with one another with particular edge densities. For example, edges may be more common within communities than between communities. Its mathematical formulation was first introduced in 1983 in the field of social network analysis by Paul W. Holland et al. The stochastic block model is important in statistics, machine learning, and network science, where it serves as a useful benchmark for the task of recovering community structure in graph data. == Definition == The stochastic block model takes the following parameters: The number n {\displaystyle n} of vertices; a partition of the vertex set { 1 , … , n } {\displaystyle \{1,\ldots ,n\}} into disjoint subsets C 1 , … , C r {\displaystyle C_{1},\ldots ,C_{r}} , called communities; a symmetric r × r {\displaystyle r\times r} matrix P {\displaystyle P} of edge probabilities. The edge set is then sampled at random as follows: any two vertices u ∈ C i {\displaystyle u\in C_{i}} and v ∈ C j {\displaystyle v\in C_{j}} are connected by an edge with probability P i j {\displaystyle P_{ij}} . An example problem is: given a graph with n {\displaystyle n} vertices, where the edges are sampled as described, recover the groups C 1 , … , C r {\displaystyle C_{1},\ldots ,C_{r}} . == Special cases == If the probability matrix is a constant, in the sense that P i j = p {\displaystyle P_{ij}=p} for all i , j {\displaystyle i,j} , then the result is the Erdős–Rényi model G ( n , p ) {\displaystyle G(n,p)} . This case is degenerate—the partition into communities becomes irrelevant—but it illustrates a close relationship to the Erdős–Rényi model. The planted partition model is the special case that the values of the probability matrix P {\displaystyle P} are a constant p {\displaystyle p} on the diagonal and another constant q {\displaystyle q} off the diagonal. Thus two vertices within the same community share an edge with probability p {\displaystyle p} , while two vertices in different communities share an edge with probability q {\displaystyle q} . Sometimes it is this restricted model that is called the stochastic block model. The case where p > q {\displaystyle p>q} is called an assortative model, while the case p < q {\displaystyle p P j k {\displaystyle P_{ii}>P_{jk}} whenever j ≠ k {\displaystyle j\neq k} : all diagonal entries dominate all off-diagonal entries. A model is called weakly assortative if P i i > P i j {\displaystyle P_{ii}>P_{ij}} whenever i ≠ j {\displaystyle i\neq j} : each diagonal entry is only required to dominate the rest of its own row and column. Disassortative forms of this terminology exist, by reversing all inequalities. For some algorithms, recovery might be easier for block models with assortative or disassortative conditions of this form. == Typical statistical tasks == Much of the literature on algorithmic community detection addresses three statistical tasks: detection, partial recovery, and exact recovery. === Detection === The goal of detection algorithms is simply to determine, given a sampled graph, whether the graph has latent community structure. More precisely, a graph might be generated, with some known prior probability, from a known stochastic block model, and otherwise from a similar Erdos-Renyi model. The algorithmic task is to correctly identify which of these two underlying models generated the graph. === Partial recovery === In partial recovery, the goal is to approximately determine the latent partition into communities, in the sense of finding a partition that is correlated with the true partition significantly better than a random guess. === Exact recovery === In exact recovery, the goal is to recover the latent partition into communities exactly. The community sizes and probability matrix may be known or unknown. == Statistical lower bounds and threshold behavior == Stochastic block models exhibit a sharp threshold effect reminiscent of percolation thresholds. Suppose that we allow the size n {\displaystyle n} of the graph to grow, keeping the community sizes in fixed proportions. If the probability matrix remains fixed, tasks such as partial and exact recovery become feasible for all non-degenerate parameter settings. However, if we scale down the probability matrix at a suitable rate as n {\displaystyle n} increases, we observe a sharp phase transition: for certain settings of the parameters, it will become possible to achieve recovery with probability tending to 1, whereas on the opposite side of the parameter threshold, the probability of recovery tends to 0 no matter what algorithm is used. For partial recovery, the appropriate scaling is to take P i j = P ~ i j / n {\displaystyle P_{ij}={\tilde {P}}_{ij}/n} for fixed P ~ {\displaystyle {\tilde {P}}} , resulting in graphs of constant average degree. In the case of two equal-sized communities, in the assortative planted partition model with probability matrix P = ( p ~ / n q ~ / n q ~ / n p ~ / n ) , {\displaystyle P=\left({\begin{array}{cc}{\tilde {p}}/n&{\tilde {q}}/n\\{\tilde {q}}/n&{\tilde {p}}/n\end{array}}\right),} partial recovery is feasible with probability 1 − o ( 1 ) {\displaystyle 1-o(1)} whenever ( p ~ − q ~ ) 2 > 2 ( p ~ + q ~ ) {\displaystyle ({\tilde {p}}-{\tilde {q}})^{2}>2({\tilde {p}}+{\tilde {q}})} , whereas any estimator fails partial recovery with probability 1 − o ( 1 ) {\displaystyle 1-o(1)} whenever ( p ~ − q ~ ) 2 < 2 ( p ~ + q ~ ) {\displaystyle ({\tilde {p}}-{\tilde {q}})^{2}<2({\tilde {p}}+{\tilde {q}})} . For exact recovery, the appropriate scaling is to take P i j = P ~ i j log ⁡ n / n {\displaystyle P_{ij}={\tilde {P}}_{ij}\log n/n} , resulting in graphs of logarithmic average degree. Here a similar threshold exists: for the assortative planted partition model with r {\displaystyle r} equal-sized communities, the threshold lies at p ~ − q ~ = r {\displaystyle {\sqrt {\tilde {p}}}-{\sqrt {\tilde {q}}}={\sqrt {r}}} . In fact, the exact recovery threshold is known for the fully general stochastic block model. == Algorithms == In principle, exact recovery can be solved in its feasible range using maximum likelihood, but this amounts to solving a constrained or regularized cut problem such as minimum bisection that is typically NP-complete. Hence, no known efficient algorithms will correctly compute the maximum-likelihood estimate in the worst case. However, a wide variety of algorithms perform well in the average case, and many high-probability performance guarantees have been proven for algorithms in both the partial and exact recovery settings. Successful algorithms include spectral clustering of the vertices, semidefinite programming, forms of belief propagation, and community detection among others. == Variants == Several variants of the model exist. One minor tweak allocates vertices to communities randomly, according to a categorical distribution, rather than in a fixed partition. More significant variants include the degree-corrected stochastic block model, the hierarchical stochastic block model, the geometric block model, censored block model and the mixed-membership block model. == Topic models == Stochastic block model have been recognised to be a topic model on bipartite networks. In a network of documents and words, Stochastic block model can identify topics: group of words with a similar meaning. == Extensions to signed graphs == Signed graphs allow for both favorable and adverse relationships and serve as a common model choice for various data analysis applications, e.g., correlation clustering. The stochastic block model can be trivially extended to signed graphs by assigning both positive and negative edge weights or equivalently using a difference of adjacency matrices of two stochastic block models. == DARPA/MIT/AWS Graph Challenge: streaming stochastic block partition == GraphChallenge encourages community approaches to developing new solutions for analyzing graphs and sparse data derived from social media, sensor feeds, and scientific data to enable relationships between events to be discovered as they unfold in the field. Streaming stochastic block partition is one of the challenges since 2017. Spectral clustering has demonstrated outstanding performance compared to the original and even improved base algorithm, matching its quality of clusters while being multiple orders of magnitude faster.

    Read more →
  • Multi expression programming

    Multi expression programming

    Multi Expression Programming (MEP) is an evolutionary algorithm for generating mathematical functions describing a given set of data. MEP is a Genetic Programming variant encoding multiple solutions in the same chromosome. MEP representation is not specific (multiple representations have been tested). In the simplest variant, MEP chromosomes are linear strings of instructions. This representation was inspired by Three-address code. MEP strength consists in the ability to encode multiple solutions, of a problem, in the same chromosome. In this way, one can explore larger zones of the search space. For most of the problems this advantage comes with no running-time penalty compared with genetic programming variants encoding a single solution in a chromosome. == Representation == MEP chromosomes are arrays of instructions represented in Three-address code format. Each instruction contains a variable, a constant, or a function. If the instruction is a function, then the arguments (given as instruction's addresses) are also present. === Example of MEP program === Here is a simple MEP chromosome (labels on the left side are not a part of the chromosome): 1: a 2: b 3: + 1, 2 4: c 5: d 6: + 4, 5 7: 3, 5 == Fitness computation == When the chromosome is evaluated it is unclear which instruction will provide the output of the program. In many cases, a set of programs is obtained, some of them being completely unrelated (they do not have common instructions). For the above chromosome, here is the list of possible programs obtained during decoding: E1 = a, E2 = b, E4 = c, E5 = d, E3 = a + b. E6 = c + d. E7 = (a + b) d. Each instruction is evaluated as a possible output of the program. The fitness (or error) is computed in a standard manner. For instance, in the case of symbolic regression, the fitness is the sum of differences (in absolute value) between the expected output (called target) and the actual output. == Fitness assignment process == Which expression will represent the chromosome? Which one will give the fitness of the chromosome? In MEP, the best of them (which has the lowest error) will represent the chromosome. This is different from other GP techniques: In Linear genetic programming the last instruction will give the output. In Cartesian Genetic Programming the gene providing the output is evolved like all other genes. Note that, for many problems, this evaluation has the same complexity as in the case of encoding a single solution in each chromosome. Thus, there is no penalty in running time compared to other techniques. == Software == === MEPX === MEPX is a cross-platform (Windows, macOS, and Linux Ubuntu) free software for the automatic generation of computer programs. It can be used for data analysis, particularly for solving symbolic regression, statistical classification and time-series problems. === libmep === Libmep is a free and open source library implementing Multi Expression Programming technique. It is written in C++. === hmep === hmep is a new open source library implementing Multi Expression Programming technique in Haskell programming language.

    Read more →
  • Nearest neighbor search

    Nearest neighbor search

    Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values. Formally, the nearest neighbor (NN) search problem is defined as follows: given a set S of points in a space M and a query point q ∈ M {\displaystyle q\in M} , find the closest point in S to q. Donald Knuth in volume 3 of The Art of Computer Programming (1973) called it the post-office problem, referring to an application of assigning to a residence the nearest post office. A direct generalization of this problem is a k-NN search, where we need to find the k closest points. Most commonly M is a metric space and dissimilarity is expressed as a distance metric, which is symmetric and satisfies the triangle inequality. Even more common, M is taken to be the d-dimensional vector space where dissimilarity is measured using the Euclidean distance, Manhattan distance or other distance metric. However, the dissimilarity function can be arbitrary. One example is asymmetric Bregman divergence, for which the triangle inequality does not hold. == Applications == The nearest neighbor search problem arises in numerous fields of application, including: Pattern recognition – in particular for optical character recognition Statistical classification – see k-nearest neighbor algorithm Computer vision – for point cloud registration Computational geometry – see Closest pair of points problem Cryptanalysis – for lattice problem Databases – e.g. content-based image retrieval Coding theory – see maximum likelihood decoding Semantic search Vector databases, where nearest-neighbor lookup over embeddings is used to retrieve semantically similar records Retrieval-augmented generation systems, where nearest-neighbor retrieval over embeddings is used to fetch candidate passages or documents before generation Data compression – see MPEG-2 standard Robotic sensing Recommendation systems, e.g. see Collaborative filtering Internet marketing – see contextual advertising and behavioral targeting DNA sequencing Spell checking – suggesting correct spelling Plagiarism detection Similarity scores for predicting career paths of professional athletes. Cluster analysis – assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense, usually based on Euclidean distance Chemical similarity Sampling-based motion planning == Methods == Various solutions to the NNS problem have been proposed. The quality and usefulness of the algorithms are determined by the time complexity of queries as well as the space complexity of any search data structures that must be maintained. The informal observation usually referred to as the curse of dimensionality states that there is no general-purpose exact solution for NNS in high-dimensional Euclidean space using polynomial preprocessing and polylogarithmic search time. === Exact methods === ==== Linear search ==== The simplest solution to the NNS problem is to compute the distance from the query point to every other point in the database, keeping track of the "best so far". This algorithm, sometimes referred to as the naive approach, has a running time of O(dN), where N is the cardinality of S and d is the dimensionality of S. There are no search data structures to maintain, so the linear search has no space complexity beyond the storage of the database. Naive search can, on average, outperform space partitioning approaches on higher dimensional spaces. The absolute distance is not required for distance comparison, only the relative distance. In geometric coordinate systems the distance calculation can be sped up considerably by omitting the square root calculation from the distance calculation between two coordinates. The distance comparison will still yield identical results. ==== Space partitioning ==== Since the 1970s, the branch and bound methodology has been applied to the problem. In the case of Euclidean space, this approach encompasses spatial index or spatial access methods. Several space-partitioning methods have been developed for solving the NNS problem. Perhaps the simplest is the k-d tree, which iteratively bisects the search space into two regions containing half of the points of the parent region. Queries are performed via traversal of the tree from the root to a leaf by evaluating the query point at each split. Depending on the distance specified in the query, neighboring branches that might contain hits may also need to be evaluated. For constant dimension query time, average complexity is O(log N) in the case of randomly distributed points, worst case complexity is O(kN^(1-1/k)) Alternatively the R-tree data structure was designed to support nearest neighbor search in dynamic context, as it has efficient algorithms for insertions and deletions such as the R tree. R-trees can yield nearest neighbors not only for Euclidean distance, but can also be used with other distances. In the case of general metric space, the branch-and-bound approach is known as the metric tree approach. Particular examples include vp-tree and BK-tree methods. Using a set of points taken from a 3-dimensional space and put into a BSP tree, and given a query point taken from the same space, a possible solution to the problem of finding the nearest point-cloud point to the query point is given in the following description of an algorithm. (Strictly speaking, no such point may exist, because it may not be unique. But in practice, usually we only care about finding any one of the subset of all point-cloud points that exist at the shortest distance to a given query point.) The idea is, for each branching of the tree, guess that the closest point in the cloud resides in the half-space containing the query point. This may not be the case, but it is a good heuristic. After having recursively gone through all the trouble of solving the problem for the guessed half-space, now compare the distance returned by this result with the shortest distance from the query point to the partitioning plane. This latter distance is that between the query point and the closest possible point that could exist in the half-space not searched. If this distance is greater than that returned in the earlier result, then clearly there is no need to search the other half-space. If there is such a need, then you must go through the trouble of solving the problem for the other half space, and then compare its result to the former result, and then return the proper result. The performance of this algorithm is nearer to logarithmic time than linear time when the query point is near the cloud, because as the distance between the query point and the closest point-cloud point nears zero, the algorithm needs only perform a look-up using the query point as a key to get the correct result. === Approximation methods === An approximate nearest neighbor search algorithm is allowed to return points whose distance from the query is at most c {\displaystyle c} times the distance from the query to its nearest points. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. ==== Greedy search in proximity neighborhood graphs ==== Proximity graph methods (such as navigable small world graphs and HNSW) are considered the current state-of-the-art for the approximate nearest neighbors search. The methods are based on greedy traversing in proximity neighborhood graphs G ( V , E ) {\displaystyle G(V,E)} in which every point x i ∈ S {\displaystyle x_{i}\in S} is uniquely associated with vertex v i ∈ V {\displaystyle v_{i}\in V} . The search for the nearest neighbors to a query q in the set S takes the form of searching for the vertex in the graph G ( V , E ) {\displaystyle G(V,E)} . The basic algorithm – greedy search – works as follows: search starts from an enter-point vertex v i ∈ V {\displaystyle v_{i}\in V} by computing the distances from the query q to each vertex of its neighborhood { v j : ( v i , v j ) ∈ E } {\displaystyle \{v_{j}:(v_{i},v_{j})\in E\}} , and then finds a vertex with the minimal distance value. If the distance value between the query and the selected vertex is smaller than the one between the query and the current element, then the algorithm moves to the selected vertex, and it becomes new enter-point. The algorithm stops when it reaches a local minimum: a vertex whose neighborhood does not contain a vertex that is closer to the query than the vertex itself. The idea of proximity neighborhood graphs was exploited in multiple publications, including the seminal paper by Arya and Mount, in the VoroNet syst

    Read more →
  • Human visual system model

    Human visual system model

    A human visual system model (HVS model) is used by image processing, video processing and computer vision experts to deal with biological and psychological processes that are not yet fully understood. Such a model is used to simplify the behaviors of what is a very complex system. As our knowledge of the true visual system improves, the model is updated. Psychovisual study is the study of the psychology of vision. The human visual system model can produce desired effects in perception and vision. Examples of using an HVS model include color television, lossy compression, and Cathode-ray tube (CRT) television. Originally, it was thought that color television required too high a bandwidth for the then available technology. Then it was noticed that the color resolution of the HVS was much lower than the brightness resolution; this allowed color to be squeezed into the signal by chroma subsampling. Another example is lossy image compression, like JPEG. Our HVS model says we cannot see high frequency detail, so in JPEG we can quantize these components without a perceptible loss of quality. Similar concepts are applied in audio compression, where sound frequencies inaudible to humans are band-stop filtered. Several HVS features are derived from evolution when we needed to defend ourselves or hunt for food. We often see demonstrations of HVS features when we are looking at optical illusions. == Block diagram of HVS == == Assumptions about the HVS == Low-pass filter characteristic (limited number of rods in human eye): see Mach bands Lack of color resolution (fewer cones in human eye than rods) Motion sensitivity More sensitive in peripheral vision Stronger than texture sensitivity, e.g. viewing a camouflaged animal Texture stronger than disparity – 3D depth resolution does not need to be so accurate Integral Face recognition (babies smile at faces) Depth inverted face looks normal (facial features overrule depth information) Upside down face with inverted mouth and eyes looks normal == Examples of taking advantage of an HVS model == Flicker frequency of film and television using persistence of vision to fool viewer into seeing a continuous image Interlaced television painting half images to give the impression of a higher flicker frequency Color television (chrominance at half resolution of luminance corresponding to proportions of rods and cones in eye) Image compression (difficult to see higher frequencies more harshly quantized) Motion estimation (use luminance and ignore color) Watermarking and Steganography

    Read more →
  • Wake-sleep algorithm

    Wake-sleep algorithm

    The wake-sleep algorithm is an unsupervised learning algorithm for deep generative models, especially Helmholtz Machines. The algorithm is similar to the expectation-maximization algorithm, and optimizes the model likelihood for observed data. The name of the algorithm derives from its use of two learning phases, the “wake” phase and the “sleep” phase, which are performed alternately. It can be conceived as a model for learning in the brain, but is also being applied for machine learning. == Description == The goal of the wake-sleep algorithm is to find a hierarchical representation of observed data. In a graphical representation of the algorithm, data is applied to the algorithm at the bottom, while higher layers form gradually more abstract representations. Between each pair of layers are two sets of weights: Recognition weights, which define how representations are inferred from data, and generative weights, which define how these representations relate to data. == Training == Training consists of two phases – the “wake” phase and the “sleep” phase. It has been proven that this learning algorithm is convergent. === The "wake" phase === Neurons are fired by recognition connections (from what would be input to what would be output). Generative connections (leading from outputs to inputs) are then modified to increase probability that they would recreate the correct activity in the layer below – closer to actual data from sensory input. === The "sleep" phase === The process is reversed in the “sleep” phase – neurons are fired by generative connections while recognition connections are being modified to increase probability that they would recreate the correct activity in the layer above – further to actual data from sensory input. == Extensions == Since the recognition network is limited in its flexibility, it might not be able to approximate the posterior distribution of latent variables well. To better approximate the posterior distribution, it is possible to employ importance sampling, with the recognition network as the proposal distribution. This improved approximation of the posterior distribution also improves the overall performance of the model.

    Read more →
  • Quickprop

    Quickprop

    Quickprop is an iterative method for determining the minimum of the loss function of an artificial neural network, following an algorithm inspired by the Newton's method. Sometimes, the algorithm is classified to the group of the second order learning methods. It follows a quadratic approximation of the previous gradient step and the current gradient, which is expected to be close to the minimum of the loss function, under the assumption that the loss function is locally approximately square, trying to describe it by means of an upwardly open parabola. The minimum is sought in the vertex of the parabola. The procedure requires only local information of the artificial neuron to which it is applied. The k {\displaystyle k} -th approximation step is given by: Δ ( k ) w i j = Δ ( k − 1 ) w i j ( ∇ i j E ( k ) ∇ i j E ( k − 1 ) − ∇ i j E ( k ) ) {\displaystyle \Delta ^{(k)}\,w_{ij}=\Delta ^{(k-1)}\,w_{ij}\left({\frac {\nabla _{ij}\,E^{(k)}}{\nabla _{ij}\,E^{(k-1)}-\nabla _{ij}\,E^{(k)}}}\right)} Where w i j {\displaystyle w_{ij}} is the weight of input i {\displaystyle i} of neuron j {\displaystyle j} , and E {\displaystyle E} is the loss function. The Quickprop algorithm is an implementation of the error backpropagation algorithm, but the network can behave chaotically during the learning phase due to large step sizes.

    Read more →
  • Multifactor dimensionality reduction

    Multifactor dimensionality reduction

    Multifactor dimensionality reduction (MDR) is a statistical approach, also used in machine learning automatic approaches, for detecting and characterizing combinations of attributes or independent variables that interact to influence a dependent or class variable. MDR was designed specifically to identify nonadditive interactions among discrete variables that influence a binary outcome and is considered a nonparametric and model-free alternative to traditional statistical methods such as logistic regression. The basis of the MDR method is a constructive induction or feature engineering algorithm that converts two or more variables or attributes to a single attribute. This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of nonlinear or nonadditive interactions among the attributes such that prediction of the class variable is improved over that of the original representation of the data. == Illustrative example == Consider the following simple example using the exclusive OR (XOR) function. XOR is a logical operator that is commonly used in data mining and machine learning as an example of a function that is not linearly separable. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function such that Y = X1 XOR X2. Table 1 A machine learning algorithm would need to discover or approximate the XOR function in order to accurately predict Y using information about X1 and X2. An alternative strategy would be to first change the representation of the data using constructive induction to facilitate predictive modeling. The MDR algorithm would change the representation of the data (X1 and X2) in the following manner. MDR starts by selecting two attributes. In this simple example, X1 and X2 are selected. Each combination of values for X1 and X2 are examined and the number of times Y=1 and/or Y=0 is counted. In this simple example, Y=1 occurs zero times and Y=0 occurs once for the combination of X1=0 and X2=0. With MDR, the ratio of these counts is computed and compared to a fixed threshold. Here, the ratio of counts is 0/1 which is less than our fixed threshold of 1. Since 0/1 < 1 we encode a new attribute (Z) as a 0. When the ratio is greater than one we encode Z as a 1. This process is repeated for all unique combinations of values for X1 and X2. Table 2 illustrates our new transformation of the data. Table 2 The machine learning algorithm now has much less work to do to find a good predictive function. In fact, in this very simple example, the function Y = Z has a classification accuracy of 1. A nice feature of constructive induction methods such as MDR is the ability to use any data mining or machine learning method to analyze the new representation of the data. Decision trees, neural networks, or a naive Bayes classifier could be used in combination with measures of model quality such as balanced accuracy and mutual information. == Machine learning with MDR == As illustrated above, the basic constructive induction algorithm in MDR is very simple. However, its implementation for mining patterns from real data can be computationally complex. As with any machine learning algorithm there is always concern about overfitting. That is, machine learning algorithms are good at finding patterns in completely random data. It is often difficult to determine whether a reported pattern is an important signal or just chance. One approach is to estimate the generalizability of a model to independent datasets using methods such as cross-validation. Models that describe random data typically don't generalize. Another approach is to generate many random permutations of the data to see what the data mining algorithm finds when given the chance to overfit. Permutation testing makes it possible to generate an empirical p-value for the result. Replication in independent data may also provide evidence for an MDR model but can be sensitive to difference in the data sets. These approaches have all been shown to be useful for choosing and evaluating MDR models. An important step in a machine learning exercise is interpretation. Several approaches have been used with MDR including entropy analysis and pathway analysis. Tips and approaches for using MDR to model gene-gene interactions have been reviewed. == Extensions to MDR == Numerous extensions to MDR have been introduced. These include family-based methods, fuzzy methods, covariate adjustment, odds ratios, risk scores, survival methods, robust methods, methods for quantitative traits, and many others. == Applications of MDR == MDR has mostly been applied to detecting gene-gene interactions or epistasis in genetic studies of common human diseases such as atrial fibrillation, autism, bladder cancer, breast cancer, cardiovascular disease, hypertension, obesity, pancreatic cancer, prostate cancer and tuberculosis. It has also been applied to other biomedical problems such as the genetic analysis of pharmacology outcomes. A central challenge is the scaling of MDR to big data such as that from genome-wide association studies (GWAS). Several approaches have been used. One approach is to filter the features prior to MDR analysis. This can be done using biological knowledge through tools such as BioFilter. It can also be done using computational tools such as ReliefF. Another approach is to use stochastic search algorithms such as genetic programming to explore the search space of feature combinations. Yet another approach is a brute-force search using high-performance computing. == Implementations == www.epistasis.org provides an open-source and freely-available MDR software package. An R package for MDR. An sklearn-compatible Python implementation. An R package for Model-Based MDR. MDR in Weka. Generalized MDR.

    Read more →
  • Subvocal recognition

    Subvocal recognition

    Subvocal recognition (SVR) is the process of taking subvocalization and converting the detected results to a digital output, aural or text-based. A silent speech interface is a device that allows speech communication without using the sound made when people vocalize their speech sounds. It works by the computer identifying the phonemes that an individual pronounces from nonauditory sources of information about their speech movements. These are then used to recreate the speech using speech synthesis. == Input methods == Silent speech interface systems have been created using ultrasound and optical camera input of tongue and lip movements. Electromagnetic devices are another technique for tracking tongue and lip movements. The detection of speech movements by electromyography of speech articulator muscles and the larynx is another technique. Another source of information is the vocal tract resonance signals that get transmitted through bone conduction called non-audible murmurs. They have also been created as a brain–computer interface using brain activity in the motor cortex obtained from intracortical microelectrodes. == Uses == Such devices are created as aids to those unable to create the sound phonation needed for audible speech such as after laryngectomies. Another use is for communication when speech is masked by background noise or distorted by self-contained breathing apparatus. A further practical use is where a need exists for silent communication, such as when privacy is required in a public place, or hands-free data silent transmission is needed during a military or security operation. In 2002, the Japanese company NTT DoCoMo announced it had created a silent mobile phone using electromyography and imaging of lip movement. The company stated that "the spur to developing such a phone was ridding public places of noise," adding that, "the technology is also expected to help people who have permanently lost their voice." The feasibility of using silent speech interfaces for practical communication has since then been shown. In 2019, Arnav Kapur, a researcher from the Massachusetts Institute of Technology, conducted a study known as AlterEgo. Its implementation of the silent speech interface enables direct communication between the human brain and external devices through stimulation of the speech muscles. By leveraging neural signals associated with speech and language, the AlterEgo system deciphers the user's intended words and translates them into text or commands without the need for audible speech. == Research and patents == With a grant from the U.S. Army, research into synthetic telepathy using subvocalization is taking place at the University of California, Irvine under lead scientist Mike D'Zmura. NASA's Ames Research Laboratory in Mountain View, California, under the supervision of Charles Jorgensen is conducting subvocalization research. The Brain Computer Interface R&D program at Wadsworth Center under the New York State Department of Health has confirmed the existing ability to decipher consonants and vowels from imagined speech, which allows for brain-based communication using imagined speech, however using EEGs instead of subvocalization techniques. US Patents on silent communication technologies include: US Patent 6587729 "Apparatus for audibly communicating speech using the radio frequency hearing effect", US Patent 5159703 "Silent subliminal presentation system", US Patent 6011991 "Communication system and method including brain wave analysis and/or use of brain activity", US Patent 3951134 "Apparatus and method for remotely monitoring and altering brain waves". Latter two rely on brain wave analysis. == In fiction == The decoding of silent speech using a computer played an important role in Arthur C. Clarke's story and Stanley Kubrick's associated film A Space Odyssey. In this, HAL 9000, a computer controlling spaceship Discovery One, bound for Jupiter, discovers a plot to deactivate it by the mission astronauts Dave Bowman and Frank Poole through lip reading their conversations. In Orson Scott Card's series (including Ender's Game), the artificial intelligence can be spoken to while the protagonist wears a movement sensor in his jaw, enabling him to converse with the AI without making noise. He also wears an ear implant. In Speaker for the Dead and subsequent novels, author Orson Scott Card described an ear implant, called a "jewel", that allows subvocal communication with computer systems. Author Robert J. Sawyer made use of subvocal recognition to allow silent commands to the cybernetic 'companion implants' used by the advanced Neanderthal characters in his Neanderthal Parallax trilogy of science fiction novels. In Earth, David Brin depicts this technology and its uses as a normal gear in the near future. In Down and Out in the Magic Kingdom, Cory Doctorow has cellphone technology become silent through a cochlear implant and miking the throat to pick up subvocalization. William Gibson's Sprawl Trilogy frequently uses sub-vocalization systems in various devices. In Kage Baker's Company novels, the immortal cyborgs communicate subvocally. In the Hugo Award-winning Hyperion Cantos by Dan Simmons, the characters often use subvocalization to communicate. In the Culture novels by Iain M. Banks, more highly advanced species often communicate subvocally through their technology. In Deus Ex: Human Revolution (2011), the protagonist is augmented with a subvocalization implant for sending covert communications (and a corresponding cochlear implant for receiving covert communications). In the tabletop RPG and video game series Shadowrun, player characters can communicate via subvocal microphones in some instances. In Paranoia, all citizens can speak to the computer via their "cerebral cortech" implants. Alistair Reynolds Revelation Space trilogy frequently uses sub-vocalization systems in various devices.

    Read more →
  • Genotypic and phenotypic repair

    Genotypic and phenotypic repair

    Genotypic and phenotypic repair are optional components of an evolutionary algorithm (EA). An EA reproduces essential elements of biological evolution as a computer algorithm in order to solve demanding optimization or planning tasks, at least approximately. A candidate solution is represented by a - usually linear - data structure that plays the role of an individual's chromosome. New solution candidates are generated by mutation and crossover operators following the example of biology. These offspring may be defective, which is corrected or compensated for by genotypic or phenotypic repair. == Description == Genotypic repair, also known as genetic repair, is the removal or correction of impermissible entries in the chromosome that violate restrictions. In phenotypic repair, the corrections are only made in the genotype-phenotype mapping and the chromosome remains unchanged. Michalewicz wrote about the importance of restrictions in real-world applications: "In general, constraints are an integral part of the formulation of any problem". Restriction violations are application-specific and therefore it depends on the current problem whether and which type of repair is useful. They can usually also be treated by a correspondingly extended evaluation and it depends on the problem which measures are possible and which is the most suitable. If a phenotypic repair is feasible, then it is usually the most efficient compared to the other measures. A survey on repair methods used as constraint handling techniques can be found in. Violations of the range limits of genes should be avoided as far as possible by the formulation of the genome. If this is not possible or if restrictions within the search space defined by the genome are involved, their violations are usually handled by the evaluation. This can be done, for example, by penalty functions that lower the fitness. Repair is often also required for combinatorial tasks. The application of a 1- or n-point crossover operator can, for example, lead to genes being missing in one of the child genomes that are present in duplicate in the other. In this case, a suitable genotypic repair measure is to move the surplus genes to the other genome in a positional manner. The use of the aforementioned operators in combinatorial tasks has also proven to be useful in combination with crossover types specially developed for permutations, at least for certain problems. Particularly in combinatorial problems, it has been observed that genotypic repair can promote premature convergence to a suboptimum, but can also significantly accelerate a successful search. Studies on various tasks have shown that this is application-dependent. An effective measure to avoid premature convergence is generally the use of structured populations instead of the usual panmictic ones. Sequence restrictions play a role in many scheduling tasks, for example when it comes to planning workflows. If, for example, it is specified that step A must be carried out before step B and the gene of step B is located before the gene of A in the chromosome, then there is an impermissible gene sequence. This is because the scheduling operation of step B requires the planned end of step A for correct scheduling, but this is not yet scheduled at the time gene B is processed. The problem can be solved in two ways: The scheduling operation of step B is postponed until the gene from step A has been processed. The genome remains unchanged and the repair only influences the genotype-phenotype mapping. Since only the phenotype is changed, this is referred to as phenotypic repair. If, on the other hand, the gene of step B is moved behind the gene of step A, this is a genotypic repair. The same applies to the alternative shift of gene A in front of gene B. In this case, genotypic repair has the disadvantage that it prevents a meaningful restructuring of the gene sequence in the chromosome if this requires several intermediate steps (mutations) that at least partially violate restrictions.

    Read more →
  • Density-based clustering validation

    Density-based clustering validation

    Density-Based Clustering Validation (DBCV) is a metric designed to assess the quality of clustering solutions, particularly for density-based clustering algorithms like DBSCAN, Mean shift, and OPTICS. This metric is particularly suited for identifying concave and nested clusters, where traditional metrics such as the Silhouette coefficient, Davies–Bouldin index, or Calinski–Harabasz index often struggle to provide meaningful evaluations. Unlike traditional validation measures, which often rely on compact and well-separated clusters, DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence. This metric was introduced in 2014 by David Moulavi and colleagues in their work. It utilizes density connectivity principles to quantify clustering structures, making it especially effective at detecting arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable. The DBCV index has been employed for clustering analysis in bioinformatics, ecology, techno-economy, and health informatics , as well as in numerous other fields. == Definition == DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset X = x 1 , x 2 , . . . , x n {\displaystyle X={x_{1},x_{2},...,x_{n}}} , a density-based algorithm partitions it into K clusters C 1 , C 2 , . . . , C K {\displaystyle {C_{1},C_{2},...,C_{K}}} . Each point x i {\displaystyle x_{i}} belongs to a specific cluster, denoted as C c l u s t e r ( x i ) {\displaystyle C_{cluster(x_{i})}} A key concept in DBCV index is the notion of density-connected paths. Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The density-based distance between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory. DBCV index extends the Silhouette coefficient by redefining cluster cohesion and separation using density-based distances: Within-cluster density distance measures how closely a point is related to other members of its cluster: a i = 1 | C c l u s t e r ( x i ) | − 1 ∑ x j ∈ C c l u s t e r ( x i ) , y ≠ x d d e n s i t y ( x j , x i ) {\displaystyle a_{i}={\frac {1}{|C_{cluster(x_{i})}|-1}}\sum _{x_{j}\in C_{cluster(x_{i})},y\neq x}d_{density}(x_{j},x_{i})} Nearest-cluster density distance quantifies how far a point is from the closest external cluster: b i = min C ≠ C cluster ( x i ) C ∈ { C 1 , … , C K } ( 1 | C | ∑ x j ∈ C d density ( x i , x j ) ) . {\displaystyle b_{i}=\min _{C\neq C_{{\text{cluster}}(x_{i})} \atop C\in \{C_{1},\dots ,C_{K}\}}\left({\frac {1}{|C|}}\sum _{x_{j}\in C}d_{\text{density}}(x_{i},x_{j})\right).} Using these measures, the DBCV index is computed as: D B C V = 1 n ∑ i = 1 n b i − a i max ( a i , b i ) {\displaystyle DBCV={\frac {1}{n}}\sum _{i=1}^{n}{\frac {b_{i}-a_{i}}{\max(a_{i},b_{i})}}} == Explanation == DBCV index values range between −1 and +1: +1: Strongly cohesive and well-separated clusters. 0: Ambiguous clustering structure. −1: Poorly formed clusters or incorrect assignments. By leveraging density-based distances instead of traditional Euclidean measures, DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions.

    Read more →
  • Loss function

    Loss function

    In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy. In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th century. In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s. In optimal control, the loss is the penalty for failing to achieve a desired value. In financial risk management, the function is mapped to a monetary loss. == Examples == === Regret === Leonard J. Savage argued that using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret, i.e., the loss associated with a decision should be the difference between the consequences of the best decision that could have been made under circumstances will be known and the decision that was in fact taken before they were known. === Quadratic loss function === The use of a quadratic loss function is common, for example when using least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t {\displaystyle t} , then a quadratic loss function is λ ( x ) = C ( t − x ) 2 {\displaystyle \lambda (x)=C(t-x)^{2}\;} for some constant C {\displaystyle C} ; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1. This is also known as the squared error loss (SEL). Many common statistics, including t-tests, regression models, design of experiments, and much else, use least squares methods applied using linear regression theory, which is based on the quadratic loss function. The quadratic loss function is also used in linear-quadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a quadratic form in the deviations of the variables of interest from their desired values; this approach is tractable because it results in linear first-order conditions. In the context of stochastic control, the expected value of the quadratic form is used. The quadratic loss assigns more importance to outliers than to the true data due to its square nature, so alternatives like the Huber, log-cosh and SMAE losses are used when the data has many large outliers. === 0-1 loss function === In statistics and decision theory, a frequently used loss function is the 0-1 loss function L ( y ^ , y ) = { 0 if y = y ^ 1 if y ≠ y ^ {\displaystyle L({\hat {y}},y)={\begin{cases}0&{\text{if }}y={\hat {y}}\\1&{\text{if }}y\neq {\hat {y}}\end{cases}}} In information theory, this loss function is known as Hamming distortion. == Constructing loss and objective functions == In many applications, objective functions, including loss functions as a particular case, are determined by the problem formulation. In other situations, the decision maker’s preference must be elicited and represented by a scalar-valued function (called also utility function) in a form suitable for optimization — the problem that Ragnar Frisch has highlighted in his Nobel Prize lecture. The existing methods for constructing objective functions are collected in the proceedings of two dedicated conferences. In particular, Andranik Tangian showed that the most usable objective functions — quadratic and additive — are determined by a few indifference points. He used this property in the models for constructing these objective functions from either ordinal or cardinal data that were elicited through computer-assisted interviews with decision makers. Among other things, he constructed objective functions to optimally distribute budgets for 16 Westfalian universities and the European subsidies for equalizing unemployment rates among 271 German regions. == Expected loss == In some contexts, the value of the loss function itself is a random quantity because it depends on the outcome of a random variable X {\displaystyle X} . === Statistics === Both frequentist and Bayesian statistical theory involve making a decision based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms. ==== Frequentist expected loss ==== We first define the expected loss in the frequentist context. It is obtained by taking the expected value with respect to the probability distribution, P θ {\displaystyle P_{\theta }} , of the observed data, X {\displaystyle X} . This is also referred to as the risk function of the decision rule δ {\displaystyle \delta } and the parameter θ {\displaystyle \theta } . Here the decision rule depends on the outcome of X {\displaystyle X} . The risk function is given by: R ( θ , δ ) = E θ ⁡ L ( θ , δ ( X ) ) = ∫ X L ( θ , δ ( x ) ) d P θ ( x ) . {\displaystyle R(\theta ,\delta )=\operatorname {E} _{\theta }L{\big (}\theta ,\delta (X){\big )}=\int _{X}L{\big (}\theta ,\delta (x){\big )}\,\mathrm {d} P_{\theta }(x).} Here, θ {\displaystyle \theta } is a fixed but possibly unknown state of nature, X {\displaystyle X} is a vector of observations stochastically drawn from a population, E θ {\displaystyle \operatorname {E} _{\theta }} is the expectation over all population values of X {\displaystyle X} , d P θ {\displaystyle \mathrm {d} P_{\theta }} is a probability measure over the event space of X {\displaystyle X} (parametrized by θ {\displaystyle \theta } ) and the integral is evaluated over the entire support of X {\displaystyle X} . ==== Bayes Risk ==== In a Bayesian approach, the expectation is calculated using the prior distribution π ∗ {\displaystyle \pi ^{}} of the parameter θ {\displaystyle \theta } : ρ ( π ∗ , a ) = ∫ Θ ∫ X L ( θ , a ( x ) ) d P ( x | θ ) d π ∗ ( θ ) = ∫ X ∫ Θ L ( θ , a ( x ) ) d π ∗ ( θ | x ) d M ( x ) {\displaystyle \rho (\pi ^{},a)=\int _{\Theta }\int _{\mathbf {X}}L(\theta ,a({\mathbf {x}}))\,\mathrm {d} P({\mathbf {x}}\vert \theta )\,\mathrm {d} \pi ^{}(\theta )=\int _{\mathbf {X}}\int _{\Theta }L(\theta ,a({\mathbf {x}}))\,\mathrm {d} \pi ^{}(\theta \vert {\mathbf {x}})\,\mathrm {d} M({\mathbf {x}})} where M ( x ) {\displaystyle M(\mathbf {x} )} is known as the predictive likelihood wherein θ {\displaystyle \theta } has been "integrated out," π ∗ ( θ | x ) {\displaystyle \pi ^{}(\theta |\mathbf {x} )} is the posterior distribution, and the order of integration has been changed. One then should choose the action a ∗ {\displaystyle a^{}} which minimises this expected loss, which is referred to as Bayes Risk. In the latter equation, the integrand inside d x {\displaystyle \mathrm {d} x} is known as the Posterior Risk, and minimising it with respect to decision a {\displaystyle a} also minimizes the overall Bayes Risk. This optimal decision, a ∗ {\displaystyle a^{}} is known as the Bayes (decision) Rule - it minimises the average loss over all possible states of nature θ {\displaystyle \theta } , over all possible (probability-weighted) data outcomes. One advantage of the Bayesian approach is to that one need only choose the optimal action under the actual observed data to obtain a uniformly optimal one, whereas choosing the actual frequentist optimal decision rule as a function of all possible observations, is a much more difficult problem. Of equal importance though, the Bayes Rule reflects consideration of loss outcomes under different states of nature, θ {\displaystyle \theta } . ==== Examples in statistics ==== For a scalar parameter θ {\displaystyle \theta } , a decision function whose output θ ^ {\displaystyle {\hat {\theta }}} is an estimate of θ {\displaystyle \theta } , and a quadratic loss function (squared error loss) L ( θ , θ ^ ) = ( θ − θ ^ ) 2 , {\displaystyle L(\theta ,{\hat {\theta }})=(\theta -{\hat {\theta }})^{2},} the risk function becomes the mean squared error of the estimate, R ( θ , θ ^ ) = E θ ⁡ [ ( θ − θ ^ ) 2 ] . {\displaystyle R(\theta ,{\hat {\thet

    Read more →
  • Timeline of artificial intelligence risks in global finance

    Timeline of artificial intelligence risks in global finance

    The following article is a broad timeline of the course of events related to artificial intelligence risks in global finance. The AI boom has led to concerns including the existential risk from artificial intelligence, as the uptake on applications of artificial intelligence increases. By late 2025, global finance and artificial intelligence were "deeply intertwined". A June 2025 Menlo Ventures report raised concerns about the sustainability of future revenue and long-term profitability of AI, given the relatively low rate of consumer monetization. == 2017 == 30 NovemberThe New York Times said that new AI reports by McKinsey & Company, the National Bureau of Economic Research, and an AI Index created by university researchers, indicated an early AI boom. The Index built on a project—"The One Hundred Year Study on Artificial Intelligence" launched in 2014. == 2018 == 2018 was a year of incremental AI growth in finance. == 2022 == The release of ChatGPT by OpenAI became the catalyst for an artificial intelligence boom that continues to remake the global economy. According to a European Central Bank report, public interest in AI increased rapidly as evidenced with rising Google searches, AI jobs, models, patents, and innovations since late 2022. At that time Europe led the US in the size of its AI workforce. == 2023 == The regulatory body, the International Monetary Fund (IMF), published their report, "Generative Artificial Intelligence in Finance: Risk Considerations", drawing attention to oversight gaps and the need for regulations. The report explores the risks posed by using generative artificial intelligence (GenAI) systems in the financial sector including "broader risks to financial stability." == 2024 == January 12 In January 2024 Bloomberg's published its list of the "Magnificent Seven" Big Tech companies on the stock market based on their strength, size and market capitalization:Apple, Microsoft, Alphabet (Google), Amazon, Meta Platforms (Facebook), Nvidia, and Tesla. 21 June During the AI boom, Nvidia became the world's most valuable company, surpassing Microsoft, as its value increased to over US$4 trillion. In 2023 and 2024, the "Magnificent Seven" stocks were the primary drivers behind the increase in equity indexes, according to Reuters. == 2025 == === January === 23 January President Donald Trump's AI policy was announced calling for United States global leadership in artificial intelligence. The Economist noted that this politic shift in which the United States seeks "global dominance" in AI includes trimming regulations and assisting in expansion of infrastructure and increase in number of AI workers. Governments of Gulf nations were also investing trillions of dollars in AI. 27 January Against the backdrop of a tech war between China and the United States over AI dominance, within days of the launch of China's free DeepSeek App, it was the most downloaded app in the United States, rising to the first place in the Apple app store. President Trump responded immediately, saying this "sudden rise" should be a "wake-up" call to the United States, and called on US companies to be more competitive. === June === 26 June In their June 2025 report, Menlo Ventures estimated that only about 3% of consumers paid for artificial intelligence-related services, representing about $USD12 billion in annual spending. This is relatively low in contrast to the massive capital expenditure by AI infrastructure companies, which raises concerns about revenue sustainability and long-term profitability. === July === 23 July The Trump administration launched the US AI Action Plan, positioning the United States in a high-stakes technological race with China for global dominance in artificial intelligence, emphasizing that neither nation can afford to fall behind due to the exponential nature of AI advancement. The plan, a new government website and policy speech called for accelerated AI adoption across federal agencies, and a number of initiatives to make is easier for AI infrastructure expansion, and other measures to ensure American leadership in AI standards. Some leading experts warned that the administration failed to provide sufficient regulations and safeguards for AI safety. Concerns were raised about the negative impacts of cuts to research funding and tightened visa policies for scientists, potentially undermining public trust and America's ability to compete internationally. === September === 7 September The Economist cautioned that AI revenues are relatively modest compared to the high cost and investments in the creation of new data centers. Even Sam Altman, OpenAI CEO and one of the leading figures of the AI boom,, raised concerns about investors' outsized hopes for financial returns. At the same time, history has shown that new technologies, like railways and electricity, endured and spread after the initial hype faded. 12 September Economists warn that U.S. households' direct and indirect investments—mutual funds or retirement plans—in the stock market reached an unprecedented historically high level, now representing 45% of all financial assets, or about $USD51.2 trillion. Compared to the Dot-com bubble this represents a sharp increase in exposure. This makes U.S. households vulnerable to market downturns which in turn would result in decreasing consumer spending. U.S. household net worth rose to a record $176.3 trillion in the second quarter, an increase of $7.3 trillion since early 2025 and about $46 trillion higher than before the pandemic. Federal Reserve data attribute the surge primarily to gains in stock markets and housing values. However, the rise in wealth on paper coincided with increased household borrowing and growing government debt. 18 September Questions were being raised about how quickly the data centers, chips, servers, and GPUs assets of major AI companies will depreciate in value. Comparisons have been made to the Railway Mania in the aftermath of the stock market bubble where a valuable physical infrastructure remained standing, and the telecoms crash after the dot-com bubble which left fiber networks. 28 September There were warnings that record-high American stock ownership during the AI-fueled market boom is a red flag for systemic risk, as the current concentration in equities exceeds levels seen before the dot-com bubble burst in 2000, and could amplify the impact of any future stock market correction. === October === 3 October In 2025 alone, venture capitalists invested almost $USD200 billion in the artificial intelligence sector. 29 October Nvidia was the first company in the world to be valued at US$5 trillion, largely due to AI demand and strategic partnerships with leading technology and AI firms. Nvidia's increase in value was "meteoric". === November === 2 November Forbes reported that, since April, the 'Magnificent Seven' tech giants together contributed over 40% of the S&P 500's return, highlighting their outsized influence and the growing impact of AI on market valuations. CNN warned that while there is a current benefit to investors, with such a high concentration in the S&P 500, they are highly exposed to the fate of the Mag Seven. 2 November Globally there are 11,000 datacentres—huge campuses for AI infrastructure, including thousands of chips, GPUS, and servers. This represents a 500% increase over the last two decades. It is anticipated that $3USDtn more will be spent on increasing that number over the next two or three years. 5 November Concerns about the potential for a market bubble were raised as six of the AI-related Big Tech "Magnificent Seven"—that contribute to the AI boom—reported losing ground in the stock market. Global markets and artificial intelligence have become "deeply intertwined", according to a Reuters report. As of November 2025, more than 50% of the 20 largest S&P firms were deeply exposed to AI. In contrast, in 2000, the 20 S&P 500 firms represented 39% of its total value only 11 of these companies were exposed to the internet. If AI fails to deliver strong returns on their investments, these top S&P firms would be significantly impacted, according to the Economist. Analysts suggest that the AI market in 2025 may not behave like a traditional one, as investors are simultaneously aware of the risks and driven by the potential for outsized rewards. Leading AI labs may believe that the first company to achieve artificial general intelligence (AGI), when an AI system surpasses all human cognitive abilities and becomes capable of self-improvement—could dominate the future of technology and finance. While some have estimated that the potential value of such a breakthrough could be as high as $1.46 quadrillion, this figure is speculative and widely debated. 5 November Bloomberg described Nvidia's H100 Hopper-Blackwell AI chips as the "King of AI chips". Nvidia dominates the AI chip market with over 78% of the market share because of both speed and cost. According to B

    Read more →
  • Feature selection

    Feature selection

    In machine learning, feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for several reasons: simplification of models to make them easier to interpret, shorter training times, to avoid the curse of dimensionality, improve the compatibility of the data with a certain learning model class, to encode inherent symmetries present in the input space. The central premise when using feature selection is that data sometimes contains features that are redundant or irrelevant, and can thus be removed without incurring much loss of information. Redundancy and irrelevance are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated. Feature extraction creates new features from functions of the original features, whereas feature selection finds a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (data points). == Introduction == A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods. Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model or typical problem. Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include the mutual information, the pointwise mutual information, Pearson product-moment correlation coefficient, Relief-based algorithms, and inter/intra class distance or the scores of significance tests for each class/feature combinations. Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. This lack of tuning means a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance than a wrapper. However the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off point in the ranking is chosen via cross-validation. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems. One other popular approach is the Recursive Feature Elimination algorithm, commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights. Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso which bootstraps samples; Elastic net regularization, which combines the L1 penalty of LASSO with the L2 penalty of ridge regression; and FeaLect which scores all the features based on combinatorial analysis of regression coefficients. AEFS further extends LASSO to nonlinear scenario with autoencoders. These approaches tend to be between filters and wrappers in terms of computational complexity. In traditional regression analysis, the most popular form of feature selection is stepwise regression, which is a wrapper technique. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear network. == Subset selection == Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into wrappers, filters, and embedded methods. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of over fitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model. Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc. Alternative search-based techniques are based on targeted projection pursuit which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected. Search approaches include: Exhaustive Best first Simulated annealing Genetic algorithm Greedy forward selection Greedy backward elimination Particle swarm optimization Targeted projection pursuit Scatter search Variable neighborhood search Two popular filter metrics for classification problems are correlation and mutual information, although neither are true metrics or 'distance measures' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information; see here. Other available filter metrics include: Class separability Error probability Inter-class distance Probabilistic distance Entropy Consistency-based feature selection Correlation-based feature selection == Optimality criteria == The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common criteria incorporate a measure of accuracy, penalised by the number of features selected. Examples include Akaike information criterion (AIC) and Mallows's Cp, which have a penalty of 2 for each added feature. AIC is based on information theory, and is effectively derived via the maximum entropy principle. Other criteria are Bayesian information criterion (BIC), which uses a penalty of log ⁡ n {\displaystyle {\sqrt {\log {n}}}} for each added feature, minimum description length (MDL) which asymptotically uses log ⁡ n {\displaystyle {\sqrt {\log {n}}}} , Bonferroni / RIC which use 2 log ⁡ p {\displaystyle {\sqrt {2\log {p}}}} , maximum dependency feature selection, and a variety of new criteria that are motivated by false discovery rate (FDR), which use something close to 2 log ⁡ p q {\displaystyle {\sqrt {2\log {\frac {p}{q}}}}} . A maximum entropy rate criterion may also be used to select the most relevant subset of features. == Structure learning == Filter feature selection is a specific case of a more general paradigm called structure learning. Feature selection finds the relevant feature set for a specific target variable whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms

    Read more →
  • Margin classifier

    Margin classifier

    In machine learning (ML), a margin classifier is a type of classification model which is able to give an associated distance from the decision boundary for each data sample. For instance, if a linear classifier is used, the distance (typically Euclidean, though others may be used) of a sample from the separating hyperplane is the margin of that sample. The notion of margins is important in several ML classification algorithms, as it can be used to bound the generalization error of these classifiers. These bounds are frequently shown using the VC dimension. The generalization error bound in boosting algorithms and support vector machines is particularly prominent. == Margin for boosting algorithms == The margin for an iterative boosting algorithm given a dataset with two classes can be defined as follows: the classifier is given a sample pair ( x , y ) {\displaystyle (x,y)} , where x ∈ X {\displaystyle x\in X} is a domain space and y ∈ Y = { − 1 , + 1 } {\displaystyle y\in Y=\{-1,+1\}} is the sample's label. The algorithm then selects a classifier h j ∈ C {\displaystyle h_{j}\in C} at each iteration j {\displaystyle j} where C {\displaystyle C} is a space of possible classifiers that predict real values. This hypothesis is then weighted by α j ∈ R {\displaystyle \alpha _{j}\in R} as selected by the boosting algorithm. At iteration t {\displaystyle t} , the margin of a sample x {\displaystyle x} can thus be defined as y ∑ j t α j h j ( x ) ∑ | α j | . {\displaystyle {\frac {y\sum _{j}^{t}\alpha _{j}h_{j}(x)}{\sum |\alpha _{j}|}}.} By this definition, the margin is positive if the sample is labeled correctly, or negative if the sample is labeled incorrectly. This definition may be modified and is not the only way to define the margin for boosting algorithms. However, there are reasons why this definition may be appealing. == Examples of margin-based algorithms == Many classifiers can give an associated margin for each sample. However, only some classifiers utilize information of the margin while learning from a dataset. Many boosting algorithms rely on the notion of a margin to assign weight to samples. If a convex loss is utilized (as in AdaBoost or LogitBoost, for instance) then a sample with a higher margin will receive less (or equal) weight than a sample with a lower margin. This leads the boosting algorithm to focus weight on low-margin samples. In non-convex algorithms (e.g., BrownBoost), the margin still dictates the weighting of a sample, though the weighting is non-monotone with respect to the margin. == Generalization error bounds == One theoretical motivation behind margin classifiers is that their generalization error may be bound by the algorithm parameters and a margin term. An example of such a bound is for the AdaBoost algorithm. Let S {\displaystyle S} be a set of m {\displaystyle m} data points, sampled independently at random from a distribution D {\displaystyle D} . Assume the VC-dimension of the underlying base classifier is d {\displaystyle d} and m ≥ d ≥ 1 {\displaystyle m\geq d\geq 1} . Then, with probability 1 − δ {\displaystyle 1-\delta } , we have the bound: P D ( y ∑ j t α j h j ( x ) ∑ | α j | ≤ 0 ) ≤ P S ( y ∑ j t α j h j ( x ) ∑ | α j | ≤ θ ) + O ( 1 m d log 2 ⁡ ( m / d ) / θ 2 + log ⁡ ( 1 / δ ) ) {\displaystyle P_{D}\left({\frac {y\sum _{j}^{t}\alpha _{j}h_{j}(x)}{\sum |\alpha _{j}|}}\leq 0\right)\leq P_{S}\left({\frac {y\sum _{j}^{t}\alpha _{j}h_{j}(x)}{\sum |\alpha _{j}|}}\leq \theta \right)+O\left({\frac {1}{\sqrt {m}}}{\sqrt {d\log ^{2}(m/d)/\theta ^{2}+\log(1/\delta )}}\right)} for all θ > 0 {\displaystyle \theta >0} .

    Read more →