AI Code Janitor

AI Code Janitor — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Zoho Office Suite

    Zoho Office Suite is an online office suite developed by Zoho Corporation.

    == History ==

    Zoho Office Suite was launched in 2005 with a web-based word processor. Additional products, such as those for spreadsheets and presentations, were later incorporated into the suite. The applications are distributed as software as a service (SaaS).

    == Products ==

    Zoho uses an open API for its Writer, Sheet, Show, Creator, Meeting, and Planner products. It also has plugins for Microsoft Word and Excel, an OpenOffice.org plugin, and a plugin for Firefox. Zoho Office Suite is free for individuals but offers a plan for teams, which includes Zoho WorkDrive, Zoho Workplace and other Zoho apps. In October 2009, Zoho integrated some of its applications with the Google Apps online suite.

  • Multidimensional scaling

    Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a data set. MDS is used to translate distances between each pair of $n$ objects in a set into a configuration of $n$ points mapped into an abstract Cartesian space.

    More technically, MDS refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix. It is a form of non-linear dimensionality reduction.

    Given a distance matrix with the distances between each pair of objects in a set, and a chosen number of dimensions, $N$, an MDS algorithm places each object into $N$-dimensional space (a lower-dimensional representation) such that the between-object distances are preserved as well as possible. For $N = 1$, 2, and 3, the resulting points can be visualized on a scatter plot.

    Core theoretical contributions to MDS were made by James O. Ramsay of McGill University, who is also regarded as the founder of functional data analysis.

    == Types ==

    MDS algorithms fall into a taxonomy, depending on the meaning of the input matrix:

    === Classical multidimensional scaling ===

    Classical MDS is also known as Principal Coordinates Analysis (PCoA), Torgerson Scaling or Torgerson–Gower scaling. It takes an input matrix giving dissimilarities between pairs of items and outputs a coordinate matrix whose configuration minimizes a loss function called strain, which is given by

    $$\text{Strain}_D(x_1, x_2, \dots, x_n) = \left( \frac{\sum_{i,j} \bigl( b_{ij} - x_i^T x_j \bigr)^2}{\sum_{i,j} b_{ij}^2} \right)^{1/2},$$

    where $x_i$ denote vectors in $N$-dimensional space, $x_i^T x_j$ denotes the scalar product between $x_i$ and $x_j$, and $b_{ij}$ are the elements of the matrix $B$ defined in step 2 of the following algorithm, which are computed from the distances.

    Classical MDS uses the fact that the coordinate matrix $X$ can be derived by eigenvalue decomposition from $B = XX'$, and that the matrix $B$ can be computed from the proximity matrix $D$ by using double centering. The steps of a classical MDS algorithm are:

      1. Set up the squared proximity matrix $D^{(2)} = [d_{ij}^2]$.
      2. Apply double centering: $B = -\frac{1}{2} C D^{(2)} C$ using the centering matrix $C = I - \frac{1}{n} J_n$, where $n$ is the number of objects, $I$ is the $n \times n$ identity matrix, and $J_n$ is an $n \times n$ matrix of all ones.
      3. Determine the $m$ largest eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_m$ and corresponding eigenvectors $e_1, e_2, \dots, e_m$ of $B$ (where $m$ is the number of dimensions desired for the output).
      4. Now, $X = E_m \Lambda_m^{1/2}$, where $E_m$ is the matrix of $m$ eigenvectors and $\Lambda_m$ is the diagonal matrix of $m$ eigenvalues of $B$.

    Classical MDS assumes metric distances.
    It is therefore not applicable to direct dissimilarity ratings.

    === Metric multidimensional scaling (mMDS) ===

    Metric MDS is a superset of classical MDS that generalizes the optimization procedure to a variety of loss functions and input matrices of known distances with weights and so on. A useful loss function in this context is called stress, which is often minimized using a procedure called stress majorization. Metric MDS minimizes the cost function called "stress", which is a residual sum of squares:

    $$\text{Stress}_D(x_1, x_2, \dots, x_n) = \sqrt{\sum_{i \neq j = 1, \dots, n} \bigl( d_{ij} - \|x_i - x_j\| \bigr)^2}.$$

    Metric scaling uses a power transformation with a user-controlled exponent $p$: $d_{ij}^p$ and $-d_{ij}^{2p}$ for distance. In classical scaling $p = 1$. Non-metric scaling is defined by the use of isotonic regression to nonparametrically estimate a transformation of the dissimilarities.

    === Non-metric multidimensional scaling (NMDS) ===

    In contrast to metric MDS, non-metric MDS finds both a non-parametric monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distances between items, and the location of each item in the low-dimensional space.

    Let $d_{ij}$ be the dissimilarity between points $i, j$. Let $\hat{d}_{ij} = \|x_i - x_j\|$ be the Euclidean distance between embedded points $x_i, x_j$. Now, for each choice of the embedded points $x_i$ and each monotonically increasing function $f$, define the "stress" function:

    $$S(x_1, \dots, x_n; f) = \sqrt{\frac{\sum_{i<j} \bigl( f(d_{ij}) - \hat{d}_{ij} \bigr)^2}{\sum_{i<j} \hat{d}_{ij}^2}}.$$
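
    The classical MDS steps described above (squaring, double centering, taking the leading eigenpairs, scaling by the square roots of the eigenvalues) can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the three-point test configuration, and the use of power iteration with deflation in place of a full eigenvalue decomposition are all assumptions made for the example, and the eigen-solver assumes the centered matrix is positive semidefinite (as it is for Euclidean distances).

```python
import math
import random


def classical_mds(dist, m, iters=500, seed=0):
    """Embed n objects in m dimensions from an n x n distance matrix."""
    n = len(dist)
    # Step 1: squared proximity matrix D^(2).
    d2 = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    # Step 2: double centering B = -1/2 * C * D^(2) * C, written element-wise.
    row = [sum(d2[i]) / n for i in range(n)]
    col = [sum(d2[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    b = [[-0.5 * (d2[i][j] - row[i] - col[j] + grand) for j in range(n)]
         for i in range(n)]
    # Steps 3 and 4: leading eigenpairs by power iteration with deflation
    # (an illustrative substitute for a library eigendecomposition).
    rng = random.Random(seed)
    coords = [[0.0] * m for _ in range(n)]
    for k in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to extract
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]  # column k of X = E_m * Lambda_m^(1/2)
        for i in range(n):  # deflate: subtract the eigenpair just found
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


def stress(dist, coords):
    """Raw stress: root of the summed squared distance residuals over i != j."""
    n = len(dist)
    return math.sqrt(sum((dist[i][j] - math.dist(coords[i], coords[j])) ** 2
                         for i in range(n) for j in range(n) if i != j))


# Hypothetical input: pairwise distances of a 3-4-5 right triangle.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
D = [[math.dist(p, q) for q in points] for p in points]
X = classical_mds(D, m=2)
print(X, stress(D, X))
```

    Because the input distances here are exactly Euclidean in two dimensions, the recovered configuration should reproduce them up to rotation and reflection, so the stress is expected to be close to zero. For real dissimilarity data the stress is generally positive.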

  • Correlation clustering

    Clustering is the problem of partitioning data points into groups based on similarity or dissimilarity. Correlation clustering is a clustering framework in which a set of objects is partitioned into clusters based on pairwise similarity and dissimilarity information, without requiring the number of clusters to be specified in advance.

    == Description of the problem ==

    In machine learning, correlation clustering (also known as cluster editing) considers settings in which pairwise similarity or dissimilarity relationships between objects are known. A standard formulation models the input as an unweighted complete graph $G = (V, E)$, where each edge is labeled either $+$ or $-$ (that is, the graph is a signed graph), indicating whether the corresponding endpoints are similar or dissimilar. The goal is to find a clustering (that is, a partition of $V$) that either maximizes the number of agreements (positive edges whose endpoints lie in the same cluster plus negative edges whose endpoints lie in different clusters) or minimizes the number of disagreements (positive edges whose endpoints are separated plus negative edges whose endpoints lie in the same cluster). Unlike other clustering methods such as k-means, correlation clustering does not require choosing the number of clusters $k$ in advance.

    It is not always possible to find a clustering with zero disagreements. For example, consider a triangle graph containing two positive edges and one negative edge. In this case, every clustering incurs at least one disagreement. Such configurations are referred to in the literature as bad triangles.

    From a computational perspective, optimizing the correlation clustering objective is challenging. The (decision version of the) problem is NP-complete.
    A large body of subsequent work has developed approximation algorithms for correlation clustering under various assumptions, including complete or general graphs and unweighted or weighted graphs, for both minimization and maximization objectives. This problem is considered one of the fundamental combinatorial optimization problems, and many algorithmic techniques have been developed to address it. The problem has also been studied extensively across multiple disciplines. A comprehensive literature review of early correlation clustering research is provided by Wahid and Hassini.

    == Formal Definitions ==

    Let $G = (V, E)$ be a graph with nodes $V$ and edges $E$. A clustering of $G$ is a partition of its node set $\Pi = \{\pi_1, \dots, \pi_k\}$ with $V = \pi_1 \cup \dots \cup \pi_k$ and $\pi_i \cap \pi_j = \emptyset$ for $i \neq j$. For a given clustering $\Pi$, let

    $$\delta(\Pi) = \{\{u, v\} \in E \mid \{u, v\} \not\subseteq \pi \;\forall \pi \in \Pi\}$$

    denote the subset of edges of $G$ whose endpoints are in different subsets of the clustering $\Pi$. Now, let $w \colon E \to \mathbb{R}_{\geq 0}$ be a function that assigns a non-negative weight to each edge of the graph and let $E = E^{+} \cup E^{-}$ be a partition of the edges into attractive ($E^{+}$) and repulsive ($E^{-}$) edges; that is, the edges are signed.

    The minimum disagreement correlation clustering problem is the following optimization problem:

    $$\underset{\Pi}{\operatorname{minimize}} \quad \sum_{e \in E^{+} \cap \delta(\Pi)} w_e + \sum_{e \in E^{-} \setminus \delta(\Pi)} w_e.$$

    Here, the set $E^{+} \cap \delta(\Pi)$ contains the attractive edges whose endpoints are in different components with respect to the clustering $\Pi$ and the set $E^{-} \setminus \delta(\Pi)$ contains the repulsive edges whose endpoints are in the same component with respect to the clustering $\Pi$. Together these two sets contain all edges that disagree with the clustering $\Pi$.

    Similarly to the minimum disagreement correlation clustering problem, the maximum agreement correlation clustering problem is defined as

    $$\underset{\Pi}{\operatorname{maximize}} \quad \sum_{e \in E^{+} \setminus \delta(\Pi)} w_e + \sum_{e \in E^{-} \cap \delta(\Pi)} w_e.$$

    Here, the set $E^{+} \setminus \delta(\Pi)$ contains the attractive edges whose endpoints are in the same component with respect to the clustering $\Pi$ and the set $E^{-} \cap \delta(\Pi)$ contains the repulsive edges whose endpoints are in different components with respect to the clustering $\Pi$. Together these two sets contain all edges that agree with the clustering $\Pi$.

    Instead of formulating the correlation clustering problem in terms of non-negative edge weights and a partition of the edges into attractive and repulsive edges, the problem is also formulated in terms of positive and negative edge costs without partitioning the set of edges explicitly.
    For given weights $w \colon E \to \mathbb{R}_{\geq 0}$ and a given partition $E = E^{+} \cup E^{-}$ of the edges into attractive and repulsive edges, the edge costs can be defined by

    $$c_e = \begin{cases} \;\;w_e & \text{if } e \in E^{+} \\ -w_e & \text{if } e \in E^{-} \end{cases}$$

    for all $e \in E$.

    An edge whose endpoints are in different clusters is said to be cut. The set $\delta(\Pi)$ of all edges that are cut is often called a multicut of $G$.

    The minimum cost multicut problem is the problem of finding a clustering $\Pi$ of $G$ such that the sum of the costs of the edges whose endpoints are in different clusters is minimal:

    $$\underset{\Pi}{\operatorname{minimize}} \quad \sum_{e \in \delta(\Pi)} c_e.$$

    Similar to the minimum cost multicut problem, coalition structure generation in weighted graph games is the problem of finding a clustering such that the sum of the costs of the edges that are not cut is maximal:

    $$\underset{\Pi}{\operatorname{maximize}} \quad \sum_{e \in E \setminus \delta(\Pi)} c_e.$$

    This formulation is also known as the clique partitioning problem.

    It can be shown that all four problems formulated above are equivalent. This means that a clustering that is optimal with respect to any of the four objectives is optimal for all four.

    == Algorithms ==

    If the graph admits a clustering with zero disagreements, then deleting all negative edges and computing the connected components of the remaining graph yields an optimal clustering.
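
    The connected-components procedure just described can be sketched as follows for the unweighted case. The helper names, the union-find bookkeeping, and both example graphs are assumptions made for illustration. The second graph is the bad triangle from the problem description, for which no clustering reaches zero disagreements.

```python
def positive_components(n, positive):
    """Label nodes 0..n-1 by connected component of the positive-edge graph."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in positive:
        parent[find(u)] = find(v)  # merge the two components
    return [find(x) for x in range(n)]


def disagreements(labels, positive, negative):
    """Count positive edges that are cut plus negative edges that are not."""
    cut_positive = sum(1 for u, v in positive if labels[u] != labels[v])
    uncut_negative = sum(1 for u, v in negative if labels[u] == labels[v])
    return cut_positive + uncut_negative


# Hypothetical consistent instance: two groups, negative edges only across them.
pos_a, neg_a = [(0, 1), (1, 2), (3, 4)], [(0, 3), (2, 4)]
labels_a = positive_components(5, pos_a)
print(labels_a, disagreements(labels_a, pos_a, neg_a))

# Bad triangle: two positive edges and one negative edge.
pos_b, neg_b = [(0, 1), (1, 2)], [(0, 2)]
labels_b = positive_components(3, pos_b)
print(labels_b, disagreements(labels_b, pos_b, neg_b))
```

    On the first instance the components are a clustering with no disagreements. On the bad triangle the same procedure merges all three nodes and leaves one negative edge inside a cluster, which shows why it is only guaranteed to be optimal when a zero-disagreement clustering exists.
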
    A necessary and sufficient condition for the existence of such a clustering was given by Davis: no cycle in the graph may contain exactly one negative edge.

    Bansal et al. discuss the NP-completeness proof and also present both a constant factor approximation algorithm and a polynomial-time approximation scheme to find the clusters in this setting. Ailon et al. propose a randomized 3-approximation algorithm for the same problem.

      CC-Pivot(G=(V,E+,E−))
          Pick random pivot i ∈ V
          Set C = {i}, V' = Ø
          For all j ∈ V, j ≠ i:
              If (i,j) ∈ E+ then
                  Add j to C
              Else (if (i,j) ∈ E−)
                  Add j to V'
          Let G' be the subgraph induced by V'
          Return clustering C, CC-Pivot(G')

    The authors show that the above algorithm is a 3-approximation algorithm for correlation clustering. The best polynomial-time approximation algorithm known at the moment for this problem achieves a ~2.06 approximation by rounding a linear program, as shown by Chawla, Makarychev, Schramm, and Yaroslavtsev.

    Karpinski and Schudy proved the existence of a polynomial time approximation scheme (PTAS) for that problem on complete graphs and a fixed number of clusters.

    == Optimal number of clusters ==

    In 2011, it was shown by Bagon and Galun that the optimization of the correlation clustering functional is closely related to well known discrete optimization methods. In their work they proposed a probabilistic analysis of the underlying implicit model that allows the correlation clustering functional to estimate the
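
    As a rough illustration of the CC-Pivot procedure quoted in the Algorithms section, the sketch below implements it for a complete signed graph in Python. It is written as a loop rather than a recursion, which is a presentational choice of this example. The representation of positive edges as a set of frozensets (with every other pair treated as negative), the function names, and the example graphs are all assumptions.

```python
import random
from itertools import combinations


def cc_pivot(nodes, positive, rng):
    """Iterative form of CC-Pivot on a complete signed graph.

    `positive` is a set of frozenset({u, v}) pairs; every other pair of
    nodes is treated as a negative edge.
    """
    remaining = list(nodes)
    clusters = []
    while remaining:
        pivot = rng.choice(remaining)  # pick a random pivot i
        cluster = {pivot} | {j for j in remaining
                             if j != pivot and frozenset((pivot, j)) in positive}
        clusters.append(sorted(cluster))
        # The nodes joined to the pivot by negative edges form V'.
        remaining = [j for j in remaining if j not in cluster]
    return clusters


def count_disagreements(nodes, positive, clusters):
    """Number of node pairs whose sign contradicts the clustering."""
    label = {v: k for k, cluster in enumerate(clusters) for v in cluster}
    total = 0
    for u, v in combinations(nodes, 2):
        same_cluster = label[u] == label[v]
        is_positive = frozenset((u, v)) in positive
        if is_positive != same_cluster:
            total += 1
    return total


# Hypothetical instance: two positive cliques, negative edges between them.
cliques = {frozenset(p) for p in [(0, 1), (0, 2), (1, 2), (3, 4)]}
result = cc_pivot(range(5), cliques, random.Random(0))
print(sorted(result), count_disagreements(range(5), cliques, result))
```

    When the positive edges form disjoint cliques, every pivot choice recovers those cliques exactly. On inputs with bad triangles the output depends on the random pivots, and the 3-approximation guarantee cited above holds in expectation rather than for every run.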

  • Influence diagram

    An influence diagram (ID) (also called a relevance diagram, decision diagram or a decision network) is a compact graphical and mathematical representation of a decision situation. It is a generalization of a Bayesian network, in which not only probabilistic inference problems but also decision making problems (following the maximum expected utility criterion) can be modeled and solved.

    The ID was first developed in the mid-1970s by decision analysts with an intuitive semantic that is easy to understand. It is now widely adopted and is becoming an alternative to the decision tree, which typically suffers from exponential growth in the number of branches with each variable modeled. The ID is directly applicable in team decision analysis, since it allows incomplete sharing of information among team members to be modeled and solved explicitly. Extensions of the ID also find use in game theory as an alternative representation of the game tree.

    == Semantics ==

    An ID is a directed acyclic graph with three types (plus one subtype) of node and three types of arc (or arrow) between nodes.

    Nodes:

      • Decision node (corresponding to each decision to be made) is drawn as a rectangle.
      • Uncertainty node (corresponding to each uncertainty to be modeled) is drawn as an oval.
      • Deterministic node (corresponding to a special kind of uncertainty whose outcome is deterministically known whenever the outcomes of some other uncertainties are also known) is drawn as a double oval.
      • Value node (corresponding to each component of an additively separable Von Neumann-Morgenstern utility function) is drawn as an octagon (or diamond).

    Arcs:

      • Functional arcs (ending in a value node) indicate that one of the components of the additively separable utility function is a function of all the nodes at their tails.
      • Conditional arcs (ending in an uncertainty node) indicate that the uncertainty at their heads is probabilistically conditioned on all the nodes at their tails.
      • Conditional arcs (ending in a deterministic node) indicate that the uncertainty at their heads is deterministically conditioned on all the nodes at their tails.
      • Informational arcs (ending in a decision node) indicate that the decision at their heads is made with the outcome of all the nodes at their tails known beforehand.

    Given a properly structured ID:

      • Decision nodes and incoming informational arcs collectively state the alternatives (what can be done when the outcomes of certain decisions and/or uncertainties are known beforehand).
      • Uncertainty/deterministic nodes and incoming conditional arcs collectively model the information (what is known and the probabilistic/deterministic relationships).
      • Value nodes and incoming functional arcs collectively quantify the preference (how things are preferred over one another).

    Alternative, information, and preference are termed the decision basis in decision analysis; they represent the three required components of any valid decision situation.

    Formally, the semantic of an influence diagram is based on sequential construction of nodes and arcs, which implies a specification of all conditional independencies in the diagram. The specification is defined by the $d$-separation criterion of Bayesian networks. According to this semantic, every node is probabilistically independent of its non-successor nodes given the outcome of its immediate predecessor nodes. Likewise, a missing arc between non-value node $X$ and non-value node $Y$ implies that there exists a set of non-value nodes $Z$, e.g., the parents of $Y$, that renders $Y$ independent of $X$ given the outcome of the nodes in $Z$.

    == Example ==

    Consider the simple influence diagram representing a situation where a decision-maker is planning their vacation.
    There is 1 decision node (Vacation Activity), 2 uncertainty nodes (Weather Condition, Weather Forecast), and 1 value node (Satisfaction). There are 2 functional arcs (ending in Satisfaction), 1 conditional arc (ending in Weather Forecast), and 1 informational arc (ending in Vacation Activity).

      • Functional arcs ending in Satisfaction indicate that Satisfaction is a utility function of Weather Condition and Vacation Activity. In other words, their satisfaction can be quantified if they know what the weather is like and what their choice of activity is. (Note that they do not value Weather Forecast directly.)
      • The conditional arc ending in Weather Forecast indicates their belief that Weather Forecast and Weather Condition can be dependent.
      • The informational arc ending in Vacation Activity indicates that they will only know Weather Forecast, not Weather Condition, when making their choice. In other words, the actual weather will be known after they make their choice, and the forecast is all they can count on at this stage.

    It also follows semantically, for example, that Vacation Activity is independent of (irrelevant to) Weather Condition given that Weather Forecast is known.

    == Applicability to value of information ==

    The above example highlights the power of the influence diagram in representing an extremely important concept in decision analysis known as the value of information. Consider the following three scenarios:

      • Scenario 1: The decision-maker could make their Vacation Activity decision while knowing what Weather Condition will be like. This corresponds to adding an extra informational arc from Weather Condition to Vacation Activity in the above influence diagram.
      • Scenario 2: The original influence diagram as shown above.
      • Scenario 3: The decision-maker makes their decision without even knowing the Weather Forecast. This corresponds to removing the informational arc from Weather Forecast to Vacation Activity in the above influence diagram.
    Scenario 1 is the best possible scenario for this decision situation since there is no longer any uncertainty about what they care about (Weather Condition) when making their decision. Scenario 3, however, is the worst possible scenario for this decision situation since they need to make their decision without any hint (Weather Forecast) of how what they care about (Weather Condition) will turn out.

    The decision-maker is usually better off (and definitely no worse off, on average) moving from scenario 3 to scenario 2 through the acquisition of new information. The most they should be willing to pay for such a move is called the value of information on Weather Forecast, which is essentially the value of imperfect information on Weather Condition.

    This simple ID and the value of information concept are widely applicable, especially in medical decision making, where most decisions have to be made with imperfect information about patients, diseases, etc.

    == Related concepts ==

    Influence diagrams are hierarchical and can be defined either in terms of their structure or in greater detail in terms of the functional and numerical relation between diagram elements. An ID that is consistently defined at all levels (structure, function, and number) is a well-defined mathematical representation and is referred to as a well-formed influence diagram (WFID). WFIDs can be evaluated using reversal and removal operations to yield answers to a large class of probabilistic, inferential, and decision questions. More recent techniques have been developed by artificial intelligence researchers concerning Bayesian network inference (belief propagation).

    An influence diagram having only uncertainty nodes (i.e., a Bayesian network) is also called a relevance diagram. An arc connecting node A to B implies not only that "A is relevant to B", but also that "B is relevant to A" (i.e., relevance is a symmetric relationship).
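
    To make the value of information in the vacation example concrete, the following sketch evaluates the three scenarios numerically. All numbers (the prior on Weather Condition, the forecast accuracy, and the Satisfaction values) are hypothetical and chosen only for illustration, as are the variable and function names. The source gives no figures for this example.

```python
P_SUNNY = 0.6      # hypothetical prior on Weather Condition
P_CORRECT = 0.8    # hypothetical probability that the forecast is right
UTILITY = {        # hypothetical Satisfaction for (activity, weather)
    ("outdoor", "sunny"): 100, ("outdoor", "rainy"): 0,
    ("indoor", "sunny"): 40, ("indoor", "rainy"): 50,
}
ACTIVITIES = ("outdoor", "indoor")
WEATHER = {"sunny": P_SUNNY, "rainy": 1 - P_SUNNY}


def joint(forecast, weather):
    """P(Weather Forecast = forecast, Weather Condition = weather)."""
    likelihood = P_CORRECT if forecast == weather else 1 - P_CORRECT
    return WEATHER[weather] * likelihood


def best(weights):
    """Highest weighted utility over activities for the given weather weights."""
    return max(sum(p * UTILITY[(a, w)] for w, p in weights.items())
               for a in ACTIVITIES)


# Scenario 3: decide with no information at all.
eu_no_info = best(WEATHER)
# Scenario 2: decide after seeing the forecast (sum over forecast outcomes).
eu_forecast = sum(best({w: joint(f, w) for w in WEATHER}) for f in WEATHER)
# Scenario 1: decide knowing the actual weather.
eu_perfect = sum(p * best({w: 1.0}) for w, p in WEATHER.items())

print("value of the forecast:", eu_forecast - eu_no_info)
print("value of perfect information:", eu_perfect - eu_no_info)
```

    Under these assumed numbers the expected utilities are 60 with no information, 68.8 with the forecast, and 80 with perfect information, so the forecast is worth at most 8.8 units of satisfaction and perfect knowledge of the weather at most 20. The ordering matches the argument above that scenario 2 is never worse than scenario 3 and never better than scenario 1.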

  • Document classification

    Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification.

    The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied.

    Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: the content-based approach and the request-based approach.

    == "Content-based" versus "request-based" classification ==

    Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. It is, for example, a common rule for classification in libraries that at least 20% of the content of a book should be about the class to which the book is assigned. In automatic classification it could be the number of times given words appear in a document.

    Request-oriented classification (or -indexing) is classification in which the anticipated request from users influences how documents are classified. The classifier asks themself: "Under which descriptors should this entity be found?" and "think of all the possible queries and decide for which ones the entity at hand is relevant" (Soergel, 1985, p. 230).

    Request-oriented classification may be classification that is targeted towards a particular audience or user group. For example, a library or a database for feminist studies may classify/index documents differently when compared to a historical library. It is probably better, however, to understand request-oriented classification as policy-based classification: the classification is done according to some ideals and reflects the purpose of the library or database doing the classification. In this way it is not necessarily a kind of classification or indexing based on user studies. Only if empirical data about use or users are applied should request-oriented classification be regarded as a user-based approach.

    == Classification versus indexing ==

    Sometimes a distinction is made between assigning documents to classes ("classification") versus assigning subjects to documents ("subject indexing") but as Frederick Wilfrid Lancaster has argued, this distinction is not fruitful. "These terminological distinctions," he writes, "are quite meaningless and only serve to cause confusion" (Lancaster, 2003, p. 21). The view that this distinction is purely superficial is also supported by the fact that a classification system may be transformed into a thesaurus and vice versa (cf., Aitchison, 1986, 2004; Broughton, 2008; Riesthuis & Bliedung, 1991). Therefore, assigning a subject term to a document in an index is equivalent to assigning that document to the class of documents indexed by that term (all documents indexed or classified as X belong to the same class of documents).
    == Automatic document classification (ADC) ==

    Automatic document classification tasks can be divided into three sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents; unsupervised document classification (also known as document clustering), where the classification must be done entirely without reference to external information; and semi-supervised document classification, where parts of the documents are labeled by the external mechanism. There are several software products under various license models available.

    === Techniques ===

    Automatic document classification techniques include:

      • Artificial neural network
      • Concept Mining
      • Decision trees such as ID3 or C4.5
      • Expectation maximization (EM)
      • Instantaneously trained neural networks
      • Latent semantic indexing
      • Multiple-instance learning
      • Naive Bayes classifier
      • Natural language processing approaches
      • Rough set-based classifier
      • Soft set-based classifier
      • Support vector machines (SVM)
      • K-nearest neighbour algorithms
      • tf–idf

    == Applications ==

    Classification techniques have been applied to

      • spam filtering, a process which tries to discern e-mail spam messages from legitimate emails
      • email routing, sending an email sent to a general address to a specific address or mailbox depending on topic
      • language identification, automatically determining the language of a text
      • genre classification, automatically determining the genre of a text
      • readability assessment, automatically determining the degree of readability of a text, either to find suitable materials for different age groups or reader types or as part of a larger text simplification system
      • sentiment analysis, determining the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document
      • health-related classification using social media in public health surveillance
      • article triage, selecting articles that are relevant for manual literature curation, for example as is being done as the first step to generate manually curated annotation databases in biology
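
    As one illustration of supervised document classification, the sketch below applies a naive Bayes classifier (one of the techniques listed above) to the spam filtering application. It is a minimal multinomial variant with add-one smoothing written for this example. The four training messages, the labels, and the function names are invented, and words never seen in training are simply ignored at prediction time.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label) pairs. Returns the counts used to predict."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    label_counts = Counter()             # label -> number of documents
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior for the given text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # log prior
        for word in text.lower().split():
            if word in vocab:  # skip words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1)
                                  / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labeled training data.
model = train([
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule today", "ham"),
    ("project meeting notes", "ham"),
])
print(classify("free money offer", model))
print(classify("project schedule", model))
```

    With this toy data, "free money offer" is scored as spam and "project schedule" as ham, because the words "money" and "offer" occur only in the spam examples and "project" and "schedule" only in the legitimate ones. A realistic classifier would need far more data and better tokenization.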

  • Reservoir computing

    Reservoir computing is a framework for computation derived from recurrent neural network theory that maps input signals into higher dimensional computational spaces through the dynamics of a fixed, non-linear system called a reservoir. After the input signal is fed into the reservoir, which is treated as a "black box," a simple readout mechanism is trained to read the state of the reservoir and map it to the desired output. The first key benefit of this framework is that training is performed only at the readout stage, as the reservoir dynamics are fixed. The second is that the computational power of naturally available systems, both classical and quantum mechanical, can be used to reduce the effective computational cost.

    == History ==

    The first examples of reservoir neural networks demonstrated that randomly connected recurrent neural networks could be used for sensorimotor sequence learning, and simple forms of interval and speech discrimination. In these early models the memory in the network took the form of both short-term synaptic plasticity and activity mediated by recurrent connections. In other early reservoir neural network models the memory of the recent stimulus history was provided solely by the recurrent activity.

    Overall, the general concept of reservoir computing stems from the use of recursive connections within neural networks to create a complex dynamical system. It is a generalisation of earlier neural network architectures such as recurrent neural networks, liquid-state machines and echo-state networks. Reservoir computing also extends to physical systems that are not networks in the classical sense, but rather continuous systems in space and/or time: e.g. a literal "bucket of water" can serve as a reservoir that performs computations on inputs given as perturbations of the surface.
    The resultant complexity of such recurrent neural networks was found to be useful in solving a variety of problems including language processing and dynamic system modeling. However, training of recurrent neural networks is challenging and computationally expensive. Reservoir computing reduces those training-related challenges by fixing the dynamics of the reservoir and only training the linear output layer.

    A large variety of nonlinear dynamical systems can serve as a reservoir that performs computations. In recent years semiconductor lasers have attracted considerable interest, as computation can be fast and energy efficient compared to electrical components.

    Recent advances in both AI and quantum information theory have given rise to the concept of quantum neural networks. These hold promise in quantum information processing, which is challenging for classical networks, but can also find application in solving classical problems. In 2018, a physical realization of a quantum reservoir computing architecture was demonstrated in the form of nuclear spins within a molecular solid. However, the nuclear spin experiments did not demonstrate quantum reservoir computing per se, as they did not involve processing of sequential data. Rather, the data were vector inputs, which makes this more accurately a demonstration of a quantum implementation of a random kitchen sink algorithm (also going by the name of extreme learning machines in some communities). In 2019, another possible implementation of quantum reservoir processors was proposed in the form of two-dimensional fermionic lattices. In 2020, realization of reservoir computing on gate-based quantum computers was proposed and demonstrated on cloud-based IBM superconducting near-term quantum computers.

    Reservoir computers have been used for time-series analysis purposes.
In particular, some of their usages involve chaotic time-series prediction, separation of chaotic signals, and link inference of networks from their dynamics. == Classical reservoir computing == === Reservoir === The 'reservoir' in reservoir computing is the internal structure of the computer, and must have two properties: it must be made up of individual, non-linear units, and it must be capable of storing information. The non-linearity describes the response of each unit to input, which is what allows reservoir computers to solve complex problems. Reservoirs are able to store information by connecting the units in recurrent loops, where the previous input affects the next response. The change in reaction due to the past allows the computers to be trained to complete specific tasks. Reservoirs can be virtual or physical. Virtual reservoirs are typically randomly generated and are designed like neural networks. Virtual reservoirs can be designed to have non-linearity and recurrent loops, but, unlike neural networks, the connections between units are randomized and remain unchanged throughout computation. Physical reservoirs are possible because of the inherent non-linearity of certain natural systems. The interaction between ripples on the surface of water contains the nonlinear dynamics required in reservoir creation, and a pattern recognition RC was developed by first inputting ripples with electric motors then recording and analyzing the ripples in the readout. === Readout === The readout is a neural network layer that performs a linear transformation on the output of the reservoir. The weights of the readout layer are trained by analyzing the spatiotemporal patterns of the reservoir after excitation by known inputs, and by utilizing a training method such as a linear regression or a Ridge regression. As its implementation depends on spatiotemporal reservoir patterns, the details of readout methods are tailored to each type of reservoir. 
For example, the readout for a reservoir computer using a container of liquid as its reservoir might entail observing spatiotemporal patterns on the surface of the liquid. === Types === ==== Context reverberation network ==== An early example of reservoir computing was the context reverberation network. In this architecture, an input layer feeds into a high dimensional dynamical system which is read out by a trainable single-layer perceptron. Two kinds of dynamical system were described: a recurrent neural network with fixed random weights, and a continuous reaction–diffusion system inspired by Alan Turing's model of morphogenesis. At the trainable layer, the perceptron associates current inputs with the signals that reverberate in the dynamical system; the latter were said to provide a dynamic "context" for the inputs. In the language of later work, the reaction–diffusion system served as the reservoir. ==== Echo state network ==== The tree echo state network (TreeESN) model represents a generalization of the reservoir computing framework to tree structured data. ==== Liquid-state machine ==== ===== Chaotic liquid state machine ===== The liquid (i.e. reservoir) of a chaotic liquid state machine (CLSM), or chaotic reservoir, is made from chaotic spiking neurons that nevertheless stabilize their activity by settling to a single hypothesis that describes the trained inputs of the machine. This is in contrast to general types of reservoirs, which do not stabilize. The liquid stabilization occurs via synaptic plasticity and chaos control that govern neural connections inside the liquid. CLSM showed promising results in learning sensitive time series data. ==== Nonlinear transient computation ==== This type of information processing is most relevant when time-dependent input signals depart from the mechanism's internal dynamics. These departures cause transients or temporary alterations which are represented in the device's output. 
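
The division of labour described in the Reservoir and Readout subsections above (a fixed, randomly connected non-linear system plus a trained linear readout) can be sketched in a few lines. This is a minimal illustration only, assuming NumPy is available; the reservoir size, weight ranges, spectral radius of 0.9, washout length, ridge strength, and the sine-wave prediction task are all illustrative assumptions, not values taken from the text.

```python
import numpy as np

def run_reservoir(u, W_in, W, leak=1.0):
    """Drive a fixed random reservoir with the 1-D input sequence u; return all states."""
    x = np.zeros(W.shape[0])
    states = np.empty((len(u), W.shape[0]))
    for t, u_t in enumerate(u):
        # Non-linear units in recurrent loops: the new state depends on the
        # current input and on the previous state.
        x = (1.0 - leak) * x + leak * np.tanh(W_in * u_t + W @ x)
        states[t] = x
    return states

def train_readout(states, targets, ridge=1e-6):
    """Ridge regression for the linear readout; the reservoir itself is never modified."""
    A = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(A, states.T @ targets)

rng = np.random.default_rng(0)
n_res = 50                                        # reservoir size (assumed)
W_in = rng.uniform(-0.5, 0.5, size=n_res)         # fixed random input weights
W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))   # fixed random recurrent weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale spectral radius to 0.9

signal = np.sin(0.2 * np.arange(600))
u, y = signal[:-1], signal[1:]                    # toy task: predict the next sample
washout = 100                                     # discard initial transient states

states = run_reservoir(u, W_in, W)
W_out = train_readout(states[washout:], y[washout:])
pred = states[washout:] @ W_out
mse = float(np.mean((pred - y[washout:]) ** 2))
print(f"training MSE: {mse:.2e}")
```

Only `W_out` is learned here; `W_in` and `W` stay at their random initial values, which is the property that makes training cheap compared with a fully trained recurrent network.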
==== Deep reservoir computing ==== The extension of the reservoir computing framework towards deep learning, with the introduction of deep reservoir computing and of the deep echo state network (DeepESN) model, makes it possible to develop efficiently trained models for hierarchical processing of temporal data, while also enabling investigation of the inherent role of layered composition in recurrent neural networks. == Quantum reservoir computing == Quantum reservoir computing may use the nonlinear nature of quantum mechanical interactions or processes to form the characteristic nonlinear reservoirs, but may also be done with linear reservoirs when the injection of the input to the reservoir creates the nonlinearity. The marriage of machine learning and quantum devices is leading to the emergence of quantum neuromorphic computing as a new research area. === Types === ==== Gaussian states of interacting quantum harmonic oscillators ==== Gaussian states are a paradigmatic class of states of continuous variable quantum systems. Although they can nowadays be created and manipulated in, e.g., state-of-the-art optical platforms, naturally robust to decoherence, it is well known that they are not sufficient for, e.g., universal quantum computing because transformations that preserve the Gaussian nature of a state are linear. Normally, linear dynamics would not be sufficient for nontrivial reser

    Read more →
  • Bayesian network

    Bayesian network

    A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms can perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables (e.g. speech signals or protein sequences) are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams. == Graphical model == Formally, Bayesian networks are directed acyclic graphs (DAGs) whose nodes represent variables in the Bayesian sense: they may be observable quantities, latent variables, unknown parameters or hypotheses. Each edge represents a direct conditional dependency. Any pair of nodes that are not connected (i.e. no path connects one node to the other) represent variables that are conditionally independent of each other. Each node is associated with a probability function that takes, as input, a particular set of values for the node's parent variables, and gives (as output) the probability (or probability distribution, if applicable) of the variable represented by the node. 
For example, if \(m\) parent nodes represent \(m\) Boolean variables, then the probability function could be represented by a table of \(2^{m}\) entries, one entry for each of the \(2^{m}\) possible parent combinations. Similar ideas may be applied to undirected, and possibly cyclic, graphs such as Markov networks. == Example == Suppose we want to model the dependencies between three variables: the sprinkler (or more appropriately, its state - whether it is on or not), the presence or absence of rain and whether the grass is wet or not. Observe that two events can cause the grass to become wet: an active sprinkler or rain. Rain has a direct effect on the use of the sprinkler (namely that when it rains, the sprinkler usually is not active). This situation can be modeled with a Bayesian network (shown to the right). Each variable has two possible values, T (for true) and F (for false). The joint probability function is, by the chain rule of probability, \(\Pr(G,S,R)=\Pr(G\mid S,R)\Pr(S\mid R)\Pr(R)\) where G = "Grass wet (true/false)", S = "Sprinkler turned on (true/false)", and R = "Raining (true/false)". The model can answer questions about the presence of a cause given the presence of an effect (so-called inverse probability) like "What is the probability that it is raining, given the grass is wet?" 
by using the conditional probability formula and summing over all nuisance variables: \(\Pr(R=T\mid G=T)={\frac {\Pr(G=T,R=T)}{\Pr(G=T)}}={\frac {\sum _{x\in \{T,F\}}\Pr(G=T,S=x,R=T)}{\sum _{x,y\in \{T,F\}}\Pr(G=T,S=x,R=y)}}\) Using the expansion for the joint probability function \(\Pr(G,S,R)\) and the conditional probabilities from the conditional probability tables (CPTs) stated in the diagram, one can evaluate each term in the sums in the numerator and denominator. For example, \(\Pr(G=T,S=T,R=T)=\Pr(G=T\mid S=T,R=T)\Pr(S=T\mid R=T)\Pr(R=T)=0.99\times 0.01\times 0.2=0.00198.\) Then the numerical results (subscripted by the associated variable values) are \(\Pr(R=T\mid G=T)={\frac {0.00198_{TTT}+0.1584_{TFT}}{0.00198_{TTT}+0.288_{TTF}+0.1584_{TFT}+0.0_{TFF}}}={\frac {891}{2491}}\approx 35.77\%.\) To answer an interventional question, such as "What is the probability that it would rain, given that we wet the grass?" the answer is governed by the post-intervention joint distribution function \(\Pr(S,R\mid {\text{do}}(G=T))=\Pr(S\mid R)\Pr(R)\) obtained by removing the factor \(\Pr(G\mid S,R)\) from the pre-intervention distribution. The do operator forces the value of G to be true. The probability of rain is unaffected by the action: \(\Pr(R\mid {\text{do}}(G=T))=\Pr(R).\) 
To predict the impact of turning the sprinkler on: \(\Pr(R,G\mid {\text{do}}(S=T))=\Pr(R)\Pr(G\mid R,S=T)\) with the term \(\Pr(S=T\mid R)\) removed, showing that the action affects the grass but not the rain. These predictions may not be feasible given unobserved variables, as in most policy evaluation problems. The effect of the action \({\text{do}}(x)\) can still be predicted, however, whenever the back-door criterion is satisfied. It states that, if a set Z of nodes can be observed that d-separates (or blocks) all back-door paths from X to Y then \(\Pr(Y,Z\mid {\text{do}}(x))={\frac {\Pr(Y,Z,X=x)}{\Pr(X=x\mid Z)}}.\) A back-door path is one that ends with an arrow into X. Sets that satisfy the back-door criterion are called "sufficient" or "admissible." For example, the set Z = R is admissible for predicting the effect of S = T on G, because R d-separates the (only) back-door path S ← R → G. However, if S is not observed, no other set d-separates this path and the effect of turning the sprinkler on (S = T) on the grass (G) cannot be predicted from passive observations. In that case P(G | do(S = T)) is not "identified". This reflects the fact that, lacking interventional data, the observed dependence between S and G is due to a causal connection or is spurious (apparent dependence arising from a common cause, R). (see Simpson's paradox) To determine whether a causal relation is identified from an arbitrary Bayesian network with unobserved variables, one can use the three rules of "do-calculus" and test whether all do terms can be removed from the expression of that relation, thus confirming that the desired quantity is estimable from frequency data. 
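
The sprinkler calculations above can be reproduced by brute-force enumeration over the joint distribution. The sketch below is illustrative only: the function names are hypothetical, and the individual table entries 0.4 and 0.9 are assumed (the text fixes only their product, 0.288 / 0.8 = 0.36); the remaining entries follow from the numbers quoted in the example.

```python
from itertools import product

# Conditional probability tables for the sprinkler example.
# ASSUMPTION: 0.4 and 0.9 are assumed values; the text above only implies
# their product (0.36). The other entries follow from the quoted numbers.
P_R = {True: 0.2, False: 0.8}                  # Pr(R)
P_S_given_R = {True: 0.01, False: 0.4}         # Pr(S=T | R)
P_G_given_SR = {                               # Pr(G=T | S, R), keyed by (S, R)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}

def joint(g, s, r):
    """Pr(G=g, S=s, R=r) = Pr(G | S, R) Pr(S | R) Pr(R)."""
    p_s = P_S_given_R[r] if s else 1.0 - P_S_given_R[r]
    p_g = P_G_given_SR[(s, r)] if g else 1.0 - P_G_given_SR[(s, r)]
    return p_g * p_s * P_R[r]

def rain_given_wet_grass():
    """Pr(R=T | G=T): sum out the nuisance variable S, then normalise."""
    num = sum(joint(True, s, True) for s in (True, False))
    den = sum(joint(True, s, r) for s, r in product((True, False), repeat=2))
    return num / den

def wet_grass_given_do_sprinkler():
    """Pr(G=T | do(S=T)): drop the factor Pr(S | R) and fix S = T."""
    return sum(P_R[r] * P_G_given_SR[(True, r)] for r in (True, False))

print(round(rain_given_wet_grass(), 4))          # 0.3577
print(round(wet_grass_given_do_sprinkler(), 3))  # 0.918
```

The first value matches 891/2491 from the text. The second marginalises the post-intervention distribution Pr(R, G | do(S = T)) over R, which is why Pr(R) enters unchanged.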
Using a Bayesian network can save considerable amounts of memory over exhaustive probability tables, if the dependencies in the joint distribution are sparse. For example, a naive way of storing the conditional probabilities of 10 two-valued variables as a table requires storage space for \(2^{10}=1024\) values. If no variable's local distribution depends on more than three parent variables, the Bayesian network representation stores at most \(10\cdot 2^{3}=80\) values. One advantage of Bayesian networks is that it is intuitively easier for a human to understand (a sparse set of) direct dependencies and local distributions than complete joint distributions. == Inference and learning == Bayesian networks perform three main inference tasks: inferring unobserved variables; parameter learning for the probability distributions of each node in the network; and structure learning of the graphical network. === Inferring unobserved variables === Because a Bayesian network is a complete model for its variables and their relationships, it can be used to answer probabilistic queries about them. For example, the network can be used to update knowledge of the state of a subset of variables when other variables (the evidence variables) are observed. This process of computing the posterior distribution of variables given evidence is called probabilistic inference. The posterior gives a universal sufficient statistic for detection applications, when choosing values for the variable subset that minimize some expected loss function, for instance the probability of decision error. A Bayesian network can thus be considered a mechanism for automatically applying Bayes' theorem to complex problems. The most common exact inference methods are: variable elimination, which eliminates (by integration or summation) the non-observed non-query variables one by one by distributing the sum over the prod

    Read more →
  • Perceptron

    Perceptron

    In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function that can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. == History == The artificial neuron and artificial neural network were invented in 1943 by Warren McCulloch and Walter Pitts in their seminal paper "A Logical Calculus of the Ideas Immanent in Nervous Activity". In 1957, Frank Rosenblatt was at the Cornell Aeronautical Laboratory. He simulated the perceptron on an IBM 704. Later, he obtained funding from the Information Systems Branch of the United States Office of Naval Research and the Rome Air Development Center to build a custom-made computer, the Mark I Perceptron. It was first publicly demonstrated on 23 June 1960. The machine was "part of a previously secret four-year NPIC [the US' National Photographic Interpretation Center] effort from 1963 through 1966 to develop this algorithm into a useful tool for photo-interpreters". Rosenblatt described the details of the perceptron in a 1958 paper. His organization of a perceptron is constructed of three kinds of cells ("units"): S, A, R, which stand for "sensory", "association" and "response". He presented at the first international symposium on AI, Mechanisation of Thought Processes, which took place in November 1958. Rosenblatt's project was funded under Contract Nonr-401(40) "Cognitive Systems Research Program", which lasted from 1959 to 1970, and Contract Nonr-2381(00) "Project PARA" ("PARA" means "Perceiving and Recognition Automata"), which lasted from 1957 to 1963. In 1959, the Institute for Defense Analysis awarded his group a $10,000 contract. 
By September 1961, the ONR had awarded a further $153,000 worth of contracts, with $108,000 committed for 1962. The ONR research manager, Marvin Denicoff, stated that ONR, instead of ARPA, funded the Perceptron project, because the project was unlikely to produce technological results in the near or medium term. Funding from ARPA went up to the order of millions of dollars, while funding from ONR was on the order of 10,000 dollars. Meanwhile, the head of IPTO at ARPA, J.C.R. Licklider, was interested in 'self-organizing', 'adaptive' and other biologically-inspired methods in the 1950s; but by the mid-1960s he was openly critical of these, including the perceptron. Instead he strongly favored the logical AI approach of Simon and Newell. === Mark I Perceptron machine === The perceptron was intended to be a machine, rather than a program, and while its first implementation was in software for the IBM 704, it was subsequently implemented in custom-built hardware as the Mark I Perceptron with the project name "Project PARA", designed for image recognition. The machine is currently in the Smithsonian National Museum of American History. The Mark I Perceptron had three layers. One version was implemented as follows: An array of 400 photocells arranged in a 20x20 grid, named "sensory units" (S-units), or "input retina". Each S-unit can connect to up to 40 A-units. A hidden layer of 512 perceptrons, named "association units" (A-units). An output layer of eight perceptrons, named "response units" (R-units). Rosenblatt called this three-layered perceptron network the alpha-perceptron, to distinguish it from other perceptron models he experimented with. The S-units are connected to the A-units randomly (according to a table of random numbers) via a plugboard (see photo), to "eliminate any particular intentional bias in the perceptron". The connection weights are fixed, not learned. 
Rosenblatt was adamant about the random connections, as he believed the retina was randomly connected to the visual cortex, and he wanted his perceptron machine to resemble human visual perception. The A-units are connected to the R-units, with adjustable weights encoded in potentiometers, and weight updates during learning were performed by electric motors. The hardware details are in an operators' manual. In a 1958 press conference organized by the US Navy, Rosenblatt made statements about the perceptron that caused a heated controversy among the fledgling AI community; based on Rosenblatt's statements, The New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." The Photo Division of the Central Intelligence Agency, from 1960 to 1964, studied the use of the Mark I Perceptron machine for recognizing militarily interesting silhouetted targets (such as planes and ships) in aerial photos. === Principles of Neurodynamics (1962) === Rosenblatt described his experiments with many variants of the Perceptron machine in the book Principles of Neurodynamics (1962). The book is a published version of the 1961 report. Among the variants are: "cross-coupling" (connections between units within the same layer) with possibly closed loops; "back-coupling" (connections from units in a later layer to units in a previous layer); four-layer perceptrons where the last two layers have adjustable weights (and thus a proper multilayer perceptron); incorporating time-delays to perceptron units, to allow for processing sequential data; and analyzing audio (instead of images). The machine was shipped from Cornell to the Smithsonian in 1967, under a government transfer administered by the Office of Naval Research. 
=== Perceptrons (1969) === Although the perceptron initially seemed promising, it was quickly proved that perceptrons could not be trained to recognise many classes of patterns. This caused the field of neural network research to stagnate for many years, before it was recognised that a feedforward neural network with two or more layers (also called a multilayer perceptron) had greater processing power than perceptrons with one layer (also called a single-layer perceptron). Single-layer perceptrons are only capable of learning linearly separable patterns. For a classification task with some step activation function, a single node will have a single line dividing the data points forming the patterns. More nodes can create more dividing lines, but those lines must somehow be combined to form more complex classifications. A second layer of perceptrons, or even linear nodes, is sufficient to solve many otherwise non-separable problems. In 1969, a famous book entitled Perceptrons by Marvin Minsky and Seymour Papert showed that it was impossible for these classes of network to learn an XOR function. It is often incorrectly believed that they also conjectured that a similar result would hold for a multi-layer perceptron network. However, this is not true, as both Minsky and Papert already knew that multi-layer perceptrons were capable of producing an XOR function. (See the page on Perceptrons (book) for more information.) Nevertheless, the often-miscited Minsky and Papert text caused a significant decline in interest and funding of neural network research. It took ten more years until neural network research experienced a resurgence in the 1980s. This text was reprinted in 1987 as "Perceptrons - Expanded Edition" where some errors in the original text are shown and corrected. === Subsequent work === Rosenblatt continued working on perceptrons despite diminishing funding. The last attempt was Tobermory, built between 1961 and 1967 for speech recognition. 
It occupied an entire room. It had 4 layers with 12,000 weights implemented by toroidal magnetic cores. By the time of its completion, simulation on digital computers had become faster than purpose-built perceptron machines. Rosenblatt died in a boating accident in 1971. A simulation program for neural networks was written for the IBM 7090/7094, and was used to study various pattern recognition applications, such as character recognition; particle tracks in bubble-chamber photographs; phoneme, isolated word, and continuous speech recognition; speaker verification; and center-of-attention mechanisms for image processing. The kernel perceptron algorithm was already introduced in 1964 by Aizerman et al. Margin bound guarantees were given for the Perceptron algorithm in the general non-separable case first by Freund and Schapire (1998), and more recently by Mohri and Rostamizadeh (2013), who extend previous results and give new and more favorable L1 bounds. The perceptron is a simplified model of a biological neuron. While the complexity of biological neuron models is often required to fully understand neural behavior, research suggests a perceptron-like linear model can produce some behavior seen in real neurons. The solution spaces of decision boundaries for all binary functions and learning behaviors have also been studied. == Definition == In the modern sense, the perceptron is an algori

    Read more →
  • Site reliability engineering

    Site reliability engineering

    Site reliability engineering (SRE) is a discipline in the field of software engineering and IT infrastructure support that monitors and improves the availability and performance of deployed software systems and large software services (which are expected to deliver reliable response times across events such as new software deployments, hardware failures, and cybersecurity attacks). There is typically a focus on automation and an infrastructure as code methodology. SRE uses elements of software engineering, IT infrastructure, web development, and operations to assist with reliability. It is similar to DevOps as they both aim to improve the reliability and availability of deployed software systems. == History == Site Reliability Engineering originated at Google with Benjamin Treynor Sloss, who founded the SRE team in 2003. The concept expanded within the software development industry, leading various companies to employ site reliability engineers. By March 2016, Google had more than 1,000 site reliability engineers on staff. Dedicated SRE teams are common at larger web development companies. In middle-sized and smaller companies, DevOps teams sometimes perform SRE as well. Organizations that have adopted the concept include Airbnb, Dropbox, IBM, LinkedIn, Netflix, and Wikimedia. == Definition == Site reliability engineers (SREs) are responsible for a combination of system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. SREs often have backgrounds in software engineering, systems engineering, and/or system administration. The focuses of SRE include automation, system design, and improvements to system resilience. SRE is considered a specific implementation of DevOps, focusing specifically on building reliable systems, whereas DevOps covers a broader scope of operations. Despite having different focuses, some companies have rebranded their operations teams to SRE teams. 
== Principles and practices == Common definitions of the practices include (but are not limited to): Automation of repetitive tasks for cost-effectiveness. Defining reliability goals to prevent endless effort. Design of systems with a goal to reduce risks to availability, latency, and efficiency. Observability, the ability to ask arbitrary questions about a system without having to know ahead of time what to ask. Common definitions of the principles include (but are not limited to): Toil management, the implementation of the first principle outlined above. Defining and measuring reliability goals—SLIs, SLOs, and error budgets. Non-Abstract Large Scale Systems Design (NALSD) with a focus on reliability. Designing for and implementing observability. Defining, testing, and running an incident management process. Capacity planning. Change and release management, including CI/CD. Chaos engineering. == Deployment == SRE teams collaborate with other departments within organizations to guide the implementation of the mentioned principles. Below is an overview of common practices: === Kitchen Sink === Kitchen Sink refers to the expansive and often unbounded scope of services and workflows that SRE teams oversee. Unlike traditional roles with clearly defined boundaries, SREs are tasked with various responsibilities, including system performance optimization, incident management, and automation. This approach allows SREs to address multiple challenges, ensuring that systems run efficiently and evolve in response to changing demands and complexities. === Infrastructure === Infrastructure SRE teams focus on maintaining and improving the reliability of systems that support other teams' workflows. While they sometimes collaborate with platform engineering teams, their primary responsibility is ensuring up-time, performance, and efficiency. Platform teams, on the other hand, primarily develop the software and systems used across the organization. 
While reliability is a goal for both, platform teams prioritize creating and maintaining the tools and services used by internal stakeholders, whereas Infrastructure SRE teams are tasked with ensuring those systems run smoothly and meet reliability standards. === Tools === SRE teams use a variety of tools to measure, maintain, and enhance system reliability. These tools play a role in monitoring performance, identifying issues, and facilitating proactive maintenance. For instance, Nagios Core is commonly employed for system monitoring and alerting, while Prometheus is frequently used for collecting and querying metrics in cloud-native environments. === Product or Application === SRE teams dedicated to specific products or applications are common in large organizations. These teams are responsible for ensuring the reliability, scalability, and performance of key services. In larger companies, it is typical to have multiple SRE teams, each focusing on different products or applications, ensuring that each area receives specialized attention to meet performance and availability targets. === Embedded === In an embedded model, individual SREs or small SRE pairs are integrated within software engineering teams. These SREs collaborate with developers, applying core SRE principles (such as automation, monitoring, and incident response) directly to the software development lifecycle. This approach aims to enhance reliability, performance, and collaboration between SREs and developers. === Consulting === Consulting SRE teams specialize in advising organizations on the implementation of SRE principles and practices. Typically composed of seasoned SREs with a history across various implementations, these teams provide insights and guidance for specific organizational needs. When working directly with clients, these SREs are often referred to as 'Customer Reliability Engineers.' 
In large organizations that have adopted SRE, a hybrid model is common. This model includes various implementations, such as multiple Product/Application SRE teams dedicated to addressing the specific reliability needs of different products. An Infrastructure SRE team may collaborate with a Platform engineering group to achieve shared reliability goals for a unified platform that supports all products and applications. == Industry == Since 2014, the USENIX organization has hosted the annual SREcon conference, bringing together site reliability engineers from various industries. This conference is a platform for professionals to share knowledge, explore effective practices, and discuss trends in site reliability engineering.
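
The reliability goals listed under Principles and practices (SLIs, SLOs, and error budgets) reduce to simple arithmetic once a service level indicator is expressed as a ratio of successful requests. The sketch below is a hypothetical illustration, assuming a request-based SLO; the function name, the 99.9% target, and the request counts are made up for the example and do not come from the text.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, remaining_fraction) for a request-based SLO.

    slo_target is the required success ratio, e.g. 0.999 for "99.9% of
    requests succeed"; the error budget is the complementary 1 - slo_target.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = (1.0 - slo_target) * total_requests   # failures the SLO tolerates
    remaining = 1.0 - failed_requests / allowed     # negative means overspent
    return allowed, remaining

# Hypothetical month: one million requests, 250 of which failed.
allowed, remaining = error_budget(0.999, 1_000_000, 250)
print(f"allowed failures: {allowed:.0f}, budget remaining: {remaining:.0%}")
```

A team might, for instance, slow down releases once the remaining fraction approaches zero; that policy is an organizational choice rather than something the calculation itself prescribes.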

    Read more →
  • Large margin nearest neighbor

    Large margin nearest neighbor

    Large margin nearest neighbor (LMNN) classification is a statistical machine learning algorithm for metric learning. It learns a pseudometric designed for k-nearest neighbor classification. The algorithm is based on semidefinite programming, a sub-class of convex optimization. The goal of supervised learning (more specifically classification) is to learn a decision rule that can categorize data instances into pre-defined classes. The k-nearest neighbor rule assumes a training data set of labeled instances (i.e. the classes are known). It classifies a new data instance with the class obtained from the majority vote of the k closest (labeled) training instances. Closeness is measured with a pre-defined metric. Large margin nearest neighbors is an algorithm that learns this global (pseudo-)metric in a supervised fashion to improve the classification accuracy of the k-nearest neighbor rule. == Setup == The main intuition behind LMNN is to learn a pseudometric under which all data instances in the training set are surrounded by at least k instances that share the same class label. If this is achieved, the leave-one-out error (a special case of cross validation) is minimized. Let the training data consist of a data set D = { ( x → 1 , y 1 ) , … , ( x → n , y n ) } ⊂ R d × C {\displaystyle D=\{({\vec {x}}_{1},y_{1}),\dots ,({\vec {x}}_{n},y_{n})\}\subset R^{d}\times C} , where the set of possible class categories is C = { 1 , … , c } {\displaystyle C=\{1,\dots ,c\}} . The algorithm learns a pseudometric of the type d ( x → i , x → j ) = ( x → i − x → j ) ⊤ M ( x → i − x → j ) {\displaystyle d({\vec {x}}_{i},{\vec {x}}_{j})=({\vec {x}}_{i}-{\vec {x}}_{j})^{\top }\mathbf {M} ({\vec {x}}_{i}-{\vec {x}}_{j})} . For d ( ⋅ , ⋅ ) {\displaystyle d(\cdot ,\cdot )} to be well defined, the matrix M {\displaystyle \mathbf {M} } needs to be positive semi-definite. The Euclidean metric is a special case, where M {\displaystyle \mathbf {M} } is the identity matrix. 
    This generalization is often (falsely) referred to as a Mahalanobis metric. Figure 1 illustrates the effect of the metric under varying $\mathbf{M}$. The two circles show the set of points with equal distance to the center $\vec{x}_i$. In the Euclidean case this set is a circle, whereas under the modified (Mahalanobis) metric it becomes an ellipsoid.

    The algorithm distinguishes between two types of special data points: target neighbors and impostors.

    === Target neighbors ===

    Target neighbors are selected before learning. Each instance $\vec{x}_i$ has exactly $k$ different target neighbors within $D$, which all share the same class label $y_i$. The target neighbors are the data points that should become nearest neighbors under the learned metric. Let us denote the set of target neighbors for a data point $\vec{x}_i$ as $N_i$.

    === Impostors ===

    An impostor of a data point $\vec{x}_i$ is another data point $\vec{x}_j$ with a different class label (i.e. $y_i\neq y_j$) which is one of the nearest neighbors of $\vec{x}_i$. During learning the algorithm tries to minimize the number of impostors for all data instances in the training set.

    == Algorithm ==

    Large margin nearest neighbors optimizes the matrix $\mathbf{M}$ with the help of semidefinite programming. The objective is twofold: for every data point $\vec{x}_i$, the target neighbors should be close and the impostors should be far away. Figure 1 shows the effect of such an optimization on an illustrative example. The learned metric causes the input vector $\vec{x}_i$ to be surrounded by training instances of the same class.
    If it were a test point, it would be classified correctly under the $k=3$ nearest neighbor rule.

    The first optimization goal is achieved by minimizing the average distance between instances and their target neighbors,

    $$\sum_{i,j\in N_i} d(\vec{x}_i,\vec{x}_j).$$

    The second goal is achieved by penalizing distances to impostors $\vec{x}_l$ that are less than one unit further away than target neighbors $\vec{x}_j$ (and therefore pushing them out of the local neighborhood of $\vec{x}_i$). The resulting value to be minimized can be stated as

    $$\sum_{i,j\in N_i,\,l,\,y_l\neq y_i}\left[d(\vec{x}_i,\vec{x}_j)+1-d(\vec{x}_i,\vec{x}_l)\right]_{+},$$

    where $[\cdot]_{+}=\max(\cdot,0)$ is the hinge loss function, which ensures that impostor proximity is not penalized when outside the margin. The margin of exactly one unit fixes the scale of the matrix $\mathbf{M}$. Any alternative choice $c>0$ would result in a rescaling of $\mathbf{M}$ by a factor of $1/c$.

    The final optimization problem becomes:

    $$\min_{\mathbf{M}}\ \sum_{i,j\in N_i} d(\vec{x}_i,\vec{x}_j)+\lambda\sum_{i,j,l}\xi_{ijl}$$

    subject to, for all $i,\ j\in N_i,\ l$ with $y_l\neq y_i$,

    $$d(\vec{x}_i,\vec{x}_j)+1-d(\vec{x}_i,\vec{x}_l)\leq \xi_{ijl},\qquad \xi_{ijl}\geq 0,\qquad \mathbf{M}\succeq 0.$$

    The hyperparameter $\lambda>0$ is some positive constant (typically set through cross-validation).
    Here the variables $\xi_{ijl}$ (together with two types of constraints) replace the hinge term in the cost function. They play a role similar to slack variables to absorb the extent of violations of the impostor constraints. The last constraint ensures that $\mathbf{M}$ is positive semi-definite. The optimization problem is an instance of semidefinite programming (SDP). Although SDPs tend to suffer from high computational complexity, this particular SDP instance can be solved very efficiently due to the underlying geometric properties of the problem. In particular, most impostor constraints are naturally satisfied and do not need to be enforced during runtime (i.e. the set of variables $\xi_{ijl}$ is sparse). A particularly well suited solver technique is the working set method, which keeps a small set of constraints that are actively enforced and monitors the remaining (likely satisfied) constraints only occasionally to ensure correctness.

    == Extensions and efficient solvers ==

    LMNN was extended to multiple local metrics in a 2008 paper. This extension significantly reduces the classification error, but involves a more expensive optimization problem. In their 2009 publication in the Journal of Machine Learning Research, Weinberger and Saul derive an efficient solver for the semi-definite program. It can learn a metric for the MNIST handwritten digit data set in several hours, involving billions of pairwise constraints. An open source Matlab implementation is freely available at the authors' web page. Kumal et al. extended the algorithm to incorporate local invariances to multivariate polynomial transformations and improved regularization.
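The LMNN objective can be evaluated directly for a candidate matrix. The following is a minimal Python sketch, not the authors' solver: the function names `metric` and `lmnn_loss`, the nested-list representation of M, and the `targets` argument (holding the index set N_i for each point) are illustrative assumptions. It computes the pull term plus λ times the hinge term, which matches the slack-variable objective when every ξ_ijl takes its smallest feasible value. It does not optimize M or check that M is positive semi-definite.

```python
def metric(M, a, b):
    """Return d(a, b) = (a - b)^T M (a - b) for a square matrix M (nested lists)."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    size = len(diff)
    return sum(diff[r] * M[r][c] * diff[c]
               for r in range(size) for c in range(size))


def lmnn_loss(X, y, M, targets, lam=1.0):
    """Pull term plus lam times the hinge (push) term of the LMNN objective.

    X       : list of points (sequences of numbers)
    y       : list of class labels, one per point
    M       : candidate matrix defining the pseudometric
    targets : targets[i] lists the indices of the target neighbors of X[i]
              (hypothetical input format; chosen before learning)
    """
    pull = 0.0
    push = 0.0
    for i, neighbors in enumerate(targets):
        for j in neighbors:
            d_ij = metric(M, X[i], X[j])
            pull += d_ij
            # Every differently labeled point closer than d_ij + 1 is penalized.
            for l in range(len(X)):
                if y[l] != y[i]:
                    push += max(d_ij + 1.0 - metric(M, X[i], X[l]), 0.0)
    return pull + lam * push


# Toy data: two points of class 0 and one differently labeled point nearby.
X_demo = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0)]
y_demo = [0, 0, 1]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(lmnn_loss(X_demo, y_demo, identity, targets=[[1], [0], []]))  # 3.75
```

In the toy data the third point lies within the unit margin of the second point, so it contributes a hinge penalty of 1.75 on top of the pull term of 2.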

    Read more →
  • Random neural network

    Random neural network

    The Random Neural Network (RNN) is a mathematical representation of an interconnected network of neurons or cells which exchange spiking signals. It was invented by Erol Gelenbe and is linked to his G-network model of queueing networks and to his Gene Regulatory Network models. In this model, each neuronal cell state is represented by an integer whose value rises when the cell receives an excitatory spike and drops when it receives an inhibitory spike. The spikes can originate outside the network itself, or they can come from other cells in the network. Cells whose internal excitatory state has a positive value are allowed to send out spikes of either kind to other cells in the network according to specific cell-dependent spiking rates. The model has a mathematical solution in steady-state which provides the joint probability distribution of the network in terms of the individual probabilities that each cell is excited and able to send out spikes. Computing this solution is based on solving a set of non-linear algebraic equations whose parameters are related to the spiking rates of individual cells and their connectivity to other cells, as well as the arrival rates of spikes from outside the network. The RNN is a recurrent model, i.e. a neural network that is allowed to have complex feedback loops. A highly energy-efficient implementation of random neural networks was demonstrated by Krishna Palem et al. using the Probabilistic CMOS or PCMOS technology and was shown to be c. 226–300 times more efficient in terms of Energy-Performance-Product. RNNs are also related to artificial neural networks, which (like the random neural network) have gradient-based learning algorithms. The learning algorithm for an n-node random neural network that includes feedback loops (it is also a recurrent neural network) is of computational complexity $O(n^3)$ (the number of computations is proportional to the cube of n, the number of neurons).
The random neural network can also be used with other learning algorithms such as reinforcement learning. The RNN has been shown to be a universal approximator for bounded and continuous functions.
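The excerpt does not write out the non-linear steady-state equations. The sketch below assumes the commonly cited form of Gelenbe's signal-flow equations, q_i = λ⁺_i / (r_i + λ⁻_i) with λ⁺_i = Λ_i + Σ_j q_j w⁺_ji and λ⁻_i = λ_i + Σ_j q_j w⁻_ji, and solves them by plain fixed-point iteration. The function name, the argument names, the clipping of q_i at 1 and the fixed iteration count are all illustrative assumptions, and convergence of this simple iteration is not guaranteed for every network.

```python
def rnn_steady_state(ext_exc, ext_inh, rate, w_plus, w_minus, iters=200):
    """Approximate the excitation probabilities q_i by fixed-point iteration.

    ext_exc[i]    : external excitatory spike arrival rate at cell i (Lambda_i)
    ext_inh[i]    : external inhibitory spike arrival rate at cell i (lambda_i)
    rate[i]       : total firing rate r_i of cell i
    w_plus[j][i]  : rate of excitatory spikes sent from cell j to cell i
    w_minus[j][i] : rate of inhibitory spikes sent from cell j to cell i
    """
    n = len(rate)
    q = [0.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            excitation = ext_exc[i] + sum(q[j] * w_plus[j][i] for j in range(n))
            inhibition = ext_inh[i] + sum(q[j] * w_minus[j][i] for j in range(n))
            # Clip at 1 so that q_i stays a probability (assumption of this sketch).
            updated.append(min(1.0, excitation / (rate[i] + inhibition)))
        q = updated
    return q


# Two cells in a feed-forward chain: cell 0 receives external spikes
# and excites cell 1; there is no inhibition and no feedback.
q = rnn_steady_state(
    ext_exc=[1.0, 0.0],
    ext_inh=[0.0, 0.0],
    rate=[2.0, 4.0],
    w_plus=[[0.0, 2.0], [0.0, 0.0]],
    w_minus=[[0.0, 0.0], [0.0, 0.0]],
)
print(q)  # [0.5, 0.25]
```

For this chain the values follow by hand: q_0 = 1/2, and cell 1 then receives excitation at rate 0.5 × 2 = 1 against a firing rate of 4, giving q_1 = 0.25.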

    Read more →
  • International Conference on Acoustics, Speech, and Signal Processing

    International Conference on Acoustics, Speech, and Signal Processing

    ICASSP, the International Conference on Acoustics, Speech, and Signal Processing, is an annual flagship conference organized by the IEEE Signal Processing Society. Ei Compendex has indexed all papers included in its proceedings. The first ICASSP was held in 1976 in Philadelphia, Pennsylvania, based on the success of a conference in Massachusetts four years earlier that had focused specifically on speech signals. As ranked by Google Scholar's h-index metric in 2016, ICASSP has the highest h-index of any conference in the Signal Processing field. The Brazilian ministry of education gave the conference an 'A1' rating based on its h-index.

    Read more →
  • Spyglass (app)

    Spyglass (app)

    Spyglass is a navigation and orientation mobile application developed by Pavel Ahafonau. It combines data from a digital compass, GNSS positioning, motion sensors, maps, and the device camera to provide direction finding, waypoint navigation, and measurement tools. The application is designed for offline and off-road use and is used in outdoor navigation, orientation tasks, astronomy, and fieldwork.

    == History ==

    Spyglass was created by independent software developer Pavel Ahafonau as a personal project in 2009, following the introduction of a digital compass sensor in the iPhone. It initially focused on combining compass, GPS, and camera data into an augmented-reality tool for navigation and orientation. In September 2009, a public prototype was demonstrated, showing a live camera view combined with a digital compass overlay aligned to device orientation, presenting an early augmented-reality, location-aware heads-up display. The application was released on the Apple App Store in October 2009.

    In February 2010, a major update introduced target-based navigation, allowing users to navigate to saved locations, bearings, and selected celestial objects. The update also added visual measurement tools, including an optical-style rangefinder, as well as a vertical speed indicator displaying ascent and descent rates derived from device sensor data. In December 2010, Spyglass was featured by Apple in iTunes Rewind 2010 under augmented-reality applications.

    The application expanded to Android on 28 October 2017. In May 2021, Spyglass expanded its offline mapping capabilities by adding support for additional map styles by Thunderforest, extending the range of available cartographic themes for offline use. Also in 2021, navigation satellite tracking was introduced, allowing visualization and tracking of major GPS/GNSS satellite constellations.
    In 2022, a searchable offline database of major locations was added, including airports, seaports, mountains, castles, and landmarks, along with nearest-airport tracking functionality. In July 2024, previously separate iOS editions (Spyglass, Commander Compass, and Commander Compass Go) were consolidated into a single Spyglass application. At the same time, the app transitioned to a freemium model.

    == Features ==

    Spyglass provides navigation and orientation functions by combining sensor data from the device. Core functionality includes a digital compass, GNSS-based positioning, waypoint creation and tracking, and map-based navigation with offline support. The application includes an augmented-reality viewfinder mode that overlays navigation and sensor information onto the live camera view. Displayed data may include heading, bearing, distance to targets, pitch, roll, yaw, altitude, speed, and estimated time of arrival.

    Additional tools include an altimeter, speedometer, vertical speed indicator, inclinometer, artificial horizon, coordinate conversion utilities, optical rangefinding, and angular measurement tools. Spyglass also supports celestial navigation features, such as tracking of the Sun, Moon, stars, and global navigation satellite systems.

    Spyglass uses data from the device's GNSS receiver, digital compass, gyroscope, accelerometer, barometer (when available), and camera. Sensor data are combined to calculate position, orientation, movement, and measurement overlays.

    The application is designed to function without an internet connection. Navigation tools, sensor readings, waypoint tracking, augmented-reality features, celestial tracking, and the built-in location database operate offline. Internet access is required only for loading online map tiles; previously downloaded offline maps remain available without connectivity.

    Read more →
  • Probit model

    Probit model

    In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification model.

    A probit model is a popular specification for a binary response model. As such it treats the same set of problems as does logistic regression using similar techniques. When viewed in the generalized linear model framework, the probit model employs a probit link function. It is most often estimated using the maximum likelihood procedure, such an estimation being called a probit regression.

    == Conceptual framework ==

    Suppose a response variable $Y$ is binary, that is, it can have only two possible outcomes, which we will denote as 1 and 0. For example, $Y$ may represent presence/absence of a certain condition, success/failure of some device, answer yes/no on a survey, etc. We also have a vector of regressors $X$, which are assumed to influence the outcome $Y$. Specifically, we assume that the model takes the form

    $$P(Y=1\mid X)=\Phi(X^{\operatorname{T}}\beta),$$

    where $P$ is the probability and $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution. The parameters $\beta$ are typically estimated by maximum likelihood.

    It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable

    $$Y^{\ast}=X^{\operatorname{T}}\beta+\varepsilon,$$

    where $\varepsilon\sim N(0,1)$.
    Then $Y$ can be viewed as an indicator for whether this latent variable is positive:

    $$Y=\begin{cases}1 & Y^{\ast}>0\\ 0 & \text{otherwise}\end{cases}=\begin{cases}1 & X^{\operatorname{T}}\beta+\varepsilon>0\\ 0 & \text{otherwise}\end{cases}$$

    The use of the standard normal distribution causes no loss of generality compared with the use of a normal distribution with an arbitrary mean and standard deviation, because adding a fixed amount to the mean can be compensated by subtracting the same amount from the intercept, and multiplying the standard deviation by a fixed amount can be compensated by multiplying the weights by the same amount.

    To see that the two models are equivalent, note that

    $$\begin{aligned}P(Y=1\mid X)&=P(Y^{\ast}>0)\\&=P(X^{\operatorname{T}}\beta+\varepsilon>0)\\&=P(\varepsilon>-X^{\operatorname{T}}\beta)\\&=P(\varepsilon<X^{\operatorname{T}}\beta)&&\text{by symmetry of the normal distribution}\\&=\Phi(X^{\operatorname{T}}\beta)\end{aligned}$$

    […] $t,\ \lim_{n\rightarrow\infty}n_{t}/n=c_{t}>0$.
    Denote

    $$\hat{p}_t=r_t/n_t,\qquad \hat{\sigma}_t^{2}=\frac{1}{n_t}\,\frac{\hat{p}_t(1-\hat{p}_t)}{\varphi^{2}\big(\Phi^{-1}(\hat{p}_t)\big)}.$$

    Then Berkson's minimum chi-square estimator is a generalized least squares estimator in a regression of $\Phi^{-1}(\hat{p}_t)$ on $x_{(t)}$ with weights $\hat{\sigma}_t^{-2}$:

    $$\hat{\beta}=\Bigg(\sum_{t=1}^{T}\hat{\sigma}_t^{-2}x_{(t)}x_{(t)}^{\operatorname{T}}\Bigg)^{-1}\sum_{t=1}^{T}\hat{\sigma}_t^{-2}x_{(t)}\Phi^{-1}(\hat{p}_t)$$

    It can be shown that this estimator is consistent (as $n\rightarrow\infty$ with $T$ fixed), asymptotically normal and efficient. Its advantage is the presence of a closed-form formula for the estimator. However, it is only meaningful to carry out this analysis when individual observations are not available, only their aggregated counts $r_t$, $n_t$ […]

    Read more →
  • Prototype methods

    Prototype methods

    Prototype methods are machine learning methods that use data prototypes. A data prototype is a data value that reflects other values in its class, e.g., the centroid in a K-means clustering problem.

    == Methods ==

    The following are some prototype methods:

      • K-means clustering
      • Learning vector quantization (LVQ)
      • Gaussian mixtures

    == Related Methods ==

    While K-nearest neighbors does not use prototypes, it is similar to prototype methods like K-means clustering.
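To make the idea of a prototype concrete, here is a minimal sketch of a nearest-centroid classifier in standard-library Python. It is an illustration chosen for brevity rather than one of the listed methods: each class is summarized by a single prototype, the mean of its labeled points, and a new point receives the label of the closest prototype. The function names `class_centroids` and `predict` are hypothetical.

```python
from math import dist


def class_centroids(X, y):
    """Return {label: centroid}, one prototype per class (the mean of its points)."""
    sums, counts = {}, {}
    for point, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(point))
        for k, value in enumerate(point):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}


def predict(prototypes, point):
    """Assign the label whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda label: dist(prototypes[label], point))


X_demo = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
y_demo = ["a", "a", "b", "b"]
prototypes = class_centroids(X_demo, y_demo)
print(prototypes)                      # {'a': (1.0, 0.0), 'b': (11.0, 10.0)}
print(predict(prototypes, (3.0, 1.0)))  # a
```

Unlike K-nearest neighbors, which compares a query against every stored training point, this classifier only stores and compares against one value per class.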

    Read more →