AI Code Visualizer

AI Code Visualizer — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • IEEE Transactions on Visualization and Computer Graphics

    IEEE Transactions on Visualization and Computer Graphics (TVCG) is a peer-reviewed scientific journal published by the IEEE Computer Society. It covers subjects related to computer graphics and visualization techniques, systems, software, hardware, and user interface issues. TVCG has been considered the top journal in the field of visualization. Since 2011, TVCG has allowed authors to present recently accepted papers at partner conferences. These include:

      • IEEE Visualization (VIS), including VAST, InfoVis, and SciVis
      • IEEE Virtual Reality Conference (IEEE VR)
      • IEEE International Symposium on Mixed and Augmented Reality (ISMAR)
      • ACM Symposium on Interactive 3D Graphics and Games (I3D)
      • IEEE Pacific Visualization Conference (IEEE PacificVis)
      • ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA)
      • Eurographics Symposium on Geometry Processing (SGP)
      • Pacific Graphics Conference (PG)
      • EuroVis - The EG and VGTC Conference on Visualization
      • Graphics Interface (GI)

  • Unique negative dimension

    Unique negative dimension (UND) is a complexity measure for the model of learning from positive examples. The unique negative dimension of a class $C$ of concepts is the size of the largest subclass $D \subseteq C$ such that for every concept $c \in D$, the set $\bigcap (D \setminus \{c\}) \setminus c$ is nonempty. In other words, every concept $c$ in $D$ has a "unique negative" example: a point that lies in every other concept of $D$ but not in $c$. This concept was originally proposed by M. Gereb-Graus in "Complexity of learning from one-side examples", Technical Report TR-20-89, Harvard University Division of Engineering and Applied Science, 1989.
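    The definition can be checked by brute force on a small finite concept class. The sketch below is only an illustration: the function names, the explicit `domain` argument, and the convention that an intersection over no concepts equals the whole domain are assumptions made here, and the search is exponential in the number of concepts.

```python
from itertools import combinations


def has_unique_negative(subclass, c, domain):
    """True if some point lies in every other concept of `subclass` but not in `c`."""
    # Assumption: the intersection over an empty family is the whole domain.
    common = set(domain)
    for d in subclass:
        if d != c:
            common &= d
    return bool(common - c)


def unique_negative_dimension(concepts, domain):
    """Size of the largest subclass D in which every concept has a unique negative."""
    distinct = list({frozenset(c) for c in concepts})
    # Try the largest subclasses first, so the first hit is the maximum.
    for size in range(len(distinct), 0, -1):
        for subclass in combinations(distinct, size):
            if all(has_unique_negative(subclass, c, domain) for c in subclass):
                return size
    return 0
```

    For example, over the domain {1, 2, 3} the three concepts {2, 3}, {1, 3} and {1, 2} each miss exactly one point that the other two contain, so the whole class qualifies and the function returns 3. A nested chain such as {1}, {1, 2}, {1, 2, 3} only admits a qualifying subclass of size 1.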

  • Receptron

    The receptron (short for "reservoir perceptron") is a data processing model from neuromorphic computing that generalizes the traditional perceptron by incorporating non-linear interactions between inputs. Unlike the classical perceptron, which relies on linearly independent weights, the receptron leverages complexity in physical substrates, such as the electric conduction properties of nanostructured materials or optical speckle fields, to perform classification tasks. The receptron bridges unconventional computing and neural network principles, enabling solutions that do not require the training approaches typical of artificial neural networks based on the perceptron model.

    == Algorithm ==
    The receptron is an algorithm for supervised learning of binary classifiers, that is, a classification algorithm that makes its predictions based on a predictor function combining a set of weights with the feature vector. The mathematical model is based on the sum of inputs with non-linear interactions:

    $$S = \sum_{j=1}^{n} x_j \widetilde{w}_j(\vec{x}), \qquad S \in \mathbb{R} \tag{1}$$

    where $j \in [1, n]$ and the $\widetilde{w}_j$ are non-linear weight functions depending on the inputs $\vec{x}$. Nonlinearity typically makes the system extremely complex and allows for the solution of problems not solvable through the simpler rules of a linear system, such as the perceptron or the McCulloch-Pitts neuron, which is based on the sum of linearly independent weights:

    $$S = \sum_{j=1}^{n} x_j w_j^{p} \tag{2}$$

    where the $w_j^{p}$ are constant real values. A consequence of this simplicity is the limitation to linearly separable functions, which necessitates multi-layer architectures and training algorithms like backpropagation.

    As in the perceptron case, the summation in Eq. 1 drives the activation of the receptron output through a thresholding process:

    $$Y(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } S > \text{th} \\ 0 & \text{if } S \leq \text{th} \end{cases} \tag{3}$$

    where th is a constant threshold parameter. Equation 3 can be written by using the Heaviside step function.

    The weight functions $\widetilde{w}(\vec{x})$ can be written with a finite number of parameters $w_{j_1 \dots j_n}$, simplifying the model representation. One can Taylor-expand $\widetilde{w}(\vec{x})$ and use the idempotency of Boolean variables, $(x_j)^q = x_j \ \forall q \geq 1$, such that $S' = b + \sum_{j=1}^{n} x_j \widetilde{w}_j(\vec{x})$ can be written as

    $$S'(\vec{x}) = b + \sum_{j} w_j x_j + \sum_{j<k} w_{jk} x_j x_k + \sum_{j<k<l} w_{jkl} x_j x_k x_l + \dots$$
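    To make the expansion of S′ concrete, the following minimal sketch evaluates the sum of single, pairwise and higher-order terms for Boolean inputs and applies the threshold. The function names, the dictionary layout for the weights, and the hand-picked weight values are assumptions made for illustration; in a physical receptron the weights come from the substrate rather than being chosen like this.

```python
from itertools import product


def receptron_sum(x, weights, bias=0.0):
    """S'(x) = b + sum over index tuples of w * product of the selected inputs."""
    total = bias
    for indices, w in weights.items():
        term = w
        for j in indices:
            term *= x[j]
        total += term
    return total


def receptron_output(x, weights, bias=0.0, th=0.5):
    """Threshold step: 1 if S' > th, else 0."""
    return 1 if receptron_sum(x, weights, bias) > th else 0


# Hypothetical weights: two linear terms plus one pairwise interaction term.
xor_weights = {(0,): 1.0, (1,): 1.0, (0, 1): -2.0}
truth_table = {x: receptron_output(x, xor_weights) for x in product((0, 1), repeat=2)}
```

    With these weights the single unit reproduces XOR, a function that is not linearly separable and therefore out of reach for a single perceptron with constant weights. The pairwise term is what makes the difference.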

  • Swish function

    The swish function is a family of mathematical functions defined as follows:

    $$\operatorname{swish}_\beta(x) = x \operatorname{sigmoid}(\beta x) = \frac{x}{1 + e^{-\beta x}}$$

    where $\beta$ can be constant (usually set to 1) or trainable, and "sigmoid" refers to the logistic function. The swish family was designed to smoothly interpolate between a linear function and the rectified linear unit (ReLU) function. When considering positive values, swish is a particular case of a doubly parameterized sigmoid shrinkage function. Variants of the swish function include Mish.

    == Special values ==
    For β = 0, the function is linear: f(x) = x/2. For β = 1, the function is the Sigmoid Linear Unit (SiLU). For β = 1.702, the function approximates GELU. With β → ∞, the function converges to ReLU. Thus, the swish family smoothly interpolates between a linear function and the ReLU function.

    Since $\operatorname{swish}_\beta(x) = \operatorname{swish}_1(\beta x)/\beta$, all instances of swish have the same shape as the default $\operatorname{swish}_1$, zoomed by $\beta$. One usually sets $\beta > 0$. When $\beta$ is trainable, this constraint can be enforced by $\beta = e^{b}$, where $b$ is trainable.

    $$\operatorname{swish}_1(x) = \frac{x}{2} + \frac{x^2}{4} - \frac{x^4}{48} + \frac{x^6}{480} + O\left(x^8\right)$$

    $$\begin{aligned}
    \operatorname{swish}_1(x) &= \frac{x}{2} \tanh\left(\frac{x}{2}\right) + \frac{x}{2} \\
    \operatorname{swish}_1(x) + \operatorname{swish}_{-1}(x) &= x \\
    \operatorname{swish}_1(x) - \operatorname{swish}_{-1}(x) &= x \tanh\left(\frac{x}{2}\right)
    \end{aligned}$$

    == Derivatives ==
    Because $\operatorname{swish}_\beta(x) = \operatorname{swish}_1(\beta x)/\beta$, it suffices to calculate the derivatives for the default case.

    $$\operatorname{swish}_1'(x) = \frac{x + \sinh(x)}{4 \cosh^2\left(\frac{x}{2}\right)} + \frac{1}{2}$$

    so $\operatorname{swish}_1'(x) - \frac{1}{2}$ is odd.

    $$\operatorname{swish}_1''(x) = \frac{1 - \frac{x}{2} \tanh\left(\frac{x}{2}\right)}{2 \cosh^2\left(\frac{x}{2}\right)}$$

    so $\operatorname{swish}_1''(x)$ is even.

    == History ==
    SiLU was first proposed alongside the GELU in 2016, and was proposed again in 2017 as the Sigmoid-weighted Linear Unit (SiL) in reinforcement learning. More than a year after its initial discovery, the SiLU/SiL was proposed once more under the name Swish, at first without the learnable parameter β, so that β implicitly equaled 1. The Swish paper was later updated to propose the activation with the learnable parameter β.
    In 2017, after performing analysis on ImageNet data, researchers from Google indicated that using this function as an activation function in artificial neural networks improves performance compared to the ReLU and sigmoid functions. One suggested reason for the improvement is that the swish function helps alleviate the vanishing gradient problem during backpropagation.
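    The definition, the special values of β, and the closed-form first derivative can be checked numerically. This is a minimal sketch using only the standard library; the function names and the overflow-avoiding branch are choices made here, not part of any particular framework's API.

```python
import math


def swish(x, beta=1.0):
    """x * sigmoid(beta * x), arranged so that exp never receives a large positive argument."""
    z = beta * x
    if z >= 0:
        return x / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return x * ez / (1.0 + ez)


def swish1_derivative(x):
    """Closed-form derivative of the default swish (beta = 1)."""
    return (x + math.sinh(x)) / (4.0 * math.cosh(x / 2.0) ** 2) + 0.5


def relu(x):
    return max(0.0, x)


def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used only to cross-check the closed form."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

    With β = 0 the function returns x/2, and with a large β (for instance 50) it is numerically indistinguishable from `relu` away from zero. Comparing `swish1_derivative(x)` with `numerical_derivative(swish, x)` at a few points is a quick sanity check on the derivative formula.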

  • ParkMobile

    ParkMobile is a mobile and web app providing parking payments in North America. The company is headquartered in Atlanta, Georgia. Users can pay for on-street and off-street parking via the app on their smartphone, in a web browser, or by calling a phone number. ParkMobile also offers parking reservations at stadiums and venues for concerts and sporting events, and in metro area garages.

    == History ==
    ParkMobile was founded in the United States in 2008 by Albert Bogaard after originally starting in the Netherlands. The initial product served only zone (on-demand) parkers, and payment for the parking spot was made via a phone call through an IVR system. In 2009, the ParkMobile app was released and the product launched in its first city, Grand Rapids, Michigan. Parking payments have since been accepted through a user's account by connecting a credit card. ParkMobile deployed in Washington, D.C., in 2011. As of 2023, ParkMobile has over 50 million users.

    Parking reservations were introduced in 2017, allowing users to reserve parking in advance. In 2018, the company recapitalized with BMW as the shareholder. ParkMobile was then acquired by a joint venture of BMW and Daimler. Under this joint venture, ParkMobile parking payment functionality was integrated with BMW's navigation system in many of its 2018 models. EasyPark Group, the Sweden-based parking solutions company, acquired ParkMobile in 2021 and is the current owner; the group has since rebranded as Arrive. In 2022, ParkMobile launched in the City of Boston with a city-wide parking app, ParkBoston, powered by ParkMobile.

    == Operations ==
    === Products ===
    ParkMobile's product offerings include zone (on-demand) parking payments, parking reservations, and a self-service reporting engine. Zone parking is the company's most widely used service. Users can use the app on their smartphone to pay parking fees. In 2017, ParkMobile began offering parking reservations.
    The service is provided in addition to on-demand parking options at stadiums and venues, as well as metro area parking garages. After launching the reservations feature, ParkMobile became the first mobile parking app provider in North America to offer both on-demand and reservation parking in one consolidated app. ParkMobile 360, the company's self-service management and reporting platform for operators, launched in 2018. It is a web-based application for parking operators to manage parking inventory, adjust rates, create special parking events, and track analytics. In 2020, ParkMobile began offering an option to pay for parking with Google by integrating the ParkMobile experience with Google Maps. In 2021, ParkMobile launched its web application, allowing users to complete their parking transactions directly from the mobile website without having to download the app or have an account. ParkMobile integrates with parking gate equipment so customers can use the app to pay for parking and scan to enter and exit the garage.

    === Locations ===
    ParkMobile has over 50 million users across the United States, Canada, and Puerto Rico. The app is available in over 550 cities in the U.S. and at over 150 colleges and universities.

    == Controversies ==
    === Predatory towing and excessive ticketing ===
    Since all paid parking sessions from a single supplier can be viewed together, the ease of viewing and enforcing parking violations has caused controversy. Parking Enforcement Services in Birmingham, Alabama, has been the subject of complaints by users of the ParkMobile app who had paid for a parking session and still had their vehicle towed. Customers often use old or expired license plates and forget to update to the correct number, or mistype when entering their information into the ParkMobile app. The complaints are that the towing companies offer no leniency for these mistakes: customers return to their car as the session expires and find that it has been towed.
    Additionally, other municipalities across the country have received complaints about excessive parking tickets being issued to users who entered their information incorrectly in the ParkMobile app. In Stone Harbor, New Jersey, parking ticket violations increased by over 1,600% from the previous year after the ParkMobile app launched. Police officers have responded to complaints of being "too strict" on writing tickets by saying that the ParkMobile system allows officers to "more seamlessly enforce" the city's parking laws.

    === Data security breach ===
    In March 2021, ParkMobile suffered a cybersecurity incident "linked to a vulnerability in a third-party software," potentially exposing users' email addresses, phone numbers, and license plate numbers. ParkMobile responded by launching an investigation and notifying law enforcement authorities and affected municipalities. The investigation concluded that "no sensitive data or Payment Card Information was affected", but ParkMobile confirmed that basic account information, such as license plate numbers and possibly email addresses or phone numbers, was accessed.

  • Andrej Mrvar

    Andrej Mrvar is a Slovenian computer scientist and a professor at the University of Ljubljana's Faculty of Social Sciences. He is known for his work in network analysis, graph drawing, decision making, virtual reality, and the timing and data processing of sports competitions.

    == Education and career ==
    He is well known for his work on Pajek, free software for the analysis and visualization of large networks. Mrvar began work on Pajek in 1996 with Vladimir Batagelj. His book Exploratory Social Network Analysis with Pajek, coauthored with Wouter de Nooy and Vladimir Batagelj, is his most cited work. It was published by Cambridge University Press in three editions (first 2005, second 2011, and third 2018). The book was translated into Japanese (2009) and Chinese (first edition 2012, second 2014). With Anuška Ferligoj, he was a founding co-editor-in-chief of the journal Metodološki zvezki - Advances in Methodology and Statistics.

    == Awards and honors ==
      • Vidmar Award (Faculty of Electrical and Computer Engineering, University of Ljubljana): 1988, 1990
      • First prizes for contributions (with Vladimir Batagelj) to Graph Drawing Contests in years: 1995, 1996, 1997, 1998, 1999, 2000 and 2005 / Graph Drawing Hall of Fame
      • Award of University of Ljubljana for contributions in education and research (Svečana listina Univerze v Ljubljani za pomembne dosežke na področju vzgojnoizobraževalnega in znanstvenoraziskovalega dela): 2001
      • The INSNA's William D. Richards Software award for work on Pajek (with Vladimir Batagelj): 2013
      • Award of Faculty of Social Sciences, University of Ljubljana for scientific excellence (Priznanje za znanstveno odličnost): 2013

    == Selected publications ==
      • Wouter de Nooy, Andrej Mrvar, Vladimir Batagelj, Mark Granovetter (Series Editor), Exploratory Social Network Analysis with Pajek (Structural Analysis in the Social Sciences), Cambridge University Press (First Edition: 2005, Second Edition: 2011, Third Edition: 2018). Japanese Translation (2010). Chinese Translation (First Edition: 2012, Second Edition: 2014)
      • Andrej Mrvar and Vladimir Batagelj, Analysis and visualization of large networks with program package Pajek, Complex Adaptive Systems Modeling, 4:6, SpringerOpen, 2016
      • Vladimir Batagelj and Andrej Mrvar, Some Analyses of Erdős Collaboration Graph, Social Networks, 22, 173–186, 2000
      • Vladimir Batagelj and Andrej Mrvar, A Subquadratic Triad Census Algorithm for Large Sparse Networks with Small Maximum Degree, Social Networks, 23, 237–243, 2001
      • Patrick Doreian and Andrej Mrvar, A Partitioning Approach to Structural Balance, Social Networks, 18, 149–168, 1996
      • Patrick Doreian and Andrej Mrvar, Partitioning Signed Social Networks, Social Networks, 31, 1–11, 2009
      • Andrej Mrvar and Patrick Doreian, Partitioning Signed Two-Mode Networks, Journal of Mathematical Sociology, 33, 196–221, 2009
      • Patrick Doreian and Andrej Mrvar, The international reach of the Koch brothers network. In: Antonyuk, A. and Basov, N. (Eds.): Networks in the Global World V. NetGloW 2020. Lecture Notes in Networks and Systems, 181, 225–235, Springer, 2021
      • Patrick Doreian and Andrej Mrvar, Delineating Changes in the Fundamental Structure of Signed Networks, Frontiers in Physics, 294, 1–11, 2021
      • Patrick Doreian and Andrej Mrvar, Hubs and Authorities in the Koch Brothers Network, Social Networks, 64, 148–157, 2021
      • Patrick Doreian and Andrej Mrvar, Public issues, policy proposals, social movements, and the interests of the Koch Brothers network of allies, Quality and Quantity, 56, 305–322, 2022
      • Douglas R. White, Vladimir Batagelj, Andrej Mrvar, Analyzing Large Kinship and Marriage Networks with Pgraph and Pajek, Social Science Computer Review, 17, 245–274, 1999
      • Ion Georgiou, Ronald Concer, Andrej Mrvar, A Systemic Approach to Sociometric Group Research: Advancing The Work of Leslie Day Zeleny, 1939–1947, Social Networks, 63, 174–200, 2020

  • Evolutionary attractor

    An evolutionary attractor is a point in an evolutionary space where a selection process will always drive trait values towards that point from the region around it. Because of the importance of evolution through natural selection, such an evolutionary space will often be defined by genetic or phenotypic traits, or possibly both. In this case the selection process will be a form of natural selection.

    The existence of an evolutionary attractor in a biological evolutionary space does not always imply that it can be reached from all points in that space, nor does it identify what will happen when the attractor is reached. While an evolutionary attractor may represent a point in evolutionary space that is resistant to further selection, such as an evolutionarily stable strategy, other possibilities are available.

    Because identifying an evolutionary attractor does not on its own describe everything about the evolutionary space in which it lies, there is interest in the evolutionary dynamics surrounding evolutionary attractors and in evolutionary spaces in general. (Theoretical biologists and mathematicians working in the area may prefer the terms adaptive dynamics or evolutionary invasion analysis to evolutionary dynamics.) These fields use differential equations, which allow a more complete understanding of the dynamics in evolutionary spaces, including the existence or otherwise of evolutionary attractors. Advances in the study of molecular evolution have also led to the identification of evolutionary attractors at a molecular level.

    Biological evolutionary processes have been studied using evolutionary game theory, a technique inspired by game theory, which was originally developed to address economic problems. As a result, evolutionary attractors are found not only in biology: economists studying evolutionary economic models have also identified them.
    Evolution in biology has also inspired evolutionary computation in computer science. Many algorithms in this field use a form of selection inspired by natural selection to generate results through evolutionary algorithms. This is therefore another area in which evolutionary attractors have been identified.

    == Evolutionary attractors in biology ==
    It is probably not surprising that biology is the field where most examples of evolutionary attractors have been identified, given the importance of evolution through natural selection.

    === Evolutionary attractors in adaptive landscapes ===
    An evolutionary attractor is a point in genetic and/or phenotypic trait space that evolution will always drive trait values towards via a selection process. The concept of an evolutionary attractor arose in population genetics following the introduction of the adaptive landscape, proposed by Sewall Wright in 1932. The height of a point in an adaptive landscape is a measure of evolutionary fitness. If a point in an adaptive landscape is a peak, then selection will always drive traits towards it and it will be an evolutionary attractor. While population genetics deals with discrete genetic traits, quantitative genetics extended such concepts to deal with continuous genetic traits, where the concept of an evolutionary attractor is also valid.

    === Evolutionary attractors in evolutionary game models ===
    Evolutionary game theory introduced into evolutionary biology concepts originally used in economics, with the advantage that evolution could be studied in relation to strategic choices made in animal conflicts. This is of particular interest because of the concept of the evolutionarily stable strategy or ESS, a strategy that once established is resistant to invasion by other strategies. ESSs will not always be evolutionary attractors, but if they are, they will persist over evolutionary time.
    === Dynamics around evolutionary attractors in biology ===
    Evolutionary attractors in biology do not exist in isolation. By definition they must exist in an evolutionary trait space where selection drives all traits towards them from a region immediately around them. That is, they must be convergence stable. Eshel (1983) modified the definition of an ESS by considering individually advantageous reduction from a majority deviation, and coined the term continuous stability. A continuously stable ESS can be shown to be convergence stable, and therefore it will act as an evolutionary attractor. But the nature of evolutionary trait spaces in biology means that it is not possible to guarantee that the region of convergence to the evolutionary attractor covers the whole of the trait space, nor that there is only one evolutionary attractor in a particular trait space. These issues have led to the emergence of the related fields of evolutionary dynamics, adaptive dynamics and evolutionary invasion analysis, all of which use differential equations to understand the dynamics in evolutionary trait spaces. Hence, if one or more evolutionary attractors exist in an evolutionary trait space, these fields provide techniques to understand the dynamics in that trait space around the evolutionary attractors.

    === Evolutionary attractors in an ecological context ===
    Evolution in biology does not take place in single species in isolation. Ecological interaction of species leads to coevolution. Important examples of this are host-parasite and host-pathogen interactions, which can make the dynamics around evolutionary attractors more complex and the occurrence and number of evolutionary attractors more diverse. Evolutionary attractors have been identified in the analysis of the evolutionary epidemiology of plant pathogens. In that study, which worked on plant populations, the authors were able to identify evolutionary attractors using methods from adaptive dynamics.
    A model applied to the analysis of a maize (Zea mays L.) virus identified convergence stable equilibria through simulation modelling. A related model identified evolutionary attractors in the interaction of plants with fungal pathogens.

    === Evolutionary attractors in molecular genetics ===
    As mentioned above, much of the consideration of evolutionary attractors in biology has been through investigation of selection at a genetic or phenotypic level or both, in a single species or in coevolving species. Advances in the study of molecular genetics now allow the study of evolutionary attractors to be taken to a molecular genetic level. Wilson et al. (2019) studied the evolution of gene regulatory networks and identified the emergence of evolutionary attractors.

    == Evolutionary attractors in economics ==
    Evolutionary game theory as applied in biology was inspired by game theory originally devised for applications in economics. Game theory remains an active field of research outside of biology, and thus it is not surprising that researchers in evolutionary economics use evolutionary game theory. Evolutionary attractors have been demonstrated by economists studying the evolutionary dynamics of market entry, with market dynamics based on the replicator dynamics of biological evolutionary games.

    == Evolutionary attractors in computing ==
    Evolutionary computation is a branch of computer science inspired by biological evolution. Many algorithms in evolutionary computation use a form of selection. Thus evolutionary attractors have been identified in computer science as well as in biology and economics. Evolutionary algorithms have generated evolutionary attractors, probably because of the similarity between adaptive hill-climbing in evolutionary heuristics and the adaptive landscape introduced to explain evolution through natural selection.
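    The idea that a peak in an adaptive landscape attracts trait values only from its surrounding region can be illustrated with a toy hill-climbing model. Everything in this sketch is an assumption chosen for illustration: the one-dimensional landscape f(z) = -(z² - 1)², the rule that the trait moves in proportion to the fitness slope, and the step size. It is not a model taken from any of the studies mentioned above.

```python
def fitness_gradient(z):
    """Slope of the toy landscape f(z) = -(z**2 - 1)**2, with peaks at z = -1 and z = +1."""
    return -4.0 * z * (z * z - 1.0)


def evolve(z0, rate=0.01, steps=2000):
    """Move the trait uphill in small steps, starting from trait value z0."""
    z = z0
    for _ in range(steps):
        z += rate * fitness_gradient(z)
    return z


# Two starting points on either side of the fitness valley at z = 0.
right_basin = evolve(0.3)
left_basin = evolve(-0.3)
```

    Starting at 0.3 the trait climbs to the peak at +1, while starting at -0.3 it climbs to the peak at -1. Each peak acts as an attractor only for its own basin, which matches the caveat that the region of convergence need not cover the whole trait space and that a trait space may hold more than one attractor.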

  • Margin-infused relaxed algorithm

    Margin-infused relaxed algorithm (MIRA) is an online machine learning algorithm for multiclass classification problems. It is designed to learn a set of parameters (a vector or matrix) by processing the given training examples one by one and updating the parameters after each example, so that the current training example is classified correctly with a margin against incorrect classifications at least as large as their loss. The change in the parameters is kept as small as possible.

    A two-class version called binary MIRA simplifies the algorithm by not requiring the solution of a quadratic programming problem (see below). When used in a one-vs-all configuration, binary MIRA can be extended to a multiclass learner that approximates full MIRA but may be faster to train.

    The update step for the current training example $(x_t, y_t)$ is formalized as a quadratic programming problem: find

    $$\min \left\| w^{(i+1)} - w^{(i)} \right\|$$

    subject to

    $$\operatorname{score}(x_t, y_t) - \operatorname{score}(x_t, y') \geq L(y_t, y') \quad \forall y',$$

    that is, the score of the correct label $y_t$ must exceed the score of any other possible label $y'$ by at least the loss (number of errors) of $y'$ in comparison to $y_t$.
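    The update can be sketched in code under a strong simplification. Instead of solving the full quadratic program, the version below enforces only the constraint for the highest-scoring wrong label, which has a closed-form solution. The linear per-label scoring, the constant loss of 1 for any wrong label, and all function names are assumptions made for this illustration, so it should be read as an approximation of MIRA rather than the full algorithm.

```python
def score(w, x, label):
    """Linear score of `label` for feature vector x; w maps each label to a weight list."""
    return sum(wi * xi for wi, xi in zip(w[label], x))


def mira_update(w, x, y, loss=1.0):
    """Smallest change to w that gives y a margin of `loss` over its strongest rival."""
    rivals = [label for label in w if label != y]
    y_bad = max(rivals, key=lambda label: score(w, x, label))
    violation = loss - (score(w, x, y) - score(w, x, y_bad))
    sq_norm = sum(xi * xi for xi in x)
    if violation > 0 and sq_norm > 0:
        # Moving both weight vectors by tau * x raises the margin by 2 * tau * |x|^2.
        tau = violation / (2.0 * sq_norm)
        w[y] = [wi + tau * xi for wi, xi in zip(w[y], x)]
        w[y_bad] = [wi - tau * xi for wi, xi in zip(w[y_bad], x)]
    return w


def train(examples, labels, dim, epochs=5):
    """Process the examples one by one, for a fixed number of passes."""
    w = {label: [0.0] * dim for label in labels}
    for _ in range(epochs):
        for x, y in examples:
            mira_update(w, x, y)
    return w


def predict(w, x):
    return max(w, key=lambda label: score(w, x, label))
```

    After a single call to `mira_update`, the margin between the correct label and the rival that was selected is at least the loss, while the margins against other labels may still be smaller. That gap is exactly what the full quadratic program closes by handling all constraints at once.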

  • Biorobotics

    Biorobotics is an interdisciplinary science that combines the fields of biomedical engineering, cybernetics, and robotics to develop new technologies that integrate biology with mechanical systems. Its aims include more efficient communication, the alteration of genetic information, and the creation of machines that imitate biological systems.

    == Cybernetics ==
    Cybernetics focuses on communication and systems in living organisms and machines, and can be applied to and combined with multiple fields of study such as biology, mathematics, computer science, and engineering. The discipline falls under biorobotics because it studies biological bodies and mechanical systems together. Studying these two kinds of system allows for advanced analysis of the functions and processes of each, as well as the interactions between them.

    === History ===
    Cybernetic theory is a concept that has existed for centuries, dating back to Plato, who applied the term to the "governance of people". The term cybernétique was used in the mid-1800s by the physicist André-Marie Ampère. The term cybernetics was popularized in the late 1940s to refer to a discipline that touched on, but was separate from, established disciplines such as electrical engineering, mathematics, and biology.

    === Science ===
    Cybernetics is often misunderstood because of the breadth of disciplines it covers. In the early 20th century, it was conceived as an interdisciplinary field of study that combines biology, science, network theory, and engineering. Today, it covers all scientific fields with system-related processes. The goal of cybernetics is to analyze the processes of any system or set of systems in an attempt to make them more efficient and effective.
    === Applications ===
    Cybernetics is used as an umbrella term, so its applications extend to all systems-related scientific fields, such as biology, mathematics, computer science, engineering, management, psychology, sociology, and art. It is used across these fields to discover principles of systems, the adaptation of organisms, information analysis, and more.

    == Genetic engineering ==
    Genetic engineering is a field that uses advances in technology to modify biological organisms. Through different methods, scientists are able to alter the genetic material of microorganisms, plants and animals to provide them with desirable traits. For example, plants can be made to grow bigger, better, and faster. Genetic engineering is included in biorobotics because it uses new technologies to alter biology and change an organism's DNA for its own and society's benefit.

    === History ===
    Although humans have modified the genetic material of animals and plants through artificial selection for millennia (such as the genetic mutations that developed teosinte into corn and wolves into dogs), genetic engineering refers to the deliberate alteration or insertion of specific genes into an organism's DNA. The first successful case of genetic engineering occurred in 1973, when Herbert Boyer and Stanley Cohen were able to transfer a gene with antibiotic resistance to a bacterium.

    === Science ===
    There are three main techniques used in genetic engineering: the plasmid method, the vector method and the biolistic method.

    ==== Plasmid method ====
    This technique is used mainly for microorganisms such as bacteria. In this method, DNA molecules called plasmids are extracted from bacteria and placed in a lab where restriction enzymes break them down. As the enzymes do this, some fragments develop a rough, staircase-like edge that is considered 'sticky' and capable of reconnecting.
    These 'sticky' molecules are inserted into another bacterium, where they will connect to the DNA rings with the altered genetic material.

    ==== Vector method ====
    The vector method is considered a more precise technique than the plasmid method, as it involves the transfer of a specific gene instead of a whole sequence. In the vector method, a specific gene from a DNA strand is isolated through restriction enzymes in a laboratory and is inserted into a vector. Once the vector accepts the genetic code, it is inserted into the host cell, where the DNA will be transferred.

    ==== Biolistic method ====
    The biolistic method is typically used to alter the genetic material of plants. This method attaches the desired DNA to a metallic particle, such as gold or tungsten, which is loaded into a high-speed gun. The particle is then fired into the plant. Due to the high velocities and the vacuum generated during bombardment, the particle is able to penetrate the cell wall and insert the new DNA into the cell.

    === Applications ===
    Genetic engineering has many uses in the fields of medicine, research and agriculture. In the medical field, genetically modified bacteria are used to produce drugs such as insulin, human growth hormones and vaccines. In research, scientists genetically modify organisms to observe physical and behavioral changes in order to understand the function of specific genes. In agriculture, genetic engineering is extremely important, as it is used by farmers to grow crops that are resistant to herbicides and to insects, such as Bt corn.

    == Bionics ==
    Bionics is a medical engineering field and a branch of biorobotics consisting of electrical and mechanical systems that imitate biological systems, such as prosthetics and hearing aids. The name is a portmanteau of biology and electronics.

    === History ===
    The history of bionics goes as far back in time as ancient Egypt. A prosthetic toe made out of wood and leather was found on the foot of a mummy.
The time period of the mummy corpse was estimated to be from around the fifteenth century B.C. Bionics can also be witnessed in ancient Greece and Rome. Prosthetic legs and arms were made for amputee soldiers. In the early 16th century, a French military surgeon by the name of Ambroise Pare became a pioneer in the field of bionics. He was known for making various types of upper and lower prosthetics. One of his most famous prosthetics, Le Petit Lorrain, was a mechanical hand operated by catches and springs. During the early 19th century, Alessandro Volta further progressed bionics. He set the foundation for the creation of hearing aids with his experiments. He found that electrical stimulation could restore hearing by inserting an electrical implant to the saccular nerve of a patient's ear. In 1945, the National Academy of Sciences created the Artificial Limb Program, which focused on improving prosthetics since there were a large number of World War II amputee soldiers. Since this creation, prosthetic materials, computer design methods, and surgical procedures have improved, creating modern-day bionics. === Science === ==== Prosthetics ==== The important components that make up modern-day prosthetics are the pylon, the socket, and the suspension system. The pylon is the internal frame of the prosthetic that is made up of metal rods or carbon-fiber composites. The socket is the part of the prosthetic that connects the prosthetic to the person's missing limb. The socket consists of a soft liner that makes the fit comfortable, but also snug enough to stay on the limb. The suspension system is important in keeping the prosthetic on the limb. The suspension system is usually a harness system made up of straps, belts or sleeves that are used to keep the limb attached. The operation of a prosthetic could be designed in various ways. The prosthetic could be body-powered, externally-powered, or myoelectrically powered. 
Body-powered prosthetics consist of cables attached to a strap or harness, which is placed on the person's functional shoulder, allowing the person to manipulate and control the prosthetic as he or she deems fit. Externally-powered prosthetics consist of motors to power the prosthetic and buttons and switches to control the prosthetic. Myoelectrically powered prosthetics are new, advanced forms of prosthetics where electrodes are placed on the muscles above the limb. The electrodes will detect the muscle contractions and send electrical signals to the prosthetic to move the prosthetic. The downside to this type of prosthetic is that if the sensors are not placed correctly on the limb then the electrical impulses will fail to move the prosthetic. TrueLimb is a specific brand of prosthetics that uses myoelectrical sensors which enable a person to have control of their bionic limb. ==== Hearing aids ==== Four major components make up the hearing aid: the microphone, the amplifier, the receiver, and the battery. The microphone takes in outside sound, turns that sound to electrical signals, and sends those signals to the amplifier. The amplifier increases the sound and sends that sound to the receiver. The receiver changes the electrical signal back into sound and sends the sound into the ear. Hair cells in the ear will sense the vibrations from the sound, convert the vibrations into nerve signals, and send it to the brain so

    Read more →
  • AdaBoost

    AdaBoost

    AdaBoost (short for Adaptive Boosting) is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many types of learning algorithm to improve performance. The output of multiple weak learners is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals of real values. AdaBoost is adaptive in the sense that subsequent weak learners (models) are adjusted in favor of instances misclassified by previous models. In some problems, it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner. Although AdaBoost is typically used to combine weak base learners (such as decision stumps), it has been shown to also effectively combine strong base learners (such as deeper decision trees), producing an even more accurate model. Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm such that later trees tend to focus on harder-to-classify examples. == Training == AdaBoost refers to a particular method of training a boosted classifier. 
A boosted classifier is a classifier of the form $F_T(x)=\sum_{t=1}^{T}f_t(x)$ where each $f_t$ is a weak learner that takes an object $x$ as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification. Each weak learner produces an output hypothesis $h$ which fixes a prediction $h(x_i)$ for each sample in the training set. At each iteration $t$, a weak learner is selected and assigned a coefficient $\alpha_t$ such that the total training error $E_t$ of the resulting $t$-stage boosted classifier is minimized: $E_t=\sum_i E[F_{t-1}(x_i)+\alpha_t h(x_i)]$. Here $F_{t-1}(x)$ is the boosted classifier that has been built up to the previous stage of training and $f_t(x)=\alpha_t h(x)$ is the weak learner that is being considered for addition to the final classifier. === Weighting === At each iteration of the training process, a weight $w_{i,t}$ is assigned to each sample in the training set equal to the current error $E(F_{t-1}(x_i))$ on that sample. These weights can be used in the training of the weak learner. For instance, decision trees can be grown which favor the splitting of sets of samples with large weights. 
== Derivation == This derivation follows Rojas (2009): Suppose we have a data set $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ where each item $x_i$ has an associated class $y_i\in\{-1,1\}$, and a set of weak classifiers $\{k_1,\ldots,k_L\}$ each of which outputs a classification $k_j(x_i)\in\{-1,1\}$ for each item. After the $(m-1)$-th iteration our boosted classifier is a linear combination of the weak classifiers of the form:
$$C_{(m-1)}(x_i)=\alpha_1 k_1(x_i)+\cdots+\alpha_{m-1}k_{m-1}(x_i),$$
where the class will be the sign of $C_{(m-1)}(x_i)$. At the $m$-th iteration we want to extend this to a better boosted classifier by adding another weak classifier $k_m$, with another weight $\alpha_m$:
$$C_m(x_i)=C_{(m-1)}(x_i)+\alpha_m k_m(x_i)$$
So it remains to determine which weak classifier is the best choice for $k_m$, and what its weight $\alpha_m$ should be. 
We define the total error $E$ of $C_m$ as the sum of its exponential loss on each data point, given as follows:
$$E=\sum_{i=1}^{N}e^{-y_i C_m(x_i)}=\sum_{i=1}^{N}e^{-y_i C_{(m-1)}(x_i)}e^{-y_i\alpha_m k_m(x_i)}$$
Letting $w_i^{(1)}=1$ and $w_i^{(m)}=e^{-y_i C_{m-1}(x_i)}$ for $m>1$, we have:
$$E=\sum_{i=1}^{N}w_i^{(m)}e^{-y_i\alpha_m k_m(x_i)}$$
We can split this summation between those data points that are correctly classified by $k_m$ (so $y_i k_m(x_i)=1$) and those that are misclassified (so $y_i k_m(x_i)=-1$):
$$\begin{aligned}E&=\sum_{y_i=k_m(x_i)}w_i^{(m)}e^{-\alpha_m}+\sum_{y_i\neq k_m(x_i)}w_i^{(m)}e^{\alpha_m}\\&=\sum_{i=1}^{N}w_i^{(m)}e^{-\alpha_m}+\sum_{y_i\neq k_m(x_i)}w_i^{(m)}\left(e^{\alpha_m}-e^{-\alpha_m}\right)\end{aligned}$$
Since the only part of the right-hand side of this equation that depends on $k_m$ is $\sum_{y_i\neq k_m(x_i)}w_i^{(m)}$, we see that the $k_m$ that minimizes $E$ is the one in the set $\{k_1,\ldots,k_L\}$ that minimizes $\sum_{y_i\neq k_m(x_i)}w_i^{(m)}$ (assuming that $\alpha_m>0$), i.e. the weak classifier with the lowest weighted error (with weights $w_i^{(m)}=e^{-y_i C_{m-1}(x_i)}$). To determine the desired weight $\alpha_m$ that minimizes $E$ with the $k_m$ that we just determined, we differentiate:
$$\frac{dE}{d\alpha_m}=\frac{d\left(\sum_{y_i=k_m(x_i)}w_i^{(m)}e^{-\alpha_m}+\sum_{y_i\neq k_m(x_i)}w_i^{(m)}e^{\alpha_m}\right)}{d\alpha_m}$$
The value of $\alpha_m$ that minimizes the above expression is:
$$\alpha_m=\frac{1}{2}\ln\left(\frac{\sum_{y_i=k_m(x_i)}w_i^{(m)}}{\sum_{y_i\neq k_m(x_i)}w_i^{(m)}}\right)$$
We calculate the weighted error rate of the weak classifier to be $\epsilon_m=\frac{\sum_{y_i\neq k_m(x_i)}w_i^{(m)}}{\sum_{i=1}^{N}w_i^{(m)}}$, so it follows that:
$$\alpha_m=\frac{1}{2}\ln\left(\frac{1-\epsilon_m}{\epsilon_m}\right)$$
which is the negative logit function multiplied by 0.5. Due to the convexity of $E$ as a function of $\alpha_m$, this new expression for $\alpha_m$ gives the global minimum of the loss function. 
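The closed-form stage weight can be checked numerically. The following is a minimal sketch, assuming a hypothetical five-sample toy data set with uniform weights and a weak classifier that makes one mistake; the function names are illustrative and not part of any library.

```python
import math

def weighted_error(weights, labels, predictions):
    """Share of the total weight that sits on misclassified samples (epsilon_m)."""
    total = sum(weights)
    wrong = sum(w for w, y, k in zip(weights, labels, predictions) if y != k)
    return wrong / total

def stage_weight(epsilon):
    """alpha_m = 0.5 * ln((1 - epsilon) / epsilon), valid for 0 < epsilon < 1."""
    return 0.5 * math.log((1.0 - epsilon) / epsilon)

def exponential_loss(weights, labels, predictions, alpha):
    """E = sum_i w_i * exp(-y_i * alpha * k(x_i))."""
    return sum(w * math.exp(-y * alpha * k)
               for w, y, k in zip(weights, labels, predictions))

# Hypothetical toy data: labels and predictions take values in {-1, +1}.
weights = [1.0, 1.0, 1.0, 1.0, 1.0]
labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]  # only the last sample is misclassified

epsilon = weighted_error(weights, labels, predictions)
alpha = stage_weight(epsilon)
loss_at_alpha = exponential_loss(weights, labels, predictions, alpha)

# Because E is convex in alpha, nearby values should give a larger loss.
loss_below = exponential_loss(weights, labels, predictions, alpha - 0.1)
loss_above = exponential_loss(weights, labels, predictions, alpha + 0.1)
```

With one of five equally weighted samples misclassified, the weighted error is 0.2 and the stage weight is half the logarithm of 4; moving the weight in either direction increases the exponential loss, as the convexity argument predicts.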
Note: This derivation only applies when $k_m(x_i)\in\{-1,1\}$, though it can be a good starting guess in other cases, such as when the weak learner is biased ($k_m(x)\in\{a,b\},\ a\neq -b$), has multiple leaves ($k_m(x)\in\{a,b,\dots,n\}$) or is some other function $k_m(x)\in\mathbb{R}$. Thus we have derived the AdaBoost algorithm: At each

    Read more →
  • Huber loss

    Huber loss

    In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in data than the squared error loss. A variant for classification is also sometimes used. == Definition == The Huber loss function describes the penalty incurred by an estimation procedure $f$. Huber (1964) defines the loss function piecewise by
$$L_\delta(a)=\begin{cases}\frac{1}{2}a^2 & \text{for } |a|\leq\delta,\\ \delta\cdot\left(|a|-\frac{1}{2}\delta\right), & \text{otherwise.}\end{cases}$$
This function is quadratic for small values of $a$, and linear for large values, with equal values and slopes of the different sections at the two points where $|a|=\delta$. The variable $a$ often refers to the residuals, that is to the difference between the observed and predicted values $a=y-f(x)$, so the former can be expanded to
$$L_\delta(y,f(x))=\begin{cases}\frac{1}{2}\left(y-f(x)\right)^2 & \text{for } \left|y-f(x)\right|\leq\delta,\\ \delta\cdot\left(\left|y-f(x)\right|-\frac{1}{2}\delta\right), & \text{otherwise.}\end{cases}$$
The Huber loss is the convolution of the absolute value function with the rectangular function, scaled and translated. Thus it "smoothens out" the former's corner at the origin. == Motivation == Two very commonly used loss functions are the squared loss, $L(a)=a^2$, and the absolute loss, $L(a)=|a|$. The squared loss function results in an arithmetic mean-unbiased estimator, and the absolute-value loss function results in a median-unbiased estimator (in the one-dimensional case, and a geometric median-unbiased estimator for the multi-dimensional case). 
The squared loss has the disadvantage that it tends to be dominated by outliers. When summing over a set of $a$'s (as in $\sum_{i=1}^{n}L(a_i)$), the sample mean is influenced too much by a few particularly large $a$-values when the distribution is heavy tailed: in terms of estimation theory, the asymptotic relative efficiency of the mean is poor for heavy-tailed distributions. As defined above, the Huber loss function is strongly convex in a uniform neighborhood of its minimum $a=0$; at the boundary of this uniform neighborhood, the Huber loss function has a differentiable extension to an affine function at points $a=-\delta$ and $a=\delta$. These properties allow it to combine much of the sensitivity of the mean-unbiased, minimum-variance estimator of the mean (using the quadratic loss function) and the robustness of the median-unbiased estimator (using the absolute value function). == Pseudo-Huber loss function == The Pseudo-Huber loss function can be used as a smooth approximation of the Huber loss function. It combines the best properties of L2 squared loss and L1 absolute loss by being strongly convex when close to the target/minimum and less steep for extreme values. The scale at which the Pseudo-Huber loss function transitions from L2 loss for values close to the minimum to L1 loss for extreme values and the steepness at extreme values can be controlled by the $\delta$ value. The Pseudo-Huber loss function ensures that derivatives are continuous for all degrees. It is defined as
$$L_\delta(a)=\delta^2\left(\sqrt{1+(a/\delta)^2}-1\right).$$
As such, this function approximates $a^2/2$ for small values of $a$, and approximates a straight line with slope $\delta$ for large values of $a$. While the above is the most common form, other smooth approximations of the Huber loss function also exist. == Variant for classification == For classification purposes, a variant of the Huber loss called modified Huber is sometimes used. Given a prediction $f(x)$ (a real-valued classifier score) and a true binary class label $y\in\{+1,-1\}$, the modified Huber loss is defined as
$$L(y,f(x))=\begin{cases}\max(0,1-y\,f(x))^2 & \text{for } y\,f(x)>-1,\\ -4y\,f(x) & \text{otherwise.}\end{cases}$$
The term $\max(0,1-y\,f(x))$ is the hinge loss used by support vector machines; the quadratically smoothed hinge loss is a generalization of $L$. == Applications == The Huber loss function is used in robust statistics, M-estimation and additive modelling.
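
The three definitions above translate directly into code. This is a minimal sketch in plain Python; the function names and the default `delta=1.0` are assumptions made for illustration, not taken from any particular library.

```python
import math

def huber(a, delta=1.0):
    """Huber loss: quadratic for |a| <= delta, linear beyond that."""
    if abs(a) <= delta:
        return 0.5 * a * a
    return delta * (abs(a) - 0.5 * delta)

def pseudo_huber(a, delta=1.0):
    """Smooth approximation: delta^2 * (sqrt(1 + (a/delta)^2) - 1)."""
    return delta ** 2 * (math.sqrt(1.0 + (a / delta) ** 2) - 1.0)

def modified_huber(y, score):
    """Classification variant for a label y in {-1, +1} and a real-valued score."""
    margin = y * score
    if margin > -1:
        return max(0.0, 1.0 - margin) ** 2
    return -4.0 * margin

# Residuals inside and outside the quadratic region (delta = 1).
small = huber(0.5)         # quadratic branch
large = huber(3.0)         # linear branch
at_boundary = huber(1.0)   # both branches agree at |a| = delta
```

Evaluating the two branches at the boundary residual gives the same value, which is the continuity property stated in the definition; the same holds for the modified Huber loss at a margin of -1.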

    Read more →
  • Vanishing gradient problem

    Vanishing gradient problem

    In machine learning, the vanishing gradient problem is the problem of greatly diverging gradient magnitudes between earlier and later layers encountered when training neural networks with backpropagation. In such methods, each weight is updated in proportion to the partial derivative of the loss function with respect to that weight. As the number of forward propagation steps in a network increases, for instance due to greater network depth, the gradients of earlier weights are calculated with increasingly many multiplications. These multiplications shrink the gradient magnitude. Consequently, the gradients of earlier weights will be exponentially smaller than the gradients of later weights. This difference in gradient magnitude might introduce instability in the training process, slow it, or halt it entirely. For instance, consider the hyperbolic tangent activation function. The derivative of this function lies in the interval (0, 1]. The product of repeated multiplication with such derivatives decreases exponentially. The inverse problem, when weight gradients at earlier layers get exponentially larger, is called the exploding gradient problem. Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's Diplom thesis of 1991 formally identified the reason for this failure in the "vanishing gradient problem", which affects not only many-layered feedforward networks but also recurrent networks. The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time-step of an input sequence processed by the network (the combination of unfolding and backpropagation is termed backpropagation through time). == Prototypical models == This section is based on the paper On the difficulty of training Recurrent Neural Networks by Pascanu, Mikolov, and Bengio. 
=== Recurrent network model === A generic recurrent network has hidden states $h_1,h_2,\dots$, inputs $u_1,u_2,\dots$, and outputs $x_1,x_2,\dots$. Let it be parameterized by $\theta$, so that the system evolves as
$$(h_t,x_t)=F(h_{t-1},u_t,\theta)$$
Often, the output $x_t$ is a function of $h_t$, as some $x_t=G(h_t)$. The vanishing gradient problem already presents itself clearly when $x_t=h_t$, so we simplify our notation to the special case with:
$$x_t=F(x_{t-1},u_t,\theta)$$
Now, take its differential:
$$\begin{aligned}dx_t&=\nabla_\theta F(x_{t-1},u_t,\theta)\,d\theta+\nabla_x F(x_{t-1},u_t,\theta)\,dx_{t-1}\\&=\nabla_\theta F(x_{t-1},u_t,\theta)\,d\theta+\nabla_x F(x_{t-1},u_t,\theta)\left[\nabla_\theta F(x_{t-2},u_{t-1},\theta)\,d\theta+\nabla_x F(x_{t-2},u_{t-1},\theta)\,dx_{t-2}\right]\\&\;\;\vdots\\&=\left[\nabla_\theta F(x_{t-1},u_t,\theta)+\nabla_x F(x_{t-1},u_t,\theta)\nabla_\theta F(x_{t-2},u_{t-1},\theta)+\cdots\right]d\theta\end{aligned}$$
Training the network requires us to define a loss function to be minimized. 
Let it be $L(x_T,u_1,\dots,u_T)$, then minimizing it by gradient descent gives
$$\Delta\theta=-\eta\cdot\left[\nabla_x L(x_T)\left(\nabla_\theta F(x_{t-1},u_t,\theta)+\nabla_x F(x_{t-1},u_t,\theta)\nabla_\theta F(x_{t-2},u_{t-1},\theta)+\cdots\right)\right]^{T}$$
where $\eta$ is the learning rate. The vanishing/exploding gradient problem appears because there are repeated multiplications, of the form
$$\nabla_x F(x_{t-1},u_t,\theta)\,\nabla_x F(x_{t-2},u_{t-1},\theta)\,\nabla_x F(x_{t-3},u_{t-2},\theta)\cdots$$
==== Example: recurrent network with sigmoid activation ==== For a concrete example, consider a typical recurrent network defined by
$$x_t=F(x_{t-1},u_t,\theta)=W_{\text{rec}}\,\sigma(x_{t-1})+W_{\text{in}}u_t+b$$
where $\theta=(W_{\text{rec}},W_{\text{in}})$ is the network parameter, $\sigma$ is the sigmoid activation function, applied to each vector coordinate separately, and $b$ is the bias vector. 
Then, $\nabla_x F(x_{t-1},u_t,\theta)=W_{\text{rec}}\operatorname{diag}(\sigma'(x_{t-1}))$, and so
$$\begin{aligned}&\nabla_x F(x_{t-1},u_t,\theta)\,\nabla_x F(x_{t-2},u_{t-1},\theta)\cdots\nabla_x F(x_{t-k},u_{t-k+1},\theta)\\&=W_{\text{rec}}\operatorname{diag}(\sigma'(x_{t-1}))\,W_{\text{rec}}\operatorname{diag}(\sigma'(x_{t-2}))\cdots W_{\text{rec}}\operatorname{diag}(\sigma'(x_{t-k}))\end{aligned}$$
Since $\left|\sigma'\right|\leq 1$, the operator norm of the above multiplication is bounded above by $\left\|W_{\text{rec}}\right\|^{k}$. So if the spectral radius of $W_{\text{rec}}$ is $\gamma<1$, then at large $k$, the above multiplication has operator norm bounded above by $\gamma^{k}\to 0$. This is the prototypical vanishing gradient problem. The effect of a vanishing gradient is that the network cannot learn long-range effects. 
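The decay of the Jacobian product can be observed in a few lines. The sketch below assumes a scalar (one-unit) version of the sigmoid recurrent network with no input, so that each Jacobian is the single number `w_rec * sigmoid_prime(x)`; the parameter values are hypothetical and chosen only so that `abs(w_rec) < 1`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def jacobian_products(w_rec, b, x0, steps):
    """Running products of dx_t/dx_{t-1} = w_rec * sigmoid'(x_{t-1}) along a trajectory."""
    x = x0
    product = 1.0
    products = []
    for _ in range(steps):
        # The Jacobian of this step is evaluated at the state before the update.
        product *= w_rec * sigmoid_prime(x)
        products.append(product)
        x = w_rec * sigmoid(x) + b  # scalar recurrence with u_t = 0
    return products

# Hypothetical parameters: a recurrent weight with magnitude below 1.
w_rec, b, x0 = 0.9, 0.0, 0.5
products = jacobian_products(w_rec, b, x0, steps=20)
bounds = [abs(w_rec) ** (k + 1) for k in range(20)]
```

Each entry of `products` stays below the corresponding entry of `bounds`, and in this scalar case the decay is in fact much faster than the bound suggests, because the sigmoid derivative never exceeds 1/4.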
Recall Equation (loss differential):
$$\nabla_\theta L=\nabla_x L(x_T,u_1,\dots,u_T)\left[\nabla_\theta F(x_{t-1},u_t,\theta)+\nabla_x F(x_{t-1},u_t,\theta)\nabla_\theta F(x_{t-2},u_{t-1},\theta)+\cdots\right]$$
The components of $\nabla_\theta F(x,u,\theta)$ are just components of $\sigma(x)$ and $u$, so if $u_t,u_{t-1},\dots$ are bounded, then $\left\|\nabla_\theta F(x_{t-k-1},u_{t-k},\theta)\right\|$ is also bounded by some $M>0$, and so the terms in $\nabla_\theta L$ decay as $M\gamma^{k}$. This means that, effectively, $\nabla_\theta L$ is affected only by the first $O(\gamma^{-1})$ terms in the sum. If $\gamma\geq 1$, the above analysis does not quite work. For the prototypical exploding gradient problem, the next model is clearer. === Dynamical systems model === Following (Doya, 1993), consider this one-neuron recurrent network with sigmoid activation:
$$x_{t+1}=(1-\varepsilon)x_t+\varepsilon\,\sigma(wx_t+b)+\varepsilon w'u_t$$
At the small $\varepsilon$ limit, the dynamics of the network becomes
$$\frac{dx}{dt}=-x(t)+\sigma(wx(t)+b)+w'u(t)$$
Consider first the autonomous case, with $u=0$. Set $w=5.0$, and vary $b$ in $[-3,-2]$. 
As $b$ decreases, the system has 1 stable point, then has 2 stable points and 1 unstable point, and finally has 1 stable point again. Explicitly, the stable points are $(x,b)=\left(x,\ln\left(\frac{x}{1-x}\right)-5x\right)$. Now consider $\frac{\Delta x(T)}{\Delta x(0)}$ and $\frac{\Delta x(T)}{\Delta b}$, where $T$ is large enough that the system has settled into one of the stable points. If $(x(0),b)$ puts the system very close to an unstable point, then a tiny variation in $x(0)$ or $b$ wo

    Read more →
  • Outline of automation

    Outline of automation

    The following outline is provided as an overview of and topical guide to automation: Automation – use of control systems and information technologies to reduce the need for human work in the production of goods and services. In the scope of industrialization, automation is a step beyond mechanization. == Essence of automation == Control system – a device, or set of devices to manage, command, direct or regulate the behavior of other devices or systems. Industrial control system (ICS) – encompasses several types of control systems used in industrial production, including supervisory control and data acquisition (SCADA) systems, distributed control systems (DCS), and other smaller control system configurations such as skid-mounted programmable logic controllers (PLC) often found in industrial sectors and critical infrastructures. Industrialization – period of social and economic change that transforms a human group from an agrarian society into an industrial one. Numerical control (NC) – refers to the automation of machine tools that are operated by abstractly programmed commands encoded on a storage medium, as opposed to controlled manually via handwheels or levers, or mechanically automated via cams alone. Robotics – the branch of technology that deals with the design, construction, operation, structural disposition, manufacture and application of robots and computer systems for their control, sensory feedback, and information processing. == Branches of automation == === General purpose === Autonomous automation – autonomous software agents to adapt the controllers of computer controlled industrial machinery and processes Banking automation Broadcast automation Building automation – advanced functionality provided by the control system of a building. A building automation system (BAS) is an example of a distributed control system. Home automation – control system of a home. 
Office automation – the varied computer machinery and software used to digitally create, collect, store, manipulate, and relay office information needed for accomplishing basic tasks such as business process automation and robotic process automation. Console automation Database automation Integrated library system Laboratory automation === Specific purpose === Automated attendant Automated guided vehicle Autonomous mobile robot Automated highway system Automated pool cleaner Automated teller machine Automatic painting (robotic) Pop music automation Remotely operated vehicle Robotic lawn mower Telephone switchboard Vending machine == Fields contributing to automation == Cybernetics – the interdisciplinary study of the structure of regulatory systems. Cognitive science – interdisciplinary scientific study of the mind and its processes. It examines what cognition is, what it does and how it works. Robotics – the branch of technology that deals with the design, construction, operation, structural disposition, manufacture and application of robots and computer systems for their control, sensory feedback, and information processing. == History of automation == History of mass production – Prerequisites of mass production were interchangeable parts, machine tools and power, especially in the form of electricity. Mass production was popularized in the 1910s and 1920s by Henry Ford's Ford Motor Company, which introduced electric motors to the then-well-known technique of chain or sequential production. History of home automation == Automated machines == Machine to Machine OLE for process control (OPC) Process control – a statistics and engineering discipline that deals with architectures, mechanisms and algorithms for maintaining the output of a specific process within a desired range. Run Book Automation (RBA) Robot – a mechanical or virtual intelligent agent that can perform tasks automatically or with guidance, typically by remote control. 
== Automated machine components == Artificial intelligence – the intelligence of machines and the branch of computer science that aims to create it. Friendly artificial intelligence – an artificial intelligence that has a positive rather than negative effect on humanity, and the field of knowledge required to build such an artificial intelligence. === Automation tools === Artificial neural network (ANN) – mathematical model or computational model that is inspired by the structure or functional aspects of biological neural networks. Human machine interface (HMI) – operator level local control panel that monitors field devices Laboratory information management system (LIMS) – software package that offers a set of key features that support a modern laboratory's operations. Industrial control system – encompasses several types of control systems used in industrial production, including supervisory control and data acquisition (SCADA) systems, distributed control systems (DCS), and other smaller control system configurations such as skid-mounted programmable logic controllers (PLC) often found in the industrial sectors and critical infrastructures. Distributed control system (DCS) – control system usually of a manufacturing system, process or any kind of dynamic system, in which the controller elements are not central in location (like the brain) but are distributed throughout the system with each component sub-system controlled by one or more controllers. Manufacturing execution system (MES) – system that manages manufacturing operations in a factory, including management of resources, scheduling production processes, dispatching production orders, execution of production orders, etc. Programmable automation controller (PAC) – digital computer used for automation of electromechanical processes, such as control of machinery on factory assembly lines, amusement rides, or light fixtures. 
Programmable logic controller (PLC) – also called a programmable controller, a digital computer used for automation of electromechanical processes, such as control of machinery on factory assembly lines, amusement rides, or light fixtures. The abbreviation "PLC" and the term "Programmable Logic Controller" are registered trademarks of the Allen-Bradley Company (Rockwell Automation). PLCs are used in many industries and machines. Unlike general-purpose computers, the PLC is designed for multiple input and output arrangements, extended temperature ranges, immunity to electrical noise, and resistance to vibration and impact. Programs to control machine operation are typically stored in battery-backed-up or non-volatile memory. A PLC is an example of a hard real-time system, since output results must be produced in response to input conditions within a limited time; otherwise unintended operation will result. Supervisory control and data acquisition (SCADA) – generally refers to industrial control systems (ICS): computer systems that monitor and control industrial, infrastructure, or facility-based processes. Industrial processes include those of manufacturing, production, power generation, fabrication, and refining, and may run in continuous, batch, repetitive, or discrete modes. Simulation § Engineering Technology simulation or Process simulation == Social movements == Automation-related social movement – a movement that advocates semi- or fully automatic systems to provide for human needs globally. For example, automation of farming and food distribution throughout the world so that no one will go hungry. One goal is to automate all mundane labor, to free humans to engage in more creative activities (or less work). 
The Technocracy movement – social movement active from the Great Depression (1930s) to date that proposes replacing politicians and business people with scientists and engineers who have the technical expertise to manage the economy. The Zeitgeist Movement – movement advocating the replacement of the market economy with an economy in which all resources are equitably, commonly and sustainably shared. == Automation in the future == Android – a robot or synthetic organism designed to look and act like a human, and with a body having a flesh-like resemblance Technological singularity – the hypothetical future emergence of greater-than-human intelligence through technological means Semi-automation – using a centralized computer controller to orchestrate the activities of man and machine. == Automation-related publications == IEEE Spectrum – the flagship publication of the Institute of Electrical and Electronics Engineers (IEEE), explores the development, applications and implications of new technologies, and provides a forum for understanding, discussion and leadership in these areas. IEEE Transactions on Information Theory – peer-reviewed scientific journal published by the Institute of Electrical and Electronics Engineers (IEEE), focused on the study of information theory, the mathematics of communications, including computer communications, robotics communications, etc. IEEE Transactions on Control S

    Read more →
  • Swish function

    Swish function

    The swish function is a family of mathematical functions defined as follows: $\operatorname{swish}_{\beta}(x) = x \operatorname{sigmoid}(\beta x) = \frac{x}{1+e^{-\beta x}}$, where $\beta$ can be constant (usually set to 1) or trainable, and "sigmoid" refers to the logistic function. The swish family was designed to smoothly interpolate between a linear function and the rectified linear unit (ReLU) function. When considering positive values, swish is a particular case of a doubly parameterized sigmoid shrinkage function. Variants of the swish function include Mish. == Special values == For β = 0, the function is linear: f(x) = x/2. For β = 1, the function is the Sigmoid Linear Unit (SiLU). For β = 1.702, the function approximates GELU. With β → ∞, the function converges to ReLU. Thus, the swish family smoothly interpolates between a linear function and the ReLU function. Since $\operatorname{swish}_{\beta}(x) = \operatorname{swish}_{1}(\beta x)/\beta$, all instances of swish have the same shape as the default $\operatorname{swish}_{1}$, zoomed by $\beta$. One usually sets $\beta > 0$. When $\beta$ is trainable, this constraint can be enforced by $\beta = e^{b}$, where $b$ is trainable. 
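The definition and special values above can be checked numerically. The following is a minimal sketch in plain Python, not a reference implementation; the function names (`sigmoid`, `swish`, `relu`, `swish_trainable`) are illustrative choices, and the value 50 used in the usage note to stand in for "large β" is an arbitrary assumption.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, branched to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """swish_beta(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    """Rectified linear unit, used here only for comparison."""
    return max(0.0, x)


def swish_trainable(x: float, b: float) -> float:
    """Keeps beta positive by parameterizing beta = exp(b), with b unconstrained."""
    return swish(x, math.exp(b))
```

With these definitions, `swish(x, 0.0)` returns `x / 2`, `swish(x, 1.0)` is the SiLU, and `swish(x, 50.0)` is already close to `relu(x)` for inputs that are not near zero. The scaling relation can be checked by comparing `swish(x, beta)` with `swish(beta * x, 1.0) / beta`.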
The default swish has the series expansion $\operatorname{swish}_{1}(x) = \frac{x}{2} + \frac{x^{2}}{4} - \frac{x^{4}}{48} + \frac{x^{6}}{480} + O\left(x^{8}\right)$ and satisfies the identities $\operatorname{swish}_{1}(x) = \frac{x}{2}\tanh\left(\frac{x}{2}\right) + \frac{x}{2}$, $\operatorname{swish}_{1}(x) + \operatorname{swish}_{1}(-x) = x\tanh\left(\frac{x}{2}\right)$, and $\operatorname{swish}_{1}(x) - \operatorname{swish}_{1}(-x) = x$. == Derivatives == Because $\operatorname{swish}_{\beta}(x) = \operatorname{swish}_{1}(\beta x)/\beta$, it suffices to calculate its derivatives for the default case. $\operatorname{swish}_{1}'(x) = \frac{x+\sinh(x)}{4\cosh^{2}\left(\frac{x}{2}\right)} + \frac{1}{2}$, so $\operatorname{swish}_{1}'(x) - \frac{1}{2}$ is odd. $\operatorname{swish}_{1}''(x) = \frac{1-\frac{x}{2}\tanh\left(\frac{x}{2}\right)}{2\cosh^{2}\left(\frac{x}{2}\right)}$, so $\operatorname{swish}_{1}''(x)$ is even. == History == SiLU was first proposed alongside the GELU in 2016, then proposed again in 2017 as the Sigmoid-weighted Linear Unit (SiL) in reinforcement learning. The SiLU/SiL was then proposed once more as SWISH over a year after its initial discovery, originally without the learnable parameter β, so that β implicitly equaled 1. The swish paper was then updated to propose the activation with the learnable parameter β. 
In 2017, after performing analysis on ImageNet data, researchers from Google reported that using swish as an activation function in artificial neural networks improves performance compared to ReLU and sigmoid functions. It is believed that one reason for the improvement is that the swish function helps alleviate the vanishing gradient problem during backpropagation.
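Since the gradient behaviour depends on the derivative of the activation, the closed-form derivatives quoted in the Derivatives section can be sanity-checked against finite differences. This is a small sketch under stated assumptions: the helper names and the step size `h = 1e-5` are illustrative, and the check is numerical, not a proof.

```python
import math


def swish1(x: float) -> float:
    """Default swish (beta = 1), written as (x/2) * tanh(x/2) + x/2."""
    return 0.5 * x * math.tanh(0.5 * x) + 0.5 * x


def swish1_prime(x: float) -> float:
    """Closed form: (x + sinh x) / (4 cosh^2(x/2)) + 1/2."""
    return (x + math.sinh(x)) / (4.0 * math.cosh(0.5 * x) ** 2) + 0.5


def swish1_second(x: float) -> float:
    """Closed form: (1 - (x/2) tanh(x/2)) / (2 cosh^2(x/2))."""
    return (1.0 - 0.5 * x * math.tanh(0.5 * x)) / (2.0 * math.cosh(0.5 * x) ** 2)


def central_diff(f, x: float, h: float = 1e-5) -> float:
    """Symmetric finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

Comparing `swish1_prime(x)` with `central_diff(swish1, x)`, and `swish1_second(x)` with `central_diff(swish1_prime, x)`, at a handful of points is a quick way to catch sign or factor errors when re-deriving these formulas.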

    Read more →
  • Consensus clustering

    Consensus clustering

    Consensus clustering is a method of aggregating (potentially conflicting) results from multiple clustering algorithms. Also called cluster ensembles or aggregation of clustering (or partitions), it refers to the situation in which a number of different (input) clusterings have been obtained for a particular dataset and it is desired to find a single (consensus) clustering which is a better fit in some sense than the existing clusterings. Consensus clustering is thus the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. When cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete, even when the number of input clusterings is three. Consensus clustering for unsupervised learning is analogous to ensemble learning in supervised learning. == Issues with existing clustering techniques == Current clustering techniques do not address all the requirements adequately. Dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity. The effectiveness of a method depends on the definition of "distance" (for distance-based clustering). If an obvious distance measure doesn't exist, we must "define" it, which is not always easy, especially in multidimensional spaces. The result of the clustering algorithm (which, in many cases, can be arbitrary itself) can be interpreted in different ways. == Justification for using consensus clustering == There are potential shortcomings for all existing clustering techniques. This may cause interpretation of results to become difficult, especially when there is no knowledge about the number of clusters. Clustering methods are also very sensitive to the initial clustering settings, which can cause non-significant data to be amplified in non-reiterative methods. 
An extremely important issue in cluster analysis is the validation of the clustering results, that is, how to gain confidence about the significance of the clusters provided by the clustering technique (cluster numbers and cluster assignments). Lacking an external objective criterion (the equivalent of a known class label in supervised analysis), this validation becomes somewhat elusive. Iterative descent clustering methods, such as the SOM and k-means clustering, circumvent some of the shortcomings of hierarchical clustering by providing for univocally defined clusters and cluster boundaries. However, they lack the intuitive and visual appeal of hierarchical clustering dendrograms, and the number of clusters must be chosen a priori. Consensus clustering provides a method that represents the consensus across multiple runs of a clustering algorithm, to determine the number of clusters in the data, and to assess the stability of the discovered clusters. The method can also be used to represent the consensus over multiple runs of a clustering algorithm with random restart (such as k-means, model-based Bayesian clustering, SOM, etc.), so as to account for its sensitivity to the initial conditions. It can provide data for a visualization tool to inspect cluster number, membership, and boundaries. == The Monti consensus clustering algorithm == The Monti consensus clustering algorithm is one of the most popular consensus clustering algorithms and is used to determine the number of clusters, $K$. Given a dataset of $N$ points to cluster, this algorithm works by resampling and clustering the data for each $K$, and an $N\times N$ consensus matrix is calculated, where each element represents the fraction of times two samples clustered together. 
A perfectly stable matrix would consist entirely of zeros and ones, representing all sample pairs always clustering together or not together over all resampling iterations. The relative stability of the consensus matrices can be used to infer the optimal $K$. More specifically, given a set of points to cluster, $D=\{e_{1},e_{2},\ldots,e_{N}\}$, let $D^{1},D^{2},\ldots,D^{H}$ be the list of $H$ perturbed (resampled) datasets of the original dataset $D$, and let $M^{h}$ denote the $N\times N$ connectivity matrix resulting from applying a clustering algorithm to the dataset $D^{h}$. The entries of $M^{h}$ are defined as follows: $M^{h}(i,j)=\begin{cases}1,&\text{if points } i \text{ and } j \text{ belong to the same cluster}\\0,&\text{otherwise}\end{cases}$ Let $I^{h}$ be the $N\times N$ indicator matrix where the $(i,j)$-th entry is equal to 1 if points $i$ and $j$ are in the same perturbed dataset $D^{h}$, and 0 otherwise. The indicator matrix is used to keep track of which samples were selected during each resampling iteration for the normalisation step. The consensus matrix $C$ is defined as the normalised sum of all connectivity matrices of all the perturbed datasets, and a different one is calculated for every $K$. 
$C(i,j)=\frac{\sum_{h=1}^{H}M^{h}(i,j)}{\sum_{h=1}^{H}I^{h}(i,j)}$ That is, the entry $(i,j)$ in the consensus matrix is the number of times points $i$ and $j$ were clustered together divided by the total number of times they were selected together. The matrix is symmetric and each element lies within the range $[0,1]$. A consensus matrix is calculated for each $K$ to be tested, and the stability of each matrix, that is, how close the matrix is to a matrix of perfect stability (just zeros and ones), is used to determine the optimal $K$. One way of quantifying the stability of the $K$th consensus matrix is examining its CDF curve (see below). == Over-interpretation potential of the Monti consensus clustering algorithm == Monti consensus clustering can be a powerful tool for identifying clusters, but it needs to be applied with caution, as shown by Şenbabaoğlu et al. It has been shown that the Monti consensus clustering algorithm is able to claim apparent stability of chance partitioning of null datasets drawn from a unimodal distribution, and thus has the potential to lead to over-interpretation of cluster stability in a real study. If clusters are not well separated, consensus clustering could lead one to conclude apparent structure when there is none, or declare cluster stability when it is subtle. Identifying false positive clusters is a common problem throughout cluster research, and has been addressed by methods such as SigClust and the GAP-statistic. However, these methods rely on certain assumptions for the null model that may not always be appropriate. 
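The consensus matrix C(i, j) defined above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the published implementation: the function names are hypothetical, the base clusterer is a simple Lloyd-style k-means (the text does not fix one), resampling is an 80% subsample without replacement, and pairs that are never sampled together are assigned 0.0.

```python
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd-style k-means on coordinate tuples; returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters keep their center.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def consensus_matrix(data, k, n_resamples=50, frac=0.8, seed=0):
    """C(i, j) = (times i and j were clustered together) / (times both were sampled)."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]  # running sum of the M^h matrices
    sampled = [[0] * n for _ in range(n)]   # running sum of the I^h matrices
    size = max(k, int(frac * n))
    for _ in range(n_resamples):
        idx = rng.sample(range(n), size)    # one perturbed dataset D^h
        labels = kmeans([data[i] for i in idx], k, rng)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    # Assumption: pairs never sampled together get 0.0 to avoid division by zero.
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
```

Calling `consensus_matrix(data, k)` once per candidate `k` yields one matrix per K, as the text describes. The result is symmetric by construction, and every entry lies in [0, 1].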
Şenbabaoğlu et al. demonstrated that the original delta K metric used to decide $K$ in the Monti algorithm performed poorly, and proposed a new, superior metric for measuring the stability of consensus matrices using their CDF curves. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, whereas the middle segment represents those with ambiguous assignments in different clustering runs. The proportion of ambiguous clustering (PAC) measure quantifies this middle segment, and is defined as the fraction of sample pairs with consensus indices falling in the interval (u1, u2) ⊂ [0, 1], where u1 is a value close to 0 and u2 is a value close to 1 (for instance u1=0.1 and u2=0.9). A low value of PAC indicates a flat middle segment and a low rate of discordant assignments across permuted clustering runs. One can therefore infer the optimal number of clusters as the $K$ value having the lowest PAC. == Related work == Clustering ensemble (Strehl and Ghosh): They considered various formulations for the problem, most of which reduce the problem to a hyper-graph partitioning problem. In one of their formulations they considered the same graph as in the correlation clustering problem. The solution they proposed is to compute the best k-partition of the graph, which does not take into account the penalty for merging two nodes that are far apart. Clustering aggregation (Fern and Brodley): They applied the clustering aggregation idea to a collection of soft clusterings they obtained by random projections. They used an agglomerative algorithm

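The PAC measure described in this entry can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes the consensus matrix is a square list of lists, counts each unordered pair once, and treats the interval (u1, u2) as open at both ends, which is one reading of the definition in the text.

```python
def pac_score(consensus, u1=0.1, u2=0.9):
    """Fraction of distinct sample pairs (i < j) whose consensus index lies in (u1, u2)."""
    n = len(consensus)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    ambiguous = sum(1 for i, j in pairs if u1 < consensus[i][j] < u2)
    return ambiguous / len(pairs)


def best_k(consensus_by_k, u1=0.1, u2=0.9):
    """Return the K whose consensus matrix has the lowest PAC (ties go to the smallest K)."""
    return min(sorted(consensus_by_k), key=lambda k: pac_score(consensus_by_k[k], u1, u2))
```

Here `consensus_by_k` is assumed to be a dictionary mapping each candidate K to its consensus matrix. A matrix of only zeros and ones gives a PAC of 0, and a matrix whose off-diagonal entries all sit near 0.5 gives a PAC of 1.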
    Read more →