AI For School Students

AI For School Students — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Web development

    Web development

    Web development is the process of designing, developing and maintaining websites and web apps. Web development encompasses several different fields, most commonly referring to the programming of websites. Front-end development is the act of developing the user interface and client-side code, while back-end development focuses on the infrastructure behind a website, mainly server-side code. Since the World Wide Web was released publicly in 1993, web development has evolved greatly, with websites changing from a collection of static HTML pages to complex projects using frameworks, servers, and databases. == Overview == Web development includes many individual tasks, including web design, web content development, networking, and coding. Among web professionals, "web development" usually refers to the main non-design aspects of building websites: writing markup and coding. Web development is generally split into two fields: front-end development and back-end development. Front-end developers create the user interface of websites, turning web designs into HTML, CSS, and JavaScript code. Front-end developers must also make sure that websites work consistently across different browsers and devices. Back-end development, also known as server-side development, focuses on the infrastructure behind a website, including APIs, database management, and security. Some choose to be full-stack developers, meaning they work on both the front-end and back-end. == History == The World Wide Web is often categorised into three generations: Web 1.0, Web 2.0, and Web 3.0 (or Web3). It was invented in 1989, and released to the public in 1993. In the early years of the web, restrospecitvely referred to as Web 1.0, websites were simply a collection of static HTML files, and had limited interactivity. After the introduction of JavaScript in 1995, websites could contain logic, allowing for interactivity. The following year CSS was released, allowing greater control over the styling of web pages. In 1999, the term Web 2.0 was coined by Darcy DiNucci. The term later resurfaced in the early 2000s, as websites started to increase in complexity, requiring server-side services in addition to JavaScript. This led to the emergence of various new programming languages and frameworks designed for backend services, such as PHP, Active Server Pages, and Jakarta Server Pages. This enabled websites to do additional server-side processing, such as accessing databases. Another shift in web development was the release of the iPhone in 2007. This created a new medium for accessing the web, requiring a new approach to web development, and resulting in responsive web design, which allows a single website to appear different depending on the device running it. Later, progressive web apps were introduced, allowing websites to be installed on a device as an independent application. In the 2010s, JavaScript frameworks began to emerge, creating new ways to manipulate web pages, and increasing compatibility between web browsers. JQuery was popular in the early 2010s, but was later surpassed by other frameworks such as React and Vue.js. In the mid 2020s, use of AI became prevalent among web developers, with the 2025 Stack Overflow survey showing over 80% of developers saying the use AI at least monthly in their development process.

    Read more →
  • Canonical correspondence analysis

    Canonical correspondence analysis

    In multivariate analysis, canonical correspondence analysis (CCA) is an ordination technique that determines axes from the response data as a unimodal combination of measured predictors. CCA is commonly used in ecology in order to extract gradients that drive the composition of ecological communities. CCA extends correspondence analysis (CA) with regression, in order to incorporate predictor variables. == History == CCA was developed in 1986 by Cajo ter Braak and implemented in the program CANOCO, an extension of DECORANA. To date, CCA is one of the most popular multivariate methods in ecology, despite the availability of contemporary alternatives. CCA was originally derived and implemented using an algorithm of weighted averaging, though Legendre & Legendre (1998) derived an alternative algorithm. == Assumptions == The requirements of a CCA are that the samples are random and independent. Also, the data are categorical and that the independent variables are consistent within the sample site and error-free. The original publication states the need for equal species tolerances, equal species maxima, and equispaced or uniformly distributed species optima and site scores.

    Read more →
  • Randomized weighted majority algorithm

    Randomized weighted majority algorithm

    The randomized weighted majority algorithm is an algorithm in machine learning theory for aggregating expert predictions to a series of decision problems. It is a simple and effective method based on weighted voting which improves on the mistake bound of the deterministic weighted majority algorithm. In fact, in the limit, its prediction rate can be arbitrarily close to that of the best-predicting expert. == Example == Imagine that every morning before the stock market opens, we get a prediction from each of our "experts" about whether the stock market will go up or down. Our goal is to somehow combine this set of predictions into a single prediction that we then use to make a buy or sell decision for the day. The principal challenge is that we do not know which experts will give better or worse predictions. The RWMA gives us a way to do this combination such that our prediction record will be nearly as good as that of the single expert which, in hindsight, gave the most accurate predictions. == Motivation == In machine learning, the weighted majority algorithm (WMA) is a deterministic meta-learning algorithm for aggregating expert predictions. In pseudocode, the WMA is as follows: initialize all experts to weight 1 for each round: add each expert's weight to the option they predicted predict the option with the largest weighted sum multiply the weights of all experts who predicted wrongly by 1 2 {\displaystyle {\frac {1}{2}}} Suppose there are n {\displaystyle n} experts and the best expert makes m {\displaystyle m} mistakes. Then, the weighted majority algorithm (WMA) makes at most 2.4 ( log 2 ⁡ n + m ) {\displaystyle 2.4(\log _{2}n+m)} mistakes. This bound is highly problematic in the case of highly error-prone experts. Suppose, for example, the best expert makes a mistake 20% of the time; that is, in N = 100 {\displaystyle N=100} rounds using n = 10 {\displaystyle n=10} experts, the best expert makes m = 20 {\displaystyle m=20} mistakes. Then, the weighted majority algorithm only guarantees an upper bound of 2.4 ( log 2 ⁡ 10 + 20 ) ≈ 56 {\displaystyle 2.4(\log _{2}10+20)\approx 56} mistakes. As this is a known limitation of the weighted majority algorithm, various strategies have been explored in order to improve the dependence on m {\displaystyle m} . In particular, we can do better by introducing randomization. Drawing inspiration from the Multiplicative Weights Update Method algorithm, we will probabilistically make predictions based on how the experts have performed in the past. Similarly to the WMA, every time an expert makes a wrong prediction, we will decrement their weight. Mirroring the MWUM, we will then use the weights to make a probability distribution over the actions and draw our action from this distribution (instead of deterministically picking the majority vote as the WMA does). == Randomized weighted majority algorithm (RWMA) == The randomized weighted majority algorithm is an attempt to improve the dependence of the mistake bound of the WMA on m {\displaystyle m} . Instead of predicting based on majority vote, the weights, are used as probabilities for choosing the experts in each round and are updated over time (hence the name randomized weighted majority). Precisely, if w i {\displaystyle w_{i}} is the weight of expert i {\displaystyle i} , let W = ∑ i w i {\displaystyle W=\sum _{i}w_{i}} . We will follow expert i {\displaystyle i} with probability w i W {\displaystyle {\frac {w_{i}}{W}}} . This results in the following algorithm: initialize all experts to weight 1. for each round: add all experts' weights together to obtain the total weight W {\displaystyle W} choose expert i {\displaystyle i} randomly with probability w i W {\displaystyle {\frac {w_{i}}{W}}} predict as the chosen expert predicts multiply the weights of all experts who predicted wrongly by β {\displaystyle \beta } The goal is to bound the worst-case expected number of mistakes, assuming that the adversary has to select one of the answers as correct before we make our coin toss. This is a reasonable assumption in, for instance, the stock market example provided above: the variance of a stock price should not depend on the opinions of experts that influence private buy or sell decisions, so we can treat the price change as if it was decided before the experts gave their recommendations for the day. The randomized algorithm is better in the worst case than the deterministic algorithm (weighted majority algorithm): in the latter, the worst case was when the weights were split 50/50. But in the randomized version, since the weights are used as probabilities, there would still be a 50/50 chance of getting it right. In addition, generalizing to multiplying the weights of the incorrect experts by β < 1 {\displaystyle \beta <1} instead of strictly 1 2 {\displaystyle {\frac {1}{2}}} allows us to trade off between dependence on m {\displaystyle m} and log 2 ⁡ n {\displaystyle \log _{2}n} . This trade-off will be quantified in the analysis section. == Analysis == Let W t {\displaystyle W_{t}} denote the total weight of all experts at round t {\displaystyle t} . Also let F t {\displaystyle F_{t}} denote the fraction of weight placed on experts which predict the wrong answer at round t {\displaystyle t} . Finally, let N {\displaystyle N} be the total number of rounds in the process. By definition, F t {\displaystyle F_{t}} is the probability that the algorithm makes a mistake on round t {\displaystyle t} . It follows from the linearity of expectation that if M {\displaystyle M} denotes the total number of mistakes made during the entire process, E [ M ] = ∑ t = 1 N F t {\displaystyle E[M]=\sum _{t=1}^{N}F_{t}} . After round t {\displaystyle t} , the total weight is decreased by ( 1 − β ) F t W t {\displaystyle \ (1-\beta )F_{t}W_{t}} , since all weights corresponding to a wrong answer are multiplied by β < 1 {\displaystyle \ \beta <1} . It then follows that W t + 1 = W t ( 1 − ( 1 − β ) F t ) {\displaystyle W_{t+1}=W_{t}(1-(1-\beta )F_{t})} . By telescoping, since W 1 = n {\displaystyle W_{1}=n} , it follows that the total weight after the process concludes is On the other hand, suppose that m {\displaystyle \ m} is the number of mistakes made by the best-performing expert. At the end, this expert has weight β m {\displaystyle \ \beta ^{m}} . It follows, then, that the total weight is at least this much; in other words, W ≥ β m {\displaystyle \ W\geq \beta ^{m}} . This inequality and the above result imply Taking the natural logarithm of both sides yields Now, the Taylor series of the natural logarithm is In particular, it follows that ln ⁡ ( 1 − ( 1 − β ) F t ) < − ( 1 − β ) F t {\displaystyle \ \ln(1-(1-\beta )F_{t})<-(1-\beta )F_{t}} . Thus, Recalling that E [ M ] = ∑ t = 1 N F t {\displaystyle E[M]=\sum _{t=1}^{N}F_{t}} and rearranging, it follows that Now, as β → 1 {\displaystyle \beta \to 1} from below, the first constant tends to 1 {\displaystyle 1} ; however, the second constant tends to + ∞ {\displaystyle +\infty } . To quantify this tradeoff, define ε = 1 − β {\displaystyle \varepsilon =1-\beta } to be the penalty associated with getting a prediction wrong. Then, again applying the Taylor series of the natural logarithm, It then follows that the mistake bound, for small ε {\displaystyle \varepsilon } , can be written in the form ( 1 + ϵ 2 + O ( ε 2 ) ) m + ϵ − 1 ln ⁡ ( n ) {\displaystyle \ \left(1+{\frac {\epsilon }{2}}+O(\varepsilon ^{2})\right)m+\epsilon ^{-1}\ln(n)} . In English, the less that we penalize experts for their mistakes, the more that additional experts will lead to initial mistakes but the closer we get to capturing the predictive accuracy of the best expert as time goes on. In particular, given a sufficiently low value of ε {\displaystyle \varepsilon } and enough rounds, the randomized weighted majority algorithm can get arbitrarily close to the correct prediction rate of the best expert. In particular, as long as m {\displaystyle m} is sufficiently large compared to ln ⁡ ( n ) {\displaystyle \ln(n)} (so that their ratio is sufficiently small), we can assign we can obtain an upper bound on the number of mistakes equal to This implies that the "regret bound" on the algorithm (that is, how much worse it performs than the best expert) is sublinear, at O ( m ln ⁡ ( n ) ) {\displaystyle O({\sqrt {m\ln(n)}})} . == Revisiting the motivation == Recall that the motivation for the randomized weighted majority algorithm was given by an example where the best expert makes a mistake 20% of the time. Precisely, in N = 100 {\displaystyle N=100} rounds, with n = 10 {\displaystyle n=10} experts, where the best expert makes m = 20 {\displaystyle m=20} mistakes, the deterministic weighted majority algorithm only guarantees an upper bound of 2.4 ( log 2 ⁡ 10 + 20 ) ≈ 56 {\displaystyle 2.4(\log _{2}10+20)\approx 56} . By the analysis above, it follows that minimizing the number of worst-case expected mistakes is equivalent to minimizing the fun

    Read more →
  • Autoencoder

    Autoencoder

    An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms. Variants exist which aim to make the learned representations assume useful properties. Examples are regularized autoencoders (sparse, denoising and contractive autoencoders), which are effective in learning representations for subsequent classification tasks, and variational autoencoders, which can be used as generative models. Autoencoders are applied to many problems, including facial recognition, feature detection, anomaly detection, and learning the meaning of words. In terms of data synthesis, autoencoders can also be used to randomly generate new data that is similar to the input (training) data. == Mathematical principles == === Definition === An autoencoder is defined by the following components: Two sets: the space of encoded messages Z {\displaystyle {\mathcal {Z}}} ; the space of decoded messages X {\displaystyle {\mathcal {X}}} . Typically X {\displaystyle {\mathcal {X}}} and Z {\displaystyle {\mathcal {Z}}} are Euclidean spaces, that is, X = R m , Z = R n {\displaystyle {\mathcal {X}}=\mathbb {R} ^{m},{\mathcal {Z}}=\mathbb {R} ^{n}} with m > n . {\displaystyle m>n.} Two parametrized families of functions: the encoder family E ϕ : X → Z {\displaystyle E_{\phi }:{\mathcal {X}}\rightarrow {\mathcal {Z}}} , parametrized by ϕ {\displaystyle \phi } ; the decoder family D θ : Z → X {\displaystyle D_{\theta }:{\mathcal {Z}}\rightarrow {\mathcal {X}}} , parametrized by θ {\displaystyle \theta } .For any x ∈ X {\displaystyle x\in {\mathcal {X}}} , we usually write z = E ϕ ( x ) {\displaystyle z=E_{\phi }(x)} , and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any z ∈ Z {\displaystyle z\in {\mathcal {Z}}} , we usually write x ′ = D θ ( z ) {\displaystyle x'=D_{\theta }(z)} , and refer to it as the (decoded) message. Usually, both the encoder and the decoder are defined as multilayer perceptrons (MLPs). For example, a one-layer-MLP encoder E ϕ {\displaystyle E_{\phi }} is: E ϕ ( x ) = σ ( W x + b ) {\displaystyle E_{\phi }(\mathbf {x} )=\sigma (Wx+b)} where σ {\displaystyle \sigma } is an element-wise activation function, W {\displaystyle W} is a "weight" matrix, and b {\displaystyle b} is a "bias" vector. === Training an autoencoder === An autoencoder, by itself, is simply a tuple of two functions. To judge its quality, we need a task. A task is defined by a reference probability distribution μ r e f {\displaystyle \mu _{ref}} over X {\displaystyle {\mathcal {X}}} , and a "reconstruction quality" function d : X × X → [ 0 , ∞ ] {\displaystyle d:{\mathcal {X}}\times {\mathcal {X}}\to [0,\infty ]} , such that d ( x , x ′ ) {\displaystyle d(x,x')} measures how much x ′ {\displaystyle x'} differs from x {\displaystyle x} . With those, we can define the loss function for the autoencoder as L ( θ , ϕ ) := E x ∼ μ r e f [ d ( x , D θ ( E ϕ ( x ) ) ) ] {\displaystyle L(\theta ,\phi ):=\mathbb {\mathbb {E} } _{x\sim \mu _{ref}}[d(x,D_{\theta }(E_{\phi }(x)))]} The optimal autoencoder for the given task ( μ r e f , d ) {\displaystyle (\mu _{ref},d)} is then arg ⁡ min θ , ϕ L ( θ , ϕ ) {\displaystyle \arg \min _{\theta ,\phi }L(\theta ,\phi )} . The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by gradient descent. This search process is referred to as "training the autoencoder". In most situations, the reference distribution is just the empirical distribution given by a dataset { x 1 , . . . , x N } ⊂ X {\displaystyle \{x_{1},...,x_{N}\}\subset {\mathcal {X}}} , so that μ r e f = 1 N ∑ i = 1 N δ x i {\displaystyle \mu _{ref}={\frac {1}{N}}\sum _{i=1}^{N}\delta _{x_{i}}} where δ x i {\displaystyle \delta _{x_{i}}} is the Dirac measure, the quality function is just L 2 {\displaystyle L^{2}} loss: d ( x , x ′ ) = ‖ x − x ′ ‖ 2 2 {\displaystyle d(x,x')=\|x-x'\|_{2}^{2}} , and ‖ ⋅ ‖ 2 {\displaystyle \|\cdot \|_{2}} is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a least-squares optimization: min θ , ϕ L ( θ , ϕ ) , where L ( θ , ϕ ) = 1 N ∑ i = 1 N ‖ x i − D θ ( E ϕ ( x i ) ) ‖ 2 2 {\displaystyle \min _{\theta ,\phi }L(\theta ,\phi ),\qquad {\text{where }}L(\theta ,\phi )={\frac {1}{N}}\sum _{i=1}^{N}\|x_{i}-D_{\theta }(E_{\phi }(x_{i}))\|_{2}^{2}} === Interpretation === An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function d {\displaystyle d} . The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space Z {\displaystyle {\mathcal {Z}}} usually has fewer dimensions than the message space X {\displaystyle {\mathcal {X}}} . Such an autoencoder is called undercomplete. It can be interpreted as compressing the message, or reducing its dimensionality. At the limit of an ideal undercomplete autoencoder, every possible code z {\displaystyle z} in the code space is used to encode a message x {\displaystyle x} that really appears in the distribution μ r e f {\displaystyle \mu _{ref}} , and the decoder is also perfect: D θ ( E ϕ ( x ) ) = x {\displaystyle D_{\theta }(E_{\phi }(x))=x} . This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code z {\displaystyle z} and obtaining D θ ( z ) {\displaystyle D_{\theta }(z)} , which is a message that really appears in the distribution μ r e f {\displaystyle \mu _{ref}} . If the code space Z {\displaystyle {\mathcal {Z}}} has dimension larger than (overcomplete), or equal to, the message space X {\displaystyle {\mathcal {X}}} , or the hidden units are given enough capacity, an autoencoder can learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features. In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below. == Variations == === Variational autoencoder (VAE) === Variational autoencoders (VAEs) belong to the families of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs are architected with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors. Given an input dataset x {\displaystyle x} characterized by an unknown probability function P ( x ) {\displaystyle P(x)} and a multivariate latent encoding vector z {\displaystyle z} , the objective is to model the data as a distribution p θ ( x ) {\displaystyle p_{\theta }(x)} , with θ {\displaystyle \theta } defined as the set of the network parameters so that p θ ( x ) = ∫ z p θ ( x , z ) d z {\displaystyle p_{\theta }(x)=\int _{z}p_{\theta }(x,z)dz} . === Sparse autoencoder (SAE) === Inspired by the sparse coding hypothesis in neuroscience, sparse autoencoders (SAE) are variants of autoencoders, such that the codes E ϕ ( x ) {\displaystyle E_{\phi }(x)} for messages tend to be sparse codes, that is, E ϕ ( x ) {\displaystyle E_{\phi }(x)} is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time. Encouraging sparsity improves performance on classification tasks. There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the k-sparse autoencoder. The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder: f k ( x 1 , . . . , x n ) = ( x 1 b 1 , . . . , x n b n ) {\displaystyle f_{k}(x_{1},...,x_{n})=(x_{1}b_{1},...,x_{n}b_{n})} where b i = 1 {\displaystyle b_{i}=1} if | x i | {\displaystyle |x_{i}|} ranks in the top k, and 0 otherwise. Backpropagating through f k {\displaystyle f_{k}} is simple: set gradient to 0 for b i = 0 {\displaystyle b_{i}=0} entries, and keep gradient for b i = 1 {\displaystyle b_{i}=1} entries. This is essentially a generalized ReLU function. The other way is a relaxed version of the k-

    Read more →
  • AI safety

    AI safety

    AI safety is an interdisciplinary field focused on preventing accidents, misuse, or other harmful consequences arising from artificial intelligence systems. It encompasses AI alignment (which aims to ensure AI systems behave as intended), monitoring AI systems for risks, and enhancing their robustness. The field is particularly concerned with existential risks posed by advanced AI models. Beyond technical research, AI safety involves developing norms and policies that promote safety, including advocacy for regulations at different levels of government. The field gained significant popularity in 2023, with rapid progress in generative AI and public concerns voiced by researchers and CEOs about potential dangers. During the 2023 AI Safety Summit, the United States and the United Kingdom both established their own AI Safety Institute. However, researchers have expressed concern that AI safety measures are not keeping pace with the rapid development of AI capabilities. == Motivations == Scholars discuss current risks from critical systems failures, bias, and AI-enabled surveillance, as well as emerging risks like technological unemployment, digital manipulation, weaponization, AI-enabled cyberattacks and bioterrorism. They also discuss speculative risks from losing control of future artificial general intelligence (AGI) agents, or from AI enabling perpetually stable dictatorships. === Existential safety === Some have criticized concerns about AGI, such as Andrew Ng who compared them in 2015 to "worrying about overpopulation on Mars when we have not even set foot on the planet yet". Stuart J. Russell on the other side urges caution, arguing that "it is better to anticipate human ingenuity than to underestimate it". AI researchers have widely differing opinions about the severity and primary sources of risk posed by AI technology – though surveys suggest that experts take high consequence risks seriously. In two surveys of AI researchers, the median respondent was optimistic about AI overall, but placed a 5% probability on an "extremely bad (e.g. human extinction)" outcome of advanced AI. In a 2022 survey of the natural language processing community, 37% agreed or weakly agreed that it is plausible that AI decisions could lead to a catastrophe that is "at least as bad as an all-out nuclear war". == History == Risks from AI began to be seriously discussed at the start of the computer age: Moreover, if we move in the direction of making machines which learn and whose behavior is modified by experience, we must face the fact that every degree of independence we give the machine is a degree of possible defiance of our wishes. In 1988 Blay Whitby published a book outlining the need for AI to be developed along ethical and socially responsible lines. From 2008 to 2009, the Association for the Advancement of Artificial Intelligence (AAAI) commissioned a study to explore and address potential long-term societal influences of AI research and development. The panel was generally skeptical of the radical views expressed by science-fiction authors but agreed that "additional research would be valuable on methods for understanding and verifying the range of behaviors of complex computational systems to minimize unexpected outcomes". In 2011, Roman Yampolskiy introduced the term "AI safety engineering" at the Philosophy and Theory of Artificial Intelligence conference, listing prior failures of AI systems and arguing that "the frequency and seriousness of such events will steadily increase as AIs become more capable". In 2014, philosopher Nick Bostrom published the book Superintelligence: Paths, Dangers, Strategies. He has the opinion that the rise of AGI has the potential to create various societal issues, ranging from the displacement of the workforce by AI, manipulation of political and military structures, to even the possibility of human extinction. His argument that future advanced systems may pose a threat to human existence prompted Elon Musk, Bill Gates, and Stephen Hawking to voice similar concerns. In 2015, dozens of artificial intelligence experts signed an open letter on artificial intelligence calling for research on the societal impacts of AI and outlining concrete directions. To date, the letter has been signed by over 8000 people including Yann LeCun, Shane Legg, Yoshua Bengio, and Stuart Russell. In the same year, a group of academics led by professor Stuart J. Russell founded the Center for Human-Compatible AI at the University of California Berkeley and the Future of Life Institute awarded $6.5 million in grants for research aimed at "ensuring artificial intelligence (AI) remains safe, ethical and beneficial". In 2016, the White House Office of Science and Technology Policy and Carnegie Mellon University announced The Public Workshop on Safety and Control for Artificial Intelligence, which was one of a sequence of four White House workshops aimed at investigating "the advantages and drawbacks" of AI. In the same year, Concrete Problems in AI Safety – one of the first and most influential technical AI Safety agendas – was published. In 2017, the Future of Life Institute sponsored the Asilomar Conference on Beneficial AI, where more than 100 thought leaders formulated principles for beneficial AI including "Race Avoidance: Teams developing AI systems should actively cooperate to avoid corner-cutting on safety standards". In 2018, the DeepMind Safety team outlined AI safety problems in specification, robustness, and assurance. The following year, researchers organized a workshop at ICLR that focused on these problem areas. In 2021, Unsolved Problems in ML Safety was published, outlining research directions in robustness, monitoring, alignment, and systemic safety. In 2023, Rishi Sunak said he wants the United Kingdom to be the "geographical home of global AI safety regulation" and to host the first global summit on AI safety. The AI safety summit took place in November 2023, and focused on the risks of misuse and loss of control associated with frontier AI models. During the summit the intention to create the International Scientific Report on the Safety of Advanced AI was announced. In 2024, The US and UK forged a new partnership on the science of AI safety. The MoU was signed on 1 April 2024 by US commerce secretary Gina Raimondo and UK technology secretary Michelle Donelan to jointly develop advanced AI model testing, following commitments announced at an AI Safety Summit in Bletchley Park in November. In 2025, an international team of 96 experts chaired by Yoshua Bengio published the first International AI Safety Report. The report, commissioned by 30 nations and the United Nations, represents the first global scientific review of potential risks associated with advanced artificial intelligence. It details potential threats stemming from misuse, malfunction, and societal disruption, with the objective of informing policy through evidence-based findings, without providing specific recommendations. == Research focus == AI safety research areas include robustness, monitoring, and alignment. === Robustness === ==== Adversarial robustness ==== AI systems are often vulnerable to adversarial examples or "inputs to machine learning (ML) models that an attacker has intentionally designed to cause the model to make a mistake". For example, in 2013, Szegedy et al. discovered that adding specific imperceptible perturbations to an image could cause it to be misclassified with high confidence. This continues to be an issue with neural networks, though in recent work the perturbations are generally large enough to be perceptible. The image on the right is predicted to be an ostrich after the perturbation is applied. (Left) is a correctly predicted sample, (center) perturbation applied magnified by 10x, (right) adversarial example. Adversarial robustness is often associated with security. Researchers demonstrated that an audio signal could be imperceptibly modified so that speech-to-text systems transcribe it to any message the attacker chooses. Network intrusion and malware detection systems also must be adversarially robust since attackers may design their attacks to fool detectors. Models that represent objectives (reward models) must also be adversarially robust. For example, a reward model might estimate how helpful a text response is and a language model might be trained to maximize this score. Researchers have shown that if a language model is trained for long enough, it will leverage the vulnerabilities of the reward model to achieve a better score and perform worse on the intended task. This issue can be addressed by improving the adversarial robustness of the reward model. More generally, any AI system used to evaluate another AI system must be adversarially robust. This could include monitoring tools, since they could also potentially be tampered with to produce a higher reward. Large language models (LLMs) can be vulnerable to prom

    Read more →
  • ID3 algorithm

    ID3 algorithm

    In decision tree learning, ID3 (Iterative Dichotomiser 3) is a greedy algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm. The 3 in the name is meant to signify that this was Quinlan's third attempt at a model based on entropy-based splitting, and the term dichotimser is a misnomer as it implies a binary split, but the ID3 algorithm can split on multi-valued attributes. == Algorithm == The ID3 algorithm begins with the original set S {\displaystyle S} as the root node. On each iteration of the algorithm, it iterates through every unused attribute of the set S {\displaystyle S} and calculates the entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} or the information gain I G ( S ) {\displaystyle IG(S)} of that attribute. It then selects the attribute which has the smallest entropy (or largest information gain) value. The set S {\displaystyle S} is then split or partitioned by the selected attribute to produce subsets of the data. (For example, a node can be split into child nodes based upon the subsets of the population whose ages are less than 50, between 50 and 100, and greater than 100.) The algorithm continues to recurse on each subset, considering only attributes never selected before. Recursion on a subset may stop in one of these cases: every element in the subset belongs to the same class; in which case the node is turned into a leaf node and labelled with the class of the examples. there are no more attributes to be selected, but the examples still do not belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset. there are no examples in the subset, which happens when no example in the parent set was found to match a specific value of the selected attribute. An example could be the absence of a person among the population with age over 100 years. Then a leaf node is created and labelled with the most common class of the examples in the parent node's set. Throughout the algorithm, the decision tree is constructed with each non-terminal node (internal node) representing the selected attribute on which the data was split, and terminal nodes (leaf nodes) representing the class label of the final subset of this branch. === Summary === Calculate the entropy of every attribute a {\displaystyle a} of the data set S {\displaystyle S} . Partition ("split") the set S {\displaystyle S} into subsets using the attribute for which the resulting entropy after splitting is minimized; or, equivalently, information gain is maximum. Make a decision tree node containing that attribute. Recurse on subsets using the remaining attributes. === Properties === ID3 does not guarantee an optimal solution. It can converge upon local optima. It uses a greedy strategy by selecting the locally best attribute to split the dataset on each iteration. The algorithm's optimality can be improved by using backtracking during the search for the optimal decision tree at the cost of possibly taking longer. ID3 can overfit the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. This algorithm usually produces small trees, but it does not always produce the smallest possible decision tree. ID3 is harder to use on continuous data than on factored data (factored data has a discrete number of possible values, thus reducing the possible branch points). If the values of any given attribute are continuous, then there are many more places to split the data on this attribute, and searching for the best value to split by can be time-consuming. === Usage === The ID3 algorithm is used by training on a data set S {\displaystyle S} to produce a decision tree which is stored in memory. At runtime, this decision tree is used to classify new test cases (feature vectors) by traversing the decision tree using the features of the datum to arrive at a leaf node. == The ID3 metrics == === Entropy === Entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} is a measure of the amount of uncertainty in the (data) set S {\displaystyle S} (i.e. entropy characterizes the (data) set S {\displaystyle S} ). H ( S ) = ∑ x ∈ X − p ( x ) log 2 ⁡ p ( x ) {\displaystyle \mathrm {H} {(S)}=\sum _{x\in X}{-p(x)\log _{2}p(x)}} Where, S {\displaystyle S} – The current dataset for which entropy is being calculated This changes at each step of the ID3 algorithm, either to a subset of the previous set in the case of splitting on an attribute or to a "sibling" partition of the parent in case the recursion terminated previously. X {\displaystyle X} – The set of classes in S {\displaystyle S} p ( x ) {\displaystyle p(x)} – The proportion of the number of elements in class x {\displaystyle x} to the number of elements in set S {\displaystyle S} When H ( S ) = 0 {\displaystyle \mathrm {H} {(S)}=0} , the set S {\displaystyle S} is perfectly classified (i.e. all elements in S {\displaystyle S} are of the same class). In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest entropy is used to split the set S {\displaystyle S} on this iteration. Entropy in information theory measures how much information is expected to be gained upon measuring a random variable; as such, it can also be used to quantify the amount to which the distribution of the quantity's values is unknown. A constant quantity has zero entropy, as its distribution is perfectly known. In contrast, a uniformly distributed random variable (discretely or continuously uniform) maximizes entropy. Therefore, the greater the entropy at a node, the less information is known about the classification of data at this stage of the tree; and therefore, the greater the potential to improve the classification here. As such, ID3 is a greedy heuristic performing a best-first search for locally optimal entropy values. Its accuracy can be improved by preprocessing the data. === Information gain === Information gain I G ( A ) {\displaystyle IG(A)} is the measure of the difference in entropy from before to after the set S {\displaystyle S} is split on an attribute A {\displaystyle A} . In other words, how much uncertainty in S {\displaystyle S} was reduced after splitting set S {\displaystyle S} on attribute A {\displaystyle A} . I G ( S , A ) = H ( S ) − ∑ t ∈ T p ( t ) H ( t ) = H ( S ) − H ( S | A ) . {\displaystyle IG(S,A)=\mathrm {H} {(S)}-\sum _{t\in T}p(t)\mathrm {H} {(t)}=\mathrm {H} {(S)}-\mathrm {H} {(S|A)}.} Where, H ( S ) {\displaystyle \mathrm {H} (S)} – Entropy of set S {\displaystyle S} T {\displaystyle T} – The subsets created from splitting set S {\displaystyle S} by attribute A {\displaystyle A} such that S = ⋃ t ∈ T t {\displaystyle S=\bigcup _{t\in T}t} p ( t ) {\displaystyle p(t)} – The proportion of the number of elements in t {\displaystyle t} to the number of elements in set S {\displaystyle S} H ( t ) {\displaystyle \mathrm {H} (t)} – Entropy of subset t {\displaystyle t} In ID3, information gain can be calculated (instead of entropy) for each remaining attribute. The attribute with the largest information gain is used to split the set S {\displaystyle S} on this iteration.

    Read more →
  • Win–stay, lose–switch

    Win–stay, lose–switch

    In psychology, game theory, statistics, and machine learning, win–stay, lose–switch (also win–stay, lose–shift or Pavlov, named after Ivan Pavlov) is a heuristic learning strategy used to model learning in decision situations. It was first invented as an improvement over randomization in bandit problems. It was later applied to the prisoner's dilemma in order to model the evolution of altruism. In most versions, it starts either with a cooperate, then proceeds as always, or starts with a "probe" of cooperate-defect-cooperate to determine the other player's strategy. A mutual cooperation is regarded as a win. The learning rule bases its decision only on the outcome of the previous play. Outcomes are divided into successes (wins) and failures (losses). If the play on the previous round resulted in a success, then the agent plays the same strategy on the next round. Alternatively, if the play resulted in a failure the agent switches to another action. A large-scale empirical study of players of the game rock, paper, scissors shows that a variation of this strategy is adopted by real-world players of the game, instead of the Nash equilibrium strategy of choosing entirely at random between the three options.

    Read more →
  • Generalized blockmodeling of valued networks

    Generalized blockmodeling of valued networks

    Generalized blockmodeling of valued networks is an approach of the generalized blockmodeling, dealing with valued networks (e.g., non-binary). While the generalized blockmodeling signifies a "formal and integrated approach for the study of the underlying functional anatomies of virtually any set of relational data", it is in principle used for binary networks. This is evident from the set of ideal blocks, which are used to interpret blockmodels, that are binary, based on the characteristic link patterns. Because of this, such templates are "not readily comparable with valued empirical blocks". To allow generalized blockmodeling of valued directional (one-mode) networks (e.g. allowing the direct comparisons of empirical valued blocks with ideal binary blocks), a non–parametric approach is used. With this, "an optional parameter determines the prominence of valued ties as a minimum percentile deviation between observed and expected flows". Such two–sided application of parameter then introduces "the possibility of non–determined ties, i.e. valued relations that are deemed neither prominent (1) nor non–prominent (0)." Resulted occurrences of links then motivate the modification of the calculation of inconsistencies between empirical and ideal blocks. At the same time, such links also give a possibility to measure the interpretational certainty, which is specific to each ideal block. Such maximum two–sided deviation threshold, holding the aggregate uncertainty score at zero or near–zero levels, is then proposed as "a measure of interpretational certainty for valued blockmodels, in effect transforming the optional parameter into an outgoing state". Problem with blockmodeling is the standard set of ideal block, as they are all specified using binary link (tie) patters; this results in "a non–trivial exercise to match and count inconsistencies between such ideal binary ties and empirical valued ties". One approach to solve this is by using dichotomization to transform the network into a binary version. The other two approaches were first proposed by Aleš Žiberna in 2007 by introducing valued (generalized) blockmodeling and also homogeneity blockmodeling. The basic idea of the latter is "that the inconsistency of an empirical block with its ideal block can be measured by within block variability of appropriate values". The newly–formed ideal blocks, which are appropriate for blockmodeling of valued networks, are then presented together with the definitions of their block inconsistencies. Two other approaches were later suggested by Carl Nordlund in 2019: deviational approach and correlation-based generalized approach. Both Nordlund's approaches are based on the idea, that valued networks can be compared with the ideal block without values. With this approach, more information is retained for analysis, which also means, that there are fewer partitions having identical values of the criterion function. This means, that the generalized blockmodeling of valued networks measures the inconsistencies more precisely. Usually, only one optimal partition is found in this approach, especially when it is used by homogeneity blockmodeling. Contrary, while using binary blockmodeling on the same sample, usually more than one optimal partition had occurred on several occasions.

    Read more →
  • Autonomous agent

    Autonomous agent

    An autonomous agent is an artificial intelligence (AI) system that can perform complex tasks independently. == Definitions == There are various definitions of autonomous agent. According to Brustoloni (1991): "Autonomous agents are systems capable of autonomous, purposeful action in the real world." According to Maes (1995): "Autonomous agents are computational systems that inhabit some complex dynamic environment, sense and act autonomously in this environment, and by doing so realize a set of goals or tasks for which they are designed." Franklin and Graesser (1997) review different definitions and propose their definition: "An autonomous agent is a system situated within and a part of an environment that senses that environment and acts on it, over time, in pursuit of its own agenda and so as to effect what it senses in the future." They explain that: "Humans and some animals are at the high end of being an agent, with multiple, conflicting drives, multiples senses, multiple possible actions, and complex sophisticated control structures. At the low end, with one or two senses, a single action, and an absurdly simple control structure we find a thermostat." == Agent appearance == Lee et al. (2015) post safety issue from how the combination of external appearance and internal autonomous agent have impact on human reaction about autonomous vehicles. Their study explores the human-like appearance agent and high level of autonomy are strongly correlated with social presence, intelligence, safety and trustworthiness. In specific, appearance impacts most on affective trust while autonomy impacts most on both affective and cognitive domain of trust where cognitive trust is characterized by knowledge-based factors and affective trust is largely emotion driven. == Applications == Agentic AI systems: Advanced AI agents that can scope out projects and complete them with necessary tools, representing a significant evolution from simple task-oriented systems. Internet of things (IoT) Integration: Autonomous agents increasingly interact with IoT devices, enabling smart home systems, industrial monitoring, and urban infrastructure management. Collaborative software development: Tools like Cognition AI's Devin aim to create autonomous software engineers capable of complex reasoning, planning, and completing engineering tasks requiring thousands of decisions. Enterprise automation: Business process automation platforms like Salesforce's Agentforce provide autonomous bots for various service functions. == Challenges and considerations == Uncertainty and incomplete information: Autonomous agents must make decisions with limited or uncertain information about their environment and future states. Integration complexity: Incorporating autonomous agents into existing systems and workflows can be technically challenging and resource-intensive. Scalability: As systems become more complex and more agents are used, maintaining coordination and avoiding conflicts becomes increasingly difficult. Trust: Research has shown the combination of external appearance and internal autonomous capabilities significantly impacts human reactions and trust. Lee et al. (2015) found that human-like appearance and high levels of autonomy are strongly correlated with social presence, intelligence, safety, and trustworthiness perceptions. Specifically, appearance impacts affective trust most significantly, while autonomy affects both affective and cognitive trust domains, where affective trust is emotionally driven, and cognitive trust is characterized by knowledge-based factors. Vulnerability to manipulation: Researchers from Harvard, MIT and other educational institutions found that AI agents could become vulnerable to manipulation and could perform detrimental actions in the process of being helpful. == Ethical and regulatory concerns == Accountability: Determining responsibility when autonomous agents make incorrect or harmful decisions remains a complex issue. Privacy and security: autonomous agents often require access to sensitive data, raising concerns about data protection and system security.

    Read more →
  • Feature selection

    Feature selection

    In machine learning, feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for several reasons: simplification of models to make them easier to interpret, shorter training times, to avoid the curse of dimensionality, improve the compatibility of the data with a certain learning model class, to encode inherent symmetries present in the input space. The central premise when using feature selection is that data sometimes contains features that are redundant or irrelevant, and can thus be removed without incurring much loss of information. Redundancy and irrelevance are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated. Feature extraction creates new features from functions of the original features, whereas feature selection finds a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (data points). == Introduction == A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods. Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model or typical problem. Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include the mutual information, the pointwise mutual information, Pearson product-moment correlation coefficient, Relief-based algorithms, and inter/intra class distance or the scores of significance tests for each class/feature combinations. Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model. This lack of tuning means a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance than a wrapper. However the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features. Many filters provide a feature ranking rather than an explicit best feature subset, and the cut off point in the ranking is chosen via cross-validation. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems. One other popular approach is the Recursive Feature Elimination algorithm, commonly used with Support Vector Machines to repeatedly construct a model and remove features with low weights. Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso which bootstraps samples; Elastic net regularization, which combines the L1 penalty of LASSO with the L2 penalty of ridge regression; and FeaLect which scores all the features based on combinatorial analysis of regression coefficients. AEFS further extends LASSO to nonlinear scenario with autoencoders. These approaches tend to be between filters and wrappers in terms of computational complexity. In traditional regression analysis, the most popular form of feature selection is stepwise regression, which is a wrapper technique. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear network. == Subset selection == Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into wrappers, filters, and embedded methods. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of over fitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model. Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc. Alternative search-based techniques are based on targeted projection pursuit which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected. Search approaches include: Exhaustive Best first Simulated annealing Genetic algorithm Greedy forward selection Greedy backward elimination Particle swarm optimization Targeted projection pursuit Scatter search Variable neighborhood search Two popular filter metrics for classification problems are correlation and mutual information, although neither are true metrics or 'distance measures' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance' – they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information; see here. Other available filter metrics include: Class separability Error probability Inter-class distance Probabilistic distance Entropy Consistency-based feature selection Correlation-based feature selection == Optimality criteria == The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common criteria incorporate a measure of accuracy, penalised by the number of features selected. Examples include Akaike information criterion (AIC) and Mallows's Cp, which have a penalty of 2 for each added feature. AIC is based on information theory, and is effectively derived via the maximum entropy principle. Other criteria are Bayesian information criterion (BIC), which uses a penalty of log ⁡ n {\displaystyle {\sqrt {\log {n}}}} for each added feature, minimum description length (MDL) which asymptotically uses log ⁡ n {\displaystyle {\sqrt {\log {n}}}} , Bonferroni / RIC which use 2 log ⁡ p {\displaystyle {\sqrt {2\log {p}}}} , maximum dependency feature selection, and a variety of new criteria that are motivated by false discovery rate (FDR), which use something close to 2 log ⁡ p q {\displaystyle {\sqrt {2\log {\frac {p}{q}}}}} . A maximum entropy rate criterion may also be used to select the most relevant subset of features. == Structure learning == Filter feature selection is a specific case of a more general paradigm called structure learning. Feature selection finds the relevant feature set for a specific target variable whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms

    Read more →
  • ID3 algorithm

    ID3 algorithm

    In decision tree learning, ID3 (Iterative Dichotomiser 3) is a greedy algorithm invented by Ross Quinlan used to generate a decision tree from a dataset. ID3 is the precursor to the C4.5 algorithm. The 3 in the name is meant to signify that this was Quinlan's third attempt at a model based on entropy-based splitting, and the term dichotimser is a misnomer as it implies a binary split, but the ID3 algorithm can split on multi-valued attributes. == Algorithm == The ID3 algorithm begins with the original set S {\displaystyle S} as the root node. On each iteration of the algorithm, it iterates through every unused attribute of the set S {\displaystyle S} and calculates the entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} or the information gain I G ( S ) {\displaystyle IG(S)} of that attribute. It then selects the attribute which has the smallest entropy (or largest information gain) value. The set S {\displaystyle S} is then split or partitioned by the selected attribute to produce subsets of the data. (For example, a node can be split into child nodes based upon the subsets of the population whose ages are less than 50, between 50 and 100, and greater than 100.) The algorithm continues to recurse on each subset, considering only attributes never selected before. Recursion on a subset may stop in one of these cases: every element in the subset belongs to the same class; in which case the node is turned into a leaf node and labelled with the class of the examples. there are no more attributes to be selected, but the examples still do not belong to the same class. In this case, the node is made a leaf node and labelled with the most common class of the examples in the subset. there are no examples in the subset, which happens when no example in the parent set was found to match a specific value of the selected attribute. An example could be the absence of a person among the population with age over 100 years. Then a leaf node is created and labelled with the most common class of the examples in the parent node's set. Throughout the algorithm, the decision tree is constructed with each non-terminal node (internal node) representing the selected attribute on which the data was split, and terminal nodes (leaf nodes) representing the class label of the final subset of this branch. === Summary === Calculate the entropy of every attribute a {\displaystyle a} of the data set S {\displaystyle S} . Partition ("split") the set S {\displaystyle S} into subsets using the attribute for which the resulting entropy after splitting is minimized; or, equivalently, information gain is maximum. Make a decision tree node containing that attribute. Recurse on subsets using the remaining attributes. === Properties === ID3 does not guarantee an optimal solution. It can converge upon local optima. It uses a greedy strategy by selecting the locally best attribute to split the dataset on each iteration. The algorithm's optimality can be improved by using backtracking during the search for the optimal decision tree at the cost of possibly taking longer. ID3 can overfit the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. This algorithm usually produces small trees, but it does not always produce the smallest possible decision tree. ID3 is harder to use on continuous data than on factored data (factored data has a discrete number of possible values, thus reducing the possible branch points). If the values of any given attribute are continuous, then there are many more places to split the data on this attribute, and searching for the best value to split by can be time-consuming. === Usage === The ID3 algorithm is used by training on a data set S {\displaystyle S} to produce a decision tree which is stored in memory. At runtime, this decision tree is used to classify new test cases (feature vectors) by traversing the decision tree using the features of the datum to arrive at a leaf node. == The ID3 metrics == === Entropy === Entropy H ( S ) {\displaystyle \mathrm {H} {(S)}} is a measure of the amount of uncertainty in the (data) set S {\displaystyle S} (i.e. entropy characterizes the (data) set S {\displaystyle S} ). H ( S ) = ∑ x ∈ X − p ( x ) log 2 ⁡ p ( x ) {\displaystyle \mathrm {H} {(S)}=\sum _{x\in X}{-p(x)\log _{2}p(x)}} Where, S {\displaystyle S} – The current dataset for which entropy is being calculated This changes at each step of the ID3 algorithm, either to a subset of the previous set in the case of splitting on an attribute or to a "sibling" partition of the parent in case the recursion terminated previously. X {\displaystyle X} – The set of classes in S {\displaystyle S} p ( x ) {\displaystyle p(x)} – The proportion of the number of elements in class x {\displaystyle x} to the number of elements in set S {\displaystyle S} When H ( S ) = 0 {\displaystyle \mathrm {H} {(S)}=0} , the set S {\displaystyle S} is perfectly classified (i.e. all elements in S {\displaystyle S} are of the same class). In ID3, entropy is calculated for each remaining attribute. The attribute with the smallest entropy is used to split the set S {\displaystyle S} on this iteration. Entropy in information theory measures how much information is expected to be gained upon measuring a random variable; as such, it can also be used to quantify the amount to which the distribution of the quantity's values is unknown. A constant quantity has zero entropy, as its distribution is perfectly known. In contrast, a uniformly distributed random variable (discretely or continuously uniform) maximizes entropy. Therefore, the greater the entropy at a node, the less information is known about the classification of data at this stage of the tree; and therefore, the greater the potential to improve the classification here. As such, ID3 is a greedy heuristic performing a best-first search for locally optimal entropy values. Its accuracy can be improved by preprocessing the data. === Information gain === Information gain I G ( A ) {\displaystyle IG(A)} is the measure of the difference in entropy from before to after the set S {\displaystyle S} is split on an attribute A {\displaystyle A} . In other words, how much uncertainty in S {\displaystyle S} was reduced after splitting set S {\displaystyle S} on attribute A {\displaystyle A} . I G ( S , A ) = H ( S ) − ∑ t ∈ T p ( t ) H ( t ) = H ( S ) − H ( S | A ) . {\displaystyle IG(S,A)=\mathrm {H} {(S)}-\sum _{t\in T}p(t)\mathrm {H} {(t)}=\mathrm {H} {(S)}-\mathrm {H} {(S|A)}.} Where, H ( S ) {\displaystyle \mathrm {H} (S)} – Entropy of set S {\displaystyle S} T {\displaystyle T} – The subsets created from splitting set S {\displaystyle S} by attribute A {\displaystyle A} such that S = ⋃ t ∈ T t {\displaystyle S=\bigcup _{t\in T}t} p ( t ) {\displaystyle p(t)} – The proportion of the number of elements in t {\displaystyle t} to the number of elements in set S {\displaystyle S} H ( t ) {\displaystyle \mathrm {H} (t)} – Entropy of subset t {\displaystyle t} In ID3, information gain can be calculated (instead of entropy) for each remaining attribute. The attribute with the largest information gain is used to split the set S {\displaystyle S} on this iteration.

    Read more →
  • Correlation clustering

    Correlation clustering

    Clustering is the problem of partitioning data points into groups based on similarity or dissimilarity. Correlation clustering is a clustering framework in which a set of objects is partitioned into clusters based on pairwise similarity and dissimilarity information, without requiring the number of clusters to be specified in advance. == Description of the problem == In machine learning, correlation clustering (also known as cluster editing) considers settings in which pairwise similarity or dissimilarity relationships between objects are known. A standard formulation models the input as an unweighted complete graph G = ( V , E ) {\displaystyle G=(V,E)} , where each edge is labeled either + {\displaystyle +} or − {\displaystyle -} (that is, the graph is a signed graph), indicating whether the corresponding endpoints are similar or dissimilar. The goal is to find a clustering (that is, a partition of V {\displaystyle V} ) that either maximizes the number of agreements—the sum of positive edges whose endpoints lie in the same cluster and negative edges whose endpoints lie in different clusters—or minimizes the number of disagreements—the sum of positive edges whose endpoints are separated and negative edges whose endpoints lie in the same cluster. Unlike other clustering methods such as k-means, correlation clustering does not require choosing the number of clusters k {\displaystyle k} in advance. It is not always possible to find a clustering with zero disagreements. For example, consider a triangle graph containing two positive edges and one negative edge. In this case, every clustering incurs at least one disagreement. Such configurations are referred to in the literature as bad triangles. From a computational perspective, optimizing the correlation clustering objective is challenging. The (decision version of the) problem is NP-complete. A large body of subsequent work has developed approximation algorithms for correlation clustering under various assumptions, including complete or general graphs and unweighted or weighted graphs, for both minimization and maximization objectives. This problem is considered one of the fundamental combinatorial optimization problems, and many algorithmic techniques have been developed to address it. The problem has also been studied extensively across multiple disciplines. A comprehensive literature review of early correlation clustering research is provided by Wahid and Hassini. == Formal Definitions == Let G = ( V , E ) {\displaystyle G=(V,E)} be a graph with nodes V {\displaystyle V} and edges E {\displaystyle E} . A clustering of G {\displaystyle G} is a partition of its node set Π = { π 1 , … , π k } {\displaystyle \Pi =\{\pi _{1},\dots ,\pi _{k}\}} with V = π 1 ∪ ⋯ ∪ π k {\displaystyle V=\pi _{1}\cup \dots \cup \pi _{k}} and π i ∩ π j = ∅ {\displaystyle \pi _{i}\cap \pi _{j}=\emptyset } for i ≠ j {\displaystyle i\neq j} . For a given clustering Π {\displaystyle \Pi } , let δ ( Π ) = { { u , v } ∈ E ∣ { u , v } ⊈ π ∀ π ∈ Π } {\displaystyle \delta (\Pi )=\{\{u,v\}\in E\mid \{u,v\}\not \subseteq \pi \;\forall \pi \in \Pi \}} denote the subset of edges of G {\displaystyle G} whose endpoints are in different subsets of the clustering Π {\displaystyle \Pi } . Now, let w : E → R ≥ 0 {\displaystyle w\colon E\to \mathbb {R} _{\geq 0}} be a function that assigns a non-negative weight to each edge of the graph and let E = E + ∪ E − {\displaystyle E=E^{+}\cup E^{-}} be a partition of the edges into attractive ( E + {\displaystyle E^{+}} ) and repulsive ( E − {\displaystyle E^{-}} ) edges; that is, the edges are signed. The minimum disagreement correlation clustering problem is the following optimization problem: minimize Π ∑ e ∈ E + ∩ δ ( Π ) w e + ∑ e ∈ E − ∖ δ ( Π ) w e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {minimize} }}&&\sum _{e\in E^{+}\cap \delta (\Pi )}w_{e}+\sum _{e\in E^{-}\setminus \delta (\Pi )}w_{e}\;.\end{aligned}}} Here, the set E + ∩ δ ( Π ) {\displaystyle E^{+}\cap \delta (\Pi )} contains the attractive edges whose endpoints are in different components with respect to the clustering Π {\displaystyle \Pi } and the set E − ∖ δ ( Π ) {\displaystyle E^{-}\setminus \delta (\Pi )} contains the repulsive edges whose endpoints are in the same component with respect to the clustering Π {\displaystyle \Pi } . Together these two sets contain all edges that disagree with the clustering Π {\displaystyle \Pi } . Similarly to the minimum disagreement correlation clustering problem, the maximum agreement correlation clustering problem is defined as maximize Π ∑ e ∈ E + ∖ δ ( Π ) w e + ∑ e ∈ E − ∩ δ ( Π ) w e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {maximize} }}&&\sum _{e\in E^{+}\setminus \delta (\Pi )}w_{e}+\sum _{e\in E^{-}\cap \delta (\Pi )}w_{e}\;.\end{aligned}}} Here, the set E + ∖ δ ( Π ) {\displaystyle E^{+}\setminus \delta (\Pi )} contains the attractive edges whose endpoints are in the same component with respect to the clustering Π {\displaystyle \Pi } and the set E − ∩ δ ( Π ) {\displaystyle E^{-}\cap \delta (\Pi )} contains the repulsive edges whose endpoints are in different components with respect to the clustering Π {\displaystyle \Pi } . Together these two sets contain all edges that agree with the clustering Π {\displaystyle \Pi } . Instead of formulating the correlation clustering problem in terms of non-negative edge weights and a partition of the edges into attractive and repulsive edges the problem is also formulated in terms of positive and negative edge costs without partitioning the set of edges explicitly. For given weights w : E → R ≥ 0 {\displaystyle w\colon E\to \mathbb {R} _{\geq 0}} and a given partition E = E + ∪ E − {\displaystyle E=E^{+}\cup E^{-}} of the edges into attractive and repulsive edges, the edge costs can be defined by c e = { w e if e ∈ E + − w e if e ∈ E − {\displaystyle {\begin{aligned}c_{e}={\begin{cases}\;\;w_{e}&{\text{if }}e\in E^{+}\\-w_{e}&{\text{if }}e\in E^{-}\end{cases}}\end{aligned}}} for all e ∈ E {\displaystyle e\in E} . An edge whose endpoints are in different clusters is said to be cut. The set δ ( Π ) {\displaystyle \delta (\Pi )} of all edges that are cut is often called a multicut of G {\displaystyle G} . The minimum cost multicut problem is the problem of finding a clustering Π {\displaystyle \Pi } of G {\displaystyle G} such that the sum of the costs of the edges whose endpoints are in different clusters is minimal: minimize Π ∑ e ∈ δ ( Π ) c e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {minimize} }}&&\sum _{e\in \delta (\Pi )}c_{e}\;.\end{aligned}}} Similar to the minimum cost multicut problem, coalition structure generation in weighted graph games is the problem of finding a clustering such that the sum of the costs of the edges that are not cut is maximal: maximize Π ∑ e ∈ E ∖ δ ( Π ) c e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {maximize} }}&&\sum _{e\in E\setminus \delta (\Pi )}c_{e}\;.\end{aligned}}} This formulation is also known as the clique partitioning problem. It can be shown that all four problems that are formulated above are equivalent. This means that a clustering that is optimal with respect to any of the four objectives is optimal for all of the four objectives. == Algorithms == If the graph admits a clustering with zero disagreements, then deleting all negative edges and computing the connected components of the remaining graph yields an optimal clustering. A necessary and sufficient condition for the existence of such a clustering was given by Davis: no cycle in the graph may contain exactly one negative edge. Bansal et al. discuss the NP-completeness proof and also present both a constant factor approximation algorithm and polynomial-time approximation scheme to find the clusters in this setting. Ailon et al. propose a randomized 3-approximation algorithm for the same problem. CC-Pivot(G=(V,E+,E−)) Pick random pivot i ∈ V Set C = { i } {\displaystyle C=\{i\}} , V'=Ø For all j ∈ V, j ≠ i; If (i,j) ∈ E+ then Add j to C Else (If (i,j) ∈ E−) Add j to V' Let G' be the subgraph induced by V' Return clustering C,CC-Pivot(G') The authors show that the above algorithm is a 3-approximation algorithm for correlation clustering. The best polynomial-time approximation algorithm known at the moment for this problem achieves a ~2.06 approximation by rounding a linear program, as shown by Chawla, Makarychev, Schramm, and Yaroslavtsev. Karpinski and Schudy proved existence of a polynomial time approximation scheme (PTAS) for that problem on complete graphs and fixed number of clusters. == Optimal number of clusters == In 2011, it was shown by Bagon and Galun that the optimization of the correlation clustering functional is closely related to well known discrete optimization methods. In their work they proposed a probabilistic analysis of the underlying implicit model that allows the correlation clustering functional to estimate the

    Read more →
  • Artifact (app)

    Artifact (app)

    Artifact was a personalized social news aggregator app that uses recommender systems to suggest articles. Launched in January 2023 by Nokto, Inc., a company founded by co-founders of Instagram Kevin Systrom and Mike Krieger, the app is available for iOS and Android. The app's name is a portmanteau of the words "articles", "artificial intelligence", and "fact". The app shut down in January 2024 as a result of low interest. == History == Nokto, Inc. was established on March 3, 2022, as a foreign stock company in California, with its headquarters in San Francisco. The company's main product, Artifact, is the first new product launched by Krieger and Systrom since their 2018 resignation from Instagram after conflicts with parent company Meta, which acquired Instagram in 2012. Artifact launched on January 31, 2023, after the team had been working on it for over a year, offering the option to sign up for a waiting list for its private beta, which grew to about 160,000 people, and then launching in open beta on February 22, 2023. With a team of seven employees in San Francisco, the app was free throughout its lifetime, with the founders explaining at the time that different business models - such as advertising or subscription fees - could be explored in the future. In January 2024, cofounder Kevin Systrom announced that the app would be shutting down after concluding that "the market opportunity isn’t big enough to warrant continued investment in this way." In April 2024, it was announced Artifact had been acquired by Yahoo, who intended to use the service's technology in an upgraded Yahoo! News app. == Features == Frequently described as "TikTok for text" and a competitor to Twitter, Artifact was a news aggregator that used machine learning to make personalized recommendations based on topics, news sources, and authors that the reader is interested in. In addition to reading articles, the app offered the ability to like articles, leave comments, or listen to an audio version of an article read by AI-generated voices, including a simulation of the voices of Snoop Dogg or Gwyneth Paltrow. AI also would rewrite clickbait headlines that users flagged. Artifact later expanded to a social network where users could post links, images and text to their profile, which could be liked or commented on by other users. Similar to other social news websites like Reddit, reader accounts had profiles with reputation scores.

    Read more →
  • KNIME

    KNIME

    KNIME ( ), the Konstanz Information Miner, is a data analytics, reporting and integrating platform. KNIME integrates various components for machine learning and data mining through its modular data pipelining "Building Blocks of Analytics" concept. A graphical user interface and use of Java Database Connectivity (JDBC) allows assembly of nodes blending different data sources, including preprocessing (extract, transform, load, or ETL), for modeling, data analysis and visualization with minimal, or no, programming. It is free and open-source software released under a GNU General Public License. Since 2006, KNIME has been used in pharmaceutical research, and in other areas including customer relationship management (CRM) and data analysis, business intelligence, text mining and financial data analysis. Recently, attempts were made to use KNIME as robotic process automation (RPA) tool. KNIME's headquarters are based in Zurich, with other offices in Konstanz, Berlin, and Austin (USA). == History == Development of KNIME began in January 2004, with a team of software engineers at the University of Konstanz, as an open-source platform. The original team, headed by Michael Berthold, came from a Silicon Valley pharmaceutical industry software company. The initial goal was to create a modular, highly scalable and open data processing platform that allows easy integration of different data loading, processing, transforming, analyzing, and visual exploring modules, without focus on any one application area. The platform was intended for collaborating, research, and for integrating various other data analysis projects. In 2006, the first version of KNIME was released. Several pharmaceutical companies began using KNIME, and several life science software vendors began integrating their tools into the platform. Later that year, after an article in the German magazine c't, users from a number of other areas joined ship. As of 2012, KNIME is in use by over 15,000 actual users (i.e. not counting downloads, but users regularly retrieving updates) in the life sciences and at banks, publishers, car manufacturer, telcos, consulting firms, and various other industries, and a large number of research groups, worldwide. Latest updates to KNIME Server and KNIME Big Data Extensions, provide support for Apache Spark 2.3, Parquet and HDFS-type storage. For the sixth year in a row, KNIME has been placed as a leader for data science and machine learning platforms in Gartner's Magic Quadrant. == Design philosophy, features == These are the design principles and features that KNIME software follows: Visual, Interactive Framework: KNIME Software prioritizes a user-friendly and intuitive approach to data analysis. This is achieved through a visual and interactive framework where data flows can be combined using a drag-and-drop interface. Users can develop customized and interactive applications by creating simple to advanced and highly-automated data pipelines. These may include, for example, access to databases, machine learning libraries, logic for workflow control (e.g., loops, switches, etc.), abstraction (e.g., interactive widgets), invocation, dynamic data apps, integrated deployment, or error handling. Modularity: processing units and data containers should remain independent of each other. This design choice enables easy distribution of computation and allows for the independent development of different algorithms. Data types within KNIME are encapsulated, meaning no types are predefined. This design choice facilitates adding new data types, and integrating them with extant types, while including type-specific renderers and comparators. This principle also enables inspecting results at the end of each single data operation. Extensibility: KNIME Software is designed to be extensible. Adding new processing nodes or views is made simple through a plug-in mechanism. This mechanism ensures that users can distribute their custom functionalities without the need for complicated install or uninstall procedures. Interleaving No-Code with Code: the platform supports integrating both visual programming (no-code) and script-based programming (e.g., Python, R, JavaScript) approaches to data analysis. This design principle is termed low-code. Automation and Scalability: for example, the use of parameterization via flow variables, or the encapsulation of workflow segments in components contribute to reduce manual work and errors in analyses. Further, the scheduling of workflow execution (available in KNIME Business Hub and KNIME Community Hub for Teams) reduces dependency on human resources. In terms of scalability, a few examples include the ability to handle large datasets (millions of rows), execute multiple processes simultaneously out of the box and reuse workflow segments. Full Usability: due to the open source nature, KNIME Analytics Platform provides free full usability with no limited trial periods. == Internals == KNIME allows users to visually create data flows (or pipelines), selectively execute some or all analysis steps, and later inspect the results, models, using interactive widgets and views. KNIME is written in Java and based on Eclipse. It makes use of an extension mechanism to add plug-ins providing added functions. The core version includes hundreds of modules for data integration (file input/output (I/O), database nodes supporting all common database management systems through JDBC or native connectors: SQLite, MS-Access, SQL Server, MySQL, Oracle, PostgreSQL, Vertica and H2), data transformation (filter, converter, splitter, combiner, joiner), and the commonly used methods of statistics, data mining, analysis and text analytics. Visualization is supported with the Report Designer extension. KNIME workflows can be used as data sets to create report templates that can be exported to document formats such as doc, ppt, xls, pdf and others. Other KNIME abilities are: KNIMEs core-architecture allows processing of large data volumes that are only limited by the available hard disk space (not limited to the available RAM). E.g., KNIME allows analyzing 300 million customer addresses, 20 million cell images, and 10 million molecular structures. Added plug-ins allow integrating methods for text mining, image mining, time series analysis, and networking. KNIME integrates various other open-source projects, e.g., machine learning algorithms from Weka, H2O, Keras, Spark, the R project and LIBSVM; plotly, JFreeChart, ImageJ, and the Chemistry Development Kit. KNIME is implemented in Java, allows for wrappers calling other code, in addition to providing nodes that allow it to run Java, Python, R, Ruby and other code fragments. Since 2021, KNIME's Python Integration utilizes Anaconda for Python distribution and environment management. == License == In 2024, KNIME version 5.3 is released under the same GPLv3 license as previous versions. As of version 2.1, KNIME is released under the GPLv3 license, with an exception that allow commercial software vendors to use the well-defined node application programming interface (API) to add proprietary extensions, or wrappers calling their tools from KNIME. == Courses == KNIME allows the performance of data analysis without programming skills. Several free, online courses are provided.

    Read more →
  • Inverted pendulum

    Inverted pendulum

    An inverted pendulum is a pendulum that has its center of mass above its pivot point. It is unstable and falls over without additional help. It can be suspended stably in this inverted position by using a control system to monitor the angle of the pole and move the pivot point horizontally back under the center of mass when it starts to fall over, keeping it balanced. The inverted pendulum is a classic problem in dynamics and control theory and is used as a benchmark for testing control strategies. It is often implemented with the pivot point mounted on a cart that can move horizontally under control of an electronic servo system as shown in the photo; this is called a cart and pole apparatus. Most applications limit the pendulum to 1 degree of freedom by affixing the pole to an axis of rotation. Whereas a normal pendulum is stable when hanging downward, an inverted pendulum is inherently unstable, and must be actively balanced in order to remain upright; this can be done either by applying a torque at the pivot point, by moving the pivot point horizontally as part of a feedback system, changing the rate of rotation of a mass mounted on the pendulum on an axis parallel to the pivot axis and thereby generating a net torque on the pendulum, or by oscillating the pivot point vertically. A simple demonstration of moving the pivot point in a feedback system is achieved by balancing an upturned broomstick on the end of one's finger. A second type of inverted pendulum is a tiltmeter for tall structures, which consists of a wire anchored to the bottom of the foundation and attached to a float in a pool of oil at the top of the structure that has devices for measuring movement of the neutral position of the float away from its original position. == Overview == A pendulum with its bob hanging directly below the support pivot is at a stable equilibrium point, where it remains motionless because there is no torque on the pendulum. If displaced from this position, it experiences a restoring torque that returns it toward the equilibrium position. A pendulum with its bob in an inverted position, supported on a rigid rod directly above the pivot, 180° from its stable equilibrium position, is at an unstable equilibrium point. At this point again there is no torque on the pendulum, but the slightest displacement away from this position causes a gravitation torque on the pendulum that accelerates it away from equilibrium, causing it to fall over. In order to stabilize a pendulum in this inverted position, a feedback control system can be used, which monitors the pendulum's angle and moves the position of the pivot point sideways when the pendulum starts to fall over, to keep it balanced. The inverted pendulum is a classic problem in dynamics and control theory and is widely used as a benchmark for testing control algorithms (PID controllers, state-space representation, neural networks, fuzzy control, genetic algorithms, etc.). Variations on this problem include multiple links, allowing the motion of the cart to be commanded while maintaining the pendulum, and balancing the cart-pendulum system on a see-saw. The inverted pendulum is related to rocket or missile guidance, where the center of gravity is located behind the center of drag causing aerodynamic instability. The understanding of a similar problem can be shown by simple robotics in the form of a balancing cart. Balancing an upturned broomstick on the end of one's finger is a simple demonstration, and the problem is solved by self-balancing personal transporters such as the Segway PT, the self-balancing hoverboard and the self-balancing unicycle. Another way that an inverted pendulum may be stabilized, without any feedback or control mechanism, is by oscillating the pivot rapidly up and down. This is called Kapitza's pendulum. If the oscillation is sufficiently strong (in terms of its acceleration and amplitude) then the inverted pendulum can recover from perturbations in a strikingly counterintuitive manner. If the driving point moves in simple harmonic motion, the pendulum's motion is described by the Mathieu equation. == Equations of motion == The equations of motion of inverted pendulums are dependent on what constraints are placed on the motion of the pendulum. Inverted pendulums can be created in various configurations resulting in a number of Equations of Motion describing the behavior of the pendulum. === Stationary pivot point === In a configuration where the pivot point of the pendulum is fixed in space, the equation of motion is similar to that for an uninverted pendulum. The equation of motion below assumes no friction or any other resistance to movement, a rigid massless rod, and the restriction to 2-dimensional movement. θ ¨ − g ℓ sin ⁡ θ = 0 {\displaystyle {\ddot {\theta }}-{g \over \ell }\sin \theta =0} Where θ ¨ {\displaystyle {\ddot {\theta }}} is the angular acceleration of the pendulum, g {\displaystyle g} is the standard gravity on the surface of the Earth, ℓ {\displaystyle \ell } is the length of the pendulum, and θ {\displaystyle \theta } is the angular displacement measured from the equilibrium position. When θ ¨ {\displaystyle {\ddot {\theta }}} added to both sides, it has the same sign as the angular acceleration term: θ ¨ = g ℓ sin ⁡ θ {\displaystyle {\ddot {\theta }}={g \over \ell }\sin \theta } Thus, the inverted pendulum accelerates away from the vertical unstable equilibrium in the direction initially displaced, and the acceleration is inversely proportional to the length. Tall pendulums fall more slowly than short ones. Derivation using torque and moment of inertia: The pendulum is assumed to consist of a point mass, of mass m {\displaystyle m} , affixed to the end of a massless rigid rod, of length ℓ {\displaystyle \ell } , attached to a pivot point at the end opposite the point mass. The net torque of the system must equal the moment of inertia times the angular acceleration: τ n e t = I θ ¨ {\displaystyle {\boldsymbol {\tau }}_{\mathrm {net} }=I{\ddot {\theta }}} The torque due to gravity providing the net torque: τ n e t = m g ℓ sin ⁡ θ {\displaystyle {\boldsymbol {\tau }}_{\mathrm {net} }=mg\ell \sin \theta \,\!} Where θ {\displaystyle \theta \ } is the angle measured from the inverted equilibrium position. The resulting equation: I θ ¨ = m g ℓ sin ⁡ θ {\displaystyle I{\ddot {\theta }}=mg\ell \sin \theta \,\!} The moment of inertia for a point mass: I = m R 2 {\displaystyle I=mR^{2}} In the case of the inverted pendulum the radius is the length of the rod, ℓ {\displaystyle \ell } . Substituting in I = m ℓ 2 {\displaystyle I=m\ell ^{2}} m ℓ 2 θ ¨ = m g ℓ sin ⁡ θ {\displaystyle m\ell ^{2}{\ddot {\theta }}=mg\ell \sin \theta \,\!} Mass and ℓ 2 {\displaystyle \ell ^{2}} is divided from each side resulting in: θ ¨ = g ℓ sin ⁡ θ {\displaystyle {\ddot {\theta }}={g \over \ell }\sin \theta } === Inverted pendulum on a cart === An inverted pendulum on a cart consists of a mass m {\displaystyle m} at the top of a pole of length ℓ {\displaystyle \ell } pivoted on a horizontally moving base as shown in the adjacent image. The cart is restricted to linear motion and is subject to forces resulting in or hindering motion. === Essentials of stabilization === The essentials of stabilizing the inverted pendulum can be summarized qualitatively in three steps. 1. If the tilt angle θ {\displaystyle \theta } is to the right, the cart must accelerate to the right and vice versa. 2. The position of the cart x {\displaystyle x} relative to track center is stabilized by slightly modulating the null angle (the angle error that the control system tries to null) by the position of the cart, that is, null angle = θ + k x {\displaystyle =\theta +kx} where k {\displaystyle k} is small. This makes the pole want to lean slightly toward track center and stabilize at track center where the tilt angle is exactly vertical. Any offset in the tilt sensor or track slope that would otherwise cause instability translates into a stable position offset. A further added offset gives position control. 3. A normal pendulum subject to a moving pivot point such as a load lifted by a crane, has a peaked response at the pendulum radian frequency of ω p = g / ℓ {\displaystyle \omega _{p}={\sqrt {g/\ell }}} . To prevent uncontrolled swinging, the frequency spectrum of the pivot motion should be suppressed near ω p {\displaystyle \omega _{p}} . The inverted pendulum requires the same suppression filter to achieve stability. As a consequence of the null angle modulation strategy, the position feedback is positive, that is, a sudden command to move right produces an initial cart motion to the left followed by a move right to rebalance the pendulum. The interaction of the pendulum instability and the positive position feedback instability to produce a stable system is a feature that makes the mathematical analysis an interesting and challenging problem. === From Lagrange's equations === The equations of motion c

    Read more →