A mind map is a diagram used to visually organize information into a hierarchy, showing relationships among pieces of the whole. It is often based on a single concept, drawn as an image in the center of a blank page, to which associated representations of ideas such as images, words and parts of words are added. Major ideas are connected directly to the central concept, and other ideas branch out from those major ideas. Mind maps can also be drawn by hand, either as "notes" during a lecture, meeting or planning session, for example, or as higher quality pictures when more time is available. Mind maps are considered to be a type of spider diagram. == Origin == Although the term "mind map" was first popularized by British popular psychology author and television personality Tony Buzan, the use of diagrams that visually "map" information using branching and radial maps traces back centuries. These pictorial methods record knowledge and model systems, and have a long history in learning, brainstorming, memory, visual thinking, and problem solving by educators, engineers, psychologists, and others. Some of the earliest examples of such graphical records were developed by Porphyry of Tyros, a noted thinker of the 3rd century, as he graphically visualized the concept categories of Aristotle. Philosopher Ramon Llull (1235–1315) also used such techniques. Buzan's specific approach, and the introduction of the term "mind map", started with a 1974 BBC TV series he hosted, called Use Your Head. In this show, and companion book series, Buzan promoted his conception of radial tree, diagramming key words in a colorful, radiant, tree-like structure. == Differences from other visualizations == Concept maps: Mind maps differ from concept maps in that mind maps are based on a radial hierarchy (tree structure) denoting relationships with a central concept, whereas concept maps can be more free-form, based on connections between concepts in more diverse patterns. Also, concept maps typically have text labels on the links between nodes. However, either can be part of a larger personal knowledge base system. Modeling graphs or graphical modeling languages: There is no rigorous right or wrong with mind maps, which rely on the arbitrariness of mnemonic associations to aid people's information organization and memory. In contrast, a modeling graph such as a UML diagram structures elements using a precise standardized iconography to aid the design of systems. == Research == === Effectiveness === Cunningham (2005) conducted a user study in which 80% of the students thought "mindmapping helped them understand concepts and ideas in science". Other studies also report some subjective positive effects of the use of mind maps. Positive opinions on their effectiveness, however, were much more prominent among students of art and design than in students of computer and information technology, with 62.5% vs 34% (respectively) agreeing that they were able to understand concepts better with mind mapping software. Farrand, Hussain, and Hennessy (2002) found that spider diagrams (similar to concept maps) had limited, but significant, impact on memory recall in undergraduate students (a 10% increase over baseline for a 600-word text only) as compared to preferred study methods (a 6% increase over baseline). This improvement was only robust after a week for those in the diagram group and there was a significant decrease in motivation compared to the subjects' preferred methods of note taking. A meta study about concept mapping concluded that concept mapping is more effective than "reading text passages, attending lectures, and participating in class discussions". The same study also concluded that concept mapping is slightly more effective "than other constructive activities such as writing summaries and outlines". However, results were inconsistent, with the authors noting "significant heterogeneity was found in most subsets". In addition, they concluded that low-ability students may benefit more from mind mapping than high-ability students. === Features === Joeran Beel and Stefan Langer conducted a comprehensive analysis of the content of mind maps. They analysed 19,379 mind maps from 11,179 users of the mind mapping applications SciPlore MindMapping (now Docear) and MindMeister. Results include that average users create only a few mind maps (mean=2.7), average mind maps are rather small (31 nodes) with each node containing about three words (median). However, there were exceptions. One user created more than 200 mind maps, the largest mind map consisted of more than 50,000 nodes and the largest node contained ~7,500 words. The study also showed that between different mind mapping applications (Docear vs MindMeister) significant differences exist related to how users create mind maps. === Automatic creation === There have been some attempts to create mind maps automatically. Brucks & Schommer created mind maps automatically from full-text streams. Rothenberger et al. extracted the main story of a text and presented it as mind map. There is also a patent application about automatically creating sub-topics in mind maps. == Tools == Mind-mapping software can be used to organize large amounts of information, combining spatial organization, dynamic hierarchical structuring and node folding.Software packages can extend the concept of mind-mapping by allowing individuals to map more than thoughts and ideas with information on their computers and the Internet, like spreadsheets, documents, Internet sites, images and videos. It has been suggested that mind-mapping can improve learning/study efficiency up to 15% over conventional note-taking. == Gallery == The following dozen examples of mind maps show the range of styles that a mind map may take, from hand-drawn to computer-generated and from mostly text to highly illustrated. Despite their stylistic differences, all of the examples share a tree structure that hierarchically connects sub-topics to a main topic.
Large language model
A large language model (LLM) is a neural network trained on a vast amount of text for natural language processing tasks, especially language generation. LLMs can typically generate, summarize, translate and analyze text in many contexts, and are a foundational technology behind modern chatbots. Biased or inaccurate training data can make an LLM's output less reliable. As of 2026, the most capable LLMs are based on transformer architectures, which, according to the 2017 paper "Attention Is All You Need", can be more efficient and parallelizable than earlier statistical and recurrent neural network models. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety. == History == Before the emergence of transformer-based models in 2017, some language models were considered large relative to the computational and data constraints of their time. In the early 1990s, IBM's statistical models pioneered word alignment techniques for machine translation, laying the groundwork for corpus-based language modeling. In 2001, a smoothed n-gram model, such as those employing Kneser–Ney smoothing, trained on 300 million words, achieved state-of-the-art perplexity on benchmark tests. During the 2000s, with the rise of widespread internet access, researchers began compiling massive text datasets from the web ("web as corpus") to train statistical language models. Moving beyond n-gram models, researchers started in 2000 to use neural networks as language models. Following the breakthrough of deep neural networks in image classification around 2012, similar architectures were adapted for language tasks. This shift was marked by the development of word embeddings (e.g., Word2Vec by Mikolov in 2013) and sequence-to-sequence (seq2seq) models using LSTM. In 2016, Google transitioned its translation service to neural machine translation (NMT), replacing statistical phrase-based models with deep recurrent neural networks. These early NMT systems used LSTM-based encoder-decoder architectures, as they preceded the invention of transformers. At the 2017 NeurIPS conference, Google researchers introduced the transformer architecture in their landmark paper "Attention Is All You Need". This paper's goal was to improve upon 2014 seq2seq technology, and was based mainly on the attention mechanism developed by Bahdanau et al. in 2014. The following year in 2018, BERT was introduced and quickly became "ubiquitous". Though the original transformer has both encoder and decoder blocks, BERT is an encoder-only model. Academic and research usage of BERT began to decline in 2023, following rapid improvements in the abilities of decoder-only models (such as GPT) to solve tasks via prompting. Although decoder-only GPT-1 was introduced in 2018, it was GPT-2 in 2019 that caught widespread attention because OpenAI claimed to have initially deemed it too powerful to release publicly, out of fear of malicious use. GPT-3 in 2020 went a step further and as of 2025 is available only via API with no offering of downloading the model to execute locally. But it was the consumer-facing chatbot ChatGPT in late 2022 that received extensive media coverage and public attention by 2023. The 2023 GPT-4 was praised for its increased accuracy and as a "holy grail" for its multimodal capabilities. OpenAI did not reveal the high-level architecture and the number of parameters of GPT-4. The release of ChatGPT led to an uptick in LLM usage across several research subfields of computer science, including robotics, software engineering, and societal impact work. In 2024, OpenAI released the reasoning model OpenAI o1, which generates long chains of thought before returning a final answer. Many LLMs with parameter counts comparable to those of OpenAI's GPT series have been developed. Since 2022, weights-available models have been gaining popularity, especially at first with BLOOM and LLaMA, though both have restrictions on usage and deployment. Mistral AI's open-weight models Mistral 7B and Mixtral 8x7B have a more permissive Apache License. In January 2025, DeepSeek released DeepSeek R1, a 671-billion-parameter open-weight model that performs comparably to OpenAI o1 but at a much lower price per token for users. Since 2023, many LLMs have been trained to be multimodal, having the ability to also process or generate other types of data, such as images, audio, or 3D meshes. Open-weight LLMs have become more influential since 2023. Per Vake et al. (2025), community-driven contributions to open-weight models improve their efficiency and performance via collaborative platforms such as Hugging Face. == Dataset preprocessing == === Tokenization === As machine learning algorithms process numbers rather than text, the text must be converted to numbers. In the first step, a vocabulary is decided upon, then integer indices are arbitrarily but uniquely assigned to each vocabulary entry, and finally, an embedding is associated with the integer index. Algorithms include byte-pair encoding (BPE) and WordPiece. There are also special tokens serving as control characters, such as [MASK] for masked-out token (as used in BERT), and [UNK] ("unknown") for characters not appearing in the vocabulary. Also, some special symbols are used to denote special text formatting. For example, "Ġ" denotes a preceding whitespace in RoBERTa and GPT and "##" denotes continuation of a preceding word in BERT. For example, the BPE tokenizer used by the legacy version of GPT-3 would split tokenizer: texts -> series of numerical "tokens" as Tokenization also compresses the datasets. Because LLMs generally require input to be an array that is not jagged, the shorter texts must be "padded" until they match the length of the longest one. ==== Byte-pair encoding ==== As an example, consider a tokenizer based on byte-pair encoding. In the first step, all unique characters (including blanks and punctuation marks) are treated as an initial set of n-grams (i.e. initial set of uni-grams). Successively the most frequent pair of adjacent characters is merged into a bi-gram and all instances of the pair are replaced by it. All occurrences of adjacent pairs of (previously merged) n-grams that most frequently occur together are then again merged into even lengthier n-gram, until a vocabulary of prescribed size is obtained. After a tokenizer is trained, any text can be tokenized by it, as long as it does not contain characters not appearing in the initial-set of uni-grams. === Dataset cleaning === In the context of training LLMs, datasets are typically cleaned by removing low-quality, duplicated, or toxic data. Cleaned datasets can increase training efficiency and lead to improved downstream performance. A trained LLM can be used to clean datasets for training a further LLM. With the increasing proportion of LLM-generated content on the web, data cleaning in the future may include filtering out such content. LLM-generated content can pose a problem if the content is similar to human text (making filtering difficult) but of lower quality (degrading performance of models trained on it). === Synthetic data === Training of largest language models might need more linguistic data than naturally available, or that the naturally occurring data is of insufficient quality. In these cases, synthetic data might be used. == Training == An LLM is a type of foundation model (large X model) trained on language. LLMs can be trained in different ways. In particular, GPT models are first pretrained to predict the next word on a large amount of data, before being fine-tuned. === Cost === Substantial infrastructure is necessary for training the largest models. The tendency towards larger models is visible in the list of large language models. For example, the training of GPT-2 (i.e. a 1.5-billion-parameter model) in 2019 cost $50,000, while training of the PaLM (i.e. a 540-billion-parameter model) in 2022 cost $8 million, and Megatron-Turing NLG 530B (in 2021) cost around $11 million. The qualifier "large" in "large language model" is inherently vague, as there is no definitive threshold for the number of parameters required to qualify as "large". === Fine-tuning === Before being fine-tuned, most LLMs are next-token predictors. The fine-tuning shapes the LLM's behavior via techniques like reinforcement learning from human feedback (RLHF) or constitutional AI. Instruction fine-tuning is a form of supervised learning used to teach LLMs to follow user instructions. In 2022, OpenAI demonstrated InstructGPT, a version of GPT-3 similarly fine-tuned to follow instructions. Reinforcement learning from human feedback (RLHF) involves training a reward model to predict which text humans prefer. Then, the LLM can be fine-tuned through reinforcement learning to better satisfy this reward model. Since humans typically prefer truthful, helpful and harmless answers, RLHF favors such answers. == Architecture == LLMs are generally based on the tra
Elements of AI
Elements of AI is a massive open online course (MOOC) teaching the basics of artificial intelligence. The course, originally launched in 2018, is designed and organized by the University of Helsinki and learning technology company MinnaLearn. The course includes modules on machine learning, neural networks, the philosophy of artificial intelligence, and using artificial intelligence to solve problems. It consists of two parts: Introduction to AI and its sequel, Building AI, that was released in late 2020. In November 2019, the course was named one of four winners of MIT’s Inclusive Innovation Challenge. University of Helsinki's computer science department is known as the alma mater of Linus Torvalds, a Finnish-American software engineer who is the creator of the Linux kernel, which is the kernel for Linux operating systems. == EU’s AI pledge == The government of Finland has pledged to offer the course for all EU citizens by the end of 2021, as the course is made available in all the official EU languages. The initiative was launched as part of Finland's Presidency of the Council of the European Union in 2019, with the European Commission providing translations of the course materials. In 2017, Finland launched an AI strategy to stay competitive in the field of AI amid growing competition between China and the United States. With the support of private companies and the government, Finland's now-realized goal was to get 1 percent of its citizens to participate in Elements of AI. Other governments have also given their support to the course. For instance, Germany's Federal Minister for Economic Affairs and Energy Peter Altmeier has encouraged citizens to take part in the course to help Germany gain a competitive advantage in AI. Sweden's Minister for Energy and Minister for Digital Development Anders Ygeman has said that Sweden aims to teach 1 percent of its population the basics of AI like Finland has. == Participants == Elements of AI had enrolled more than 1 million students from more than 110 countries by May 2023. A quarter of the course's participants are aged 45 and over, and some 40 percent are women. Among Nordic participants, the share of women is nearly 60 percent. In September 2022, the course was available in Finnish, Swedish, Estonian, English, German, Latvian, Norwegian, French, Belgian, Czech, Greek, Slovakian, Slovenian, Latvian, Lithuanian, Portuguese, Spanish, Irish, Icelandic, Maltese, Croatian, Romanian, Italian, Dutch, Polish, and Danish.
Inductive probability
Inductive probability attempts to give the probability of future events based on past events. It is the basis for inductive reasoning, and gives the mathematical basis for learning and the perception of patterns. It is a source of knowledge about the world. There are three sources of knowledge: inference, communication, and deduction. Communication relays information found using other methods. Deduction establishes new facts based on existing facts. Inference establishes new facts from data. Its basis is Bayes' theorem. Information describing the world is written in a language. For example, a simple mathematical language of propositions may be chosen. Sentences may be written down in this language as strings of characters. But in the computer it is possible to encode these sentences as strings of bits (1s and 0s). Then the language may be encoded so that the most commonly used sentences are the shortest. This internal language implicitly represents probabilities of statements. Occam's razor says the "simplest theory, consistent with the data is most likely to be correct". The "simplest theory" is interpreted as the representation of the theory written in this internal language. The theory with the shortest encoding in this internal language is most likely to be correct. == History == Probability and statistics was focused on probability distributions and tests of significance. Probability was formal, well defined, but limited in scope. In particular its application was limited to situations that could be defined as an experiment or trial, with a well defined population. Bayes's theorem is named after Rev. Thomas Bayes 1701–1761. Bayesian inference broadened the application of probability to many situations where a population was not well defined. But Bayes' theorem always depended on prior probabilities, to generate new probabilities. It was unclear where these prior probabilities should come from. Ray Solomonoff developed algorithmic probability which gave an explanation for what randomness is and how patterns in the data may be represented by computer programs, that give shorter representations of the data circa 1964. Chris Wallace and D. M. Boulton developed minimum message length circa 1968. Later Jorma Rissanen developed the minimum description length circa 1978. These methods allow information theory to be related to probability, in a way that can be compared to the application of Bayes' theorem, but which give a source and explanation for the role of prior probabilities. Marcus Hutter combined decision theory with the work of Ray Solomonoff and Andrey Kolmogorov to give a theory for the Pareto optimal behavior for an Intelligent agent, circa 1998. === Minimum description/message length === The program with the shortest length that matches the data is the most likely to predict future data. This is the thesis behind the minimum message length and minimum description length methods. At first sight Bayes' theorem appears different from the minimimum message/description length principle. At closer inspection it turns out to be the same. Bayes' theorem is about conditional probabilities, and states the probability that event B happens if firstly event A happens: P ( A ∧ B ) = P ( B ) ⋅ P ( A | B ) = P ( A ) ⋅ P ( B | A ) {\displaystyle P(A\land B)=P(B)\cdot P(A|B)=P(A)\cdot P(B|A)} becomes in terms of message length L, L ( A ∧ B ) = L ( B ) + L ( A | B ) = L ( A ) + L ( B | A ) . {\displaystyle L(A\land B)=L(B)+L(A|B)=L(A)+L(B|A).} This means that if all the information is given describing an event then the length of the information may be used to give the raw probability of the event. So if the information describing the occurrence of A is given, along with the information describing B given A, then all the information describing A and B has been given. ==== Overfitting ==== Overfitting occurs when the model matches the random noise and not the pattern in the data. For example, take the situation where a curve is fitted to a set of points. If a polynomial with many terms is fitted then it can more closely represent the data. Then the fit will be better, and the information needed to describe the deviations from the fitted curve will be smaller. Smaller information length means higher probability. However, the information needed to describe the curve must also be considered. The total information for a curve with many terms may be greater than for a curve with fewer terms, that has not as good a fit, but needs less information to describe the polynomial. === Inference based on program complexity === Solomonoff's theory of inductive inference is also inductive inference. A bit string x is observed. Then consider all programs that generate strings starting with x. Cast in the form of inductive inference, the programs are theories that imply the observation of the bit string x. The method used here to give probabilities for inductive inference is based on Solomonoff's theory of inductive inference. ==== Detecting patterns in the data ==== If all the bits are 1, then people infer that there is a bias in the coin and that it is more likely also that the next bit is 1 also. This is described as learning from, or detecting a pattern in the data. Such a pattern may be represented by a computer program. A short computer program may be written that produces a series of bits which are all 1. If the length of the program K is L ( K ) {\displaystyle L(K)} bits then its prior probability is, P ( K ) = 2 − L ( K ) {\displaystyle P(K)=2^{-L(K)}} The length of the shortest program that represents the string of bits is called the Kolmogorov complexity. Kolmogorov complexity is not computable. This is related to the halting problem. When searching for the shortest program some programs may go into an infinite loop. ==== Considering all theories ==== The Greek philosopher Epicurus is quoted as saying "If more than one theory is consistent with the observations, keep all theories". As in a crime novel all theories must be considered in determining the likely murderer, so with inductive probability all programs must be considered in determining the likely future bits arising from the stream of bits. Programs that are already longer than n have no predictive power. The raw (or prior) probability that the pattern of bits is random (has no pattern) is 2 − n {\displaystyle 2^{-n}} . Each program that produces the sequence of bits, but is shorter than the n is a theory/pattern about the bits with a probability of 2 − k {\displaystyle 2^{-k}} where k is the length of the program. The probability of receiving a sequence of bits y after receiving a series of bits x is then the conditional probability of receiving y given x, which is the probability of x with y appended, divided by the probability of x. ==== Universal priors ==== The programming language affects the predictions of the next bit in the string. The language acts as a prior probability. This is particularly a problem where the programming language codes for numbers and other data types. Intuitively we think that 0 and 1 are simple numbers, and that prime numbers are somehow more complex than numbers that may be composite. Using the Kolmogorov complexity gives an unbiased estimate (a universal prior) of the prior probability of a number. As a thought experiment an intelligent agent may be fitted with a data input device giving a series of numbers, after applying some transformation function to the raw numbers. Another agent might have the same input device with a different transformation function. The agents do not see or know about these transformation functions. Then there appears no rational basis for preferring one function over another. A universal prior insures that although two agents may have different initial probability distributions for the data input, the difference will be bounded by a constant. So universal priors do not eliminate an initial bias, but they reduce and limit it. Whenever we describe an event in a language, either using a natural language or other, the language has encoded in it our prior expectations. So some reliance on prior probabilities are inevitable. A problem arises where an intelligent agent's prior expectations interact with the environment to form a self reinforcing feed back loop. This is the problem of bias or prejudice. Universal priors reduce but do not eliminate this problem. === Universal artificial intelligence === The theory of universal artificial intelligence applies decision theory to inductive probabilities. The theory shows how the best actions to optimize a reward function may be chosen. The result is a theoretical model of intelligence. It is a fundamental theory of intelligence, which optimizes the agents behavior in, Exploring the environment; performing actions to get responses that broaden the agents knowledge. Competing or co-operating with another agent; games. Balancing short and long term rewards. In general no agent will always provi
Matrix regularization
In the field of statistical learning theory, matrix regularization generalizes notions of vector regularization to cases where the object to be learned is a matrix. The purpose of regularization is to enforce conditions, for example sparsity or smoothness, that can produce stable predictive functions. For example, in the more common vector framework, Tikhonov regularization optimizes over min x ‖ A x − y ‖ 2 + λ ‖ x ‖ 2 {\displaystyle \min _{x}\left\|Ax-y\right\|^{2}+\lambda \left\|x\right\|^{2}} to find a vector x {\displaystyle x} that is a stable solution to the regression problem. When the system is described by a matrix rather than a vector, this problem can be written as min X ‖ A X − Y ‖ 2 + λ ‖ X ‖ 2 , {\displaystyle \min _{X}\left\|AX-Y\right\|^{2}+\lambda \left\|X\right\|^{2},} where the vector norm enforcing a regularization penalty on x {\displaystyle x} has been extended to a matrix norm on X {\displaystyle X} . Matrix regularization has applications in matrix completion, multivariate regression, and multi-task learning. Ideas of feature and group selection can also be extended to matrices, and these can be generalized to the nonparametric case of multiple kernel learning. == Basic definition == Consider a matrix W {\displaystyle W} to be learned from a set of examples, S = ( X i t , y i t ) {\displaystyle S=(X_{i}^{t},y_{i}^{t})} , where i {\displaystyle i} goes from 1 {\displaystyle 1} to n {\displaystyle n} , and t {\displaystyle t} goes from 1 {\displaystyle 1} to T {\displaystyle T} . Let each input matrix X i {\displaystyle X_{i}} be ∈ R D T {\displaystyle \in \mathbb {R} ^{DT}} , and let W {\displaystyle W} be of size D × T {\displaystyle D\times T} . A general model for the output y {\displaystyle y} can be posed as y i t = ⟨ W , X i t ⟩ F , {\displaystyle y_{i}^{t}=\left\langle W,X_{i}^{t}\right\rangle _{F},} where the inner product is the Frobenius inner product. For different applications the matrices X i {\displaystyle X_{i}} will have different forms, but for each of these the optimization problem to infer W {\displaystyle W} can be written as min W ∈ H E ( W ) + R ( W ) , {\displaystyle \min _{W\in {\mathcal {H}}}E(W)+R(W),} where E {\displaystyle E} defines the empirical error for a given W {\displaystyle W} , and R ( W ) {\displaystyle R(W)} is a matrix regularization penalty. The function R ( W ) {\displaystyle R(W)} is typically chosen to be convex and is often selected to enforce sparsity (using ℓ 1 {\displaystyle \ell ^{1}} -norms) and/or smoothness (using ℓ 2 {\displaystyle \ell ^{2}} -norms). Finally, W {\displaystyle W} is in the space of matrices H {\displaystyle {\mathcal {H}}} with Frobenius inner product ⟨ … ⟩ F {\displaystyle \langle \dots \rangle _{F}} . == General applications == === Matrix completion === In the problem of matrix completion, the matrix X i t {\displaystyle X_{i}^{t}} takes the form X i t = e t ⊗ e i ′ , {\displaystyle X_{i}^{t}=e_{t}\otimes e_{i}',} where ( e t ) t {\displaystyle (e_{t})_{t}} and ( e i ′ ) i {\displaystyle (e_{i}')_{i}} are the canonical basis in R T {\displaystyle \mathbb {R} ^{T}} and R D {\displaystyle \mathbb {R} ^{D}} . In this case the role of the Frobenius inner product is to select individual elements w i t {\displaystyle w_{i}^{t}} from the matrix W {\displaystyle W} . Thus, the output y {\displaystyle y} is a sampling of entries from the matrix W {\displaystyle W} . The problem of reconstructing W {\displaystyle W} from a small set of sampled entries is possible only under certain restrictions on the matrix, and these restrictions can be enforced by a regularization function. For example, it might be assumed that W {\displaystyle W} is low-rank, in which case the regularization penalty can take the form of a nuclear norm. R ( W ) = λ ‖ W ‖ ∗ = λ ∑ i | σ i | , {\displaystyle R(W)=\lambda \left\|W\right\|_{}=\lambda \sum _{i}\left|\sigma _{i}\right|,} where σ i {\displaystyle \sigma _{i}} , with i {\displaystyle i} from 1 {\displaystyle 1} to min D , T {\displaystyle \min D,T} , are the singular values of W {\displaystyle W} . === Multivariate regression === Models used in multivariate regression are parameterized by a matrix of coefficients. In the Frobenius inner product above, each matrix X {\displaystyle X} is X i t = e t ⊗ x i {\displaystyle X_{i}^{t}=e_{t}\otimes x_{i}} such that the output of the inner product is the dot product of one row of the input with one column of the coefficient matrix. The familiar form of such models is Y = X W + b {\displaystyle Y=XW+b} Many of the vector norms used in single variable regression can be extended to the multivariate case. One example is the squared Frobenius norm, which can be viewed as an ℓ 2 {\displaystyle \ell ^{2}} -norm acting either entrywise, or on the singular values of the matrix: R ( W ) = λ ‖ W ‖ F 2 = λ ∑ i ∑ j | w i j | 2 = λ Tr ( W ∗ W ) = λ ∑ i σ i 2 . {\displaystyle R(W)=\lambda \left\|W\right\|_{F}^{2}=\lambda \sum _{i}\sum _{j}\left|w_{ij}\right|^{2}=\lambda \operatorname {Tr} \left(W^{}W\right)=\lambda \sum _{i}\sigma _{i}^{2}.} In the multivariate case the effect of regularizing with the Frobenius norm is the same as the vector case; very complex models will have larger norms, and, thus, will be penalized more. === Multi-task learning === The setup for multi-task learning is almost the same as the setup for multivariate regression. The primary difference is that the input variables are also indexed by task (columns of Y {\displaystyle Y} ). The representation with the Frobenius inner product is then X i t = e t ⊗ x i t . {\displaystyle X_{i}^{t}=e_{t}\otimes x_{i}^{t}.} The role of matrix regularization in this setting can be the same as in multivariate regression, but matrix norms can also be used to couple learning problems across tasks. In particular, note that for the optimization problem min W ‖ X W − Y ‖ 2 2 + λ ‖ W ‖ 2 2 {\displaystyle \min _{W}\left\|XW-Y\right\|_{2}^{2}+\lambda \left\|W\right\|_{2}^{2}} the solutions corresponding to each column of Y {\displaystyle Y} are decoupled. That is, the same solution can be found by solving the joint problem, or by solving an isolated regression problem for each column. The problems can be coupled by adding an additional regularization penalty on the covariance of solutions min W , Ω ‖ X W − Y ‖ 2 2 + λ 1 ‖ W ‖ 2 2 + λ 2 Tr ( W T Ω − 1 W ) {\displaystyle \min _{W,\Omega }\left\|XW-Y\right\|_{2}^{2}+\lambda _{1}\left\|W\right\|_{2}^{2}+\lambda _{2}\operatorname {Tr} \left(W^{T}\Omega ^{-1}W\right)} where Ω {\displaystyle \Omega } models the relationship between tasks. This scheme can be used to both enforce similarity of solutions across tasks, and to learn the specific structure of task similarity by alternating between optimizations of W {\displaystyle W} and Ω {\displaystyle \Omega } . When the relationship between tasks is known to lie on a graph, the Laplacian matrix of the graph can be used to couple the learning problems. == Spectral regularization == Regularization by spectral filtering has been used to find stable solutions to problems such as those discussed above by addressing ill-posed matrix inversions (see for example Filter function for Tikhonov regularization). In many cases the regularization function acts on the input (or kernel) to ensure a bounded inverse by eliminating small singular values, but it can also be useful to have spectral norms that act on the matrix that is to be learned. There are a number of matrix norms that act on the singular values of the matrix. Frequently used examples include the Schatten p-norms, with p = 1 or 2. For example, matrix regularization with a Schatten 1-norm, also called the nuclear norm, can be used to enforce sparsity in the spectrum of a matrix. This has been used in the context of matrix completion when the matrix in question is believed to have a restricted rank. In this case the optimization problem becomes: min ‖ W ‖ ∗ subject to W i , j = Y i j . {\displaystyle \min \left\|W\right\|_{}~~{\text{ subject to }}~~W_{i,j}=Y_{ij}.} Spectral Regularization is also used to enforce a reduced rank coefficient matrix in multivariate regression. In this setting, a reduced rank coefficient matrix can be found by keeping just the top n {\displaystyle n} singular values, but this can be extended to keep any reduced set of singular values and vectors. == Structured sparsity == Sparse optimization has become the focus of much research interest as a way to find solutions that depend on a small number of variables (see e.g. the Lasso method). In principle, entry-wise sparsity can be enforced by penalizing the entry-wise ℓ 0 {\displaystyle \ell ^{0}} -norm of the matrix, but the ℓ 0 {\displaystyle \ell ^{0}} -norm is not convex. In practice this can be implemented by convex relaxation to the ℓ 1 {\displaystyle \ell ^{1}} -norm. While entry-wise regularization with an ℓ 1 {\displaystyle \ell ^{1}} -norm will find solutions with a small number of nonzero elements, applying an ℓ 1 {
Engineering Historical Memory
Engineering Historical Memory (EHM) is an online database in the digital humanities, serving as an open-access research tool for primary historical materials focused on 11th to 15th century Afro-Eurasia. It adopts computational methods to make historical documents machine-understandable. EHM parses traditional artifacts such as historical maps, travel accounts, chronicles and codices into computer-readable formats, and links them to secondary multi-media references, a process referred to as the "automatic narrative generation". This approach generates cultural narratives and facilitates interaction with the historical artifacts, making them accessible to audiences from various backgrounds. == History == EHM was first theorised in 2007 by researcher Andrea Nanetti when he was a visiting scholar at Princeton University, and the preliminary test results were published between 2008 and 2011. In 2013, the EHM research team was set up in Singapore following Nanetti's professorship at Nanyang Technological University (NTU). Two years later, after receiving several Microsoft research grants, EHM went live on Microsoft Azure. In 2018, the College of Humanities, Arts and Social Sciences (CoHASS) at NTU Singapore formed the Digital Humanities Research Cluster, as part of which, EHM has been an ongoing interdisciplinary research project led by Nanetti. Partnering with international educational and cultural institutions such as Ca' Foscari University of Venice, University of Florence, Taylor & Francis Group, Delft University of Technology (TUDelft), and SenticNet, EHM has been supported by over 130 scholars and engineers. == Applications == Primary historical materials on EHM are curated into several categories, including maps, travel accounts, chronicles, codices, sites, archival documents, and paintings, such as the Morosini Codex (listed under Chronicles) and Pope Gregory X's Privilege for the Holy Monastery of St Catherine of Sinai (listed under Archival Documents). EHM has been adopted by cultural organisations as an exhibition and research tool in the digital humanities field. An example is the publication of a digital interactive edition of Fra Mauro's Map of the World on EHM, a collaboration project between NTU Singapore and the Biblioteca Nazionale Marciana of Venice. The digitisation process of the map on EHM involved transcribing and geo-referencing the textual content in the 15th-century map, followed by creating semantic annotations to connect the map's content with related secondary data sources. The e-map was subsequently adopted and launched online by Museo Galileo in March 2022 and incorporated into the virtual exhibition "Venezia and Suzhou: Water Cities along the Silk Roads" (online, September-December 2022). In 2024, the Fra Mauro's Map of the World application on EHM was awarded the Digital Humanities and Multimedia Studies Prize (DHMS) by the Medieval Academy of America. Image-Based Video Search Engine is another experimental project under the EHM scope led by the research teams at Delft University of Technology (TUDelft) and NTU Singapore. This ongoing project aims to improve the efficiency of retrieving targeted objects from audio-visuals. == Awards == In 2021, EHM won the GLAMi Awards (MuseWeb Conference - Galleries, Libraries, Archives, and Museums Innovation awards) in the "Resources for Scholars and Researchers" category. In the same year, EHM was a Falling Walls finalist for Science Breakthrough of the Year in the category Social Sciences and Humanities after nominated by the School of Advanced Study at the University of London. In April 2022, the Italian National Commission for UNESCO has selected and sent the EHM project to the organisers of the "Jikji Memory of the World" Award for final evaluation. In January 2024, the Medieval Academy of America announced its 2024 Digital Humanities and Multimedia Studies Prize (DHMS) goes to the Fra Mauro's Map of the World application on EHM.
Inception score
The Inception Score (IS) is an algorithm used to assess the quality of images created by a generative image model such as a generative adversarial network (GAN). The score is calculated based on the output of a separate, pretrained Inception v3 image classification model applied to a sample of (typically around 30,000) images generated by the generative model. The Inception Score is maximized when the following conditions are true: The entropy of the distribution of labels predicted by the Inceptionv3 model for the generated images is minimized. In other words, the classification model confidently predicts a single label for each image. Intuitively, this corresponds to the desideratum of generated images being "sharp" or "distinct". The predictions of the classification model are evenly distributed across all possible labels. This corresponds to the desideratum that the output of the generative model is "diverse". It has been somewhat superseded by the related Fréchet inception distance. While the Inception Score only evaluates the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images ("ground truth"). == Definition == Let there be two spaces, the space of images Ω X {\displaystyle \Omega _{X}} and the space of labels Ω Y {\displaystyle \Omega _{Y}} . The space of labels is finite. Let p g e n {\displaystyle p_{gen}} be a probability distribution over Ω X {\displaystyle \Omega _{X}} that we wish to judge. Let a discriminator be a function of type p d i s : Ω X → M ( Ω Y ) {\displaystyle p_{dis}:\Omega _{X}\to M(\Omega _{Y})} where M ( Ω Y ) {\displaystyle M(\Omega _{Y})} is the set of all probability distributions on Ω Y {\displaystyle \Omega _{Y}} . For any image x {\displaystyle x} , and any label y {\displaystyle y} , let p d i s ( y | x ) {\displaystyle p_{dis}(y|x)} be the probability that image x {\displaystyle x} has label y {\displaystyle y} , according to the discriminator. It is usually implemented as an Inception-v3 network trained on ImageNet. The Inception Score of p g e n {\displaystyle p_{gen}} relative to p d i s {\displaystyle p_{dis}} is I S ( p g e n , p d i s ) := exp ( E x ∼ p g e n [ D K L ( p d i s ( ⋅ | x ) ‖ ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x ) ] ) {\displaystyle IS(p_{gen},p_{dis}):=\exp \left(\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\int p_{dis}(\cdot |x)p_{gen}(x)dx\right)\right]\right)} Equivalent rewrites include ln I S ( p g e n , p d i s ) := E x ∼ p g e n [ D K L ( p d i s ( ⋅ | x ) ‖ E x ∼ p g e n [ p d i s ( ⋅ | x ) ] ) ] {\displaystyle \ln IS(p_{gen},p_{dis}):=\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]\right)\right]} ln I S ( p g e n , p d i s ) := H [ E x ∼ p g e n [ p d i s ( ⋅ | x ) ] ] − E x ∼ p g e n [ H [ p d i s ( ⋅ | x ) ] ] {\displaystyle \ln IS(p_{gen},p_{dis}):=H[\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]]-\mathbb {E} _{x\sim p_{gen}}[H[p_{dis}(\cdot |x)]]} ln I S {\displaystyle \ln IS} is nonnegative by Jensen's inequality. Pseudocode:INPUT discriminator p d i s {\displaystyle p_{dis}} . INPUT generator g {\displaystyle g} . Sample images x i {\displaystyle x_{i}} from generator. Compute p d i s ( ⋅ | x i ) {\displaystyle p_{dis}(\cdot |x_{i})} , the probability distribution over labels conditional on image x i {\displaystyle x_{i}} . Sum up the results to obtain p ^ {\displaystyle {\hat {p}}} , an empirical estimate of ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x {\displaystyle \int p_{dis}(\cdot |x)p_{gen}(x)dx} . Sample more images x i {\displaystyle x_{i}} from generator, and for each, compute D K L ( p d i s ( ⋅ | x i ) ‖ p ^ ) {\displaystyle D_{KL}\left(p_{dis}(\cdot |x_{i})\|{\hat {p}}\right)} . Average the results, and take its exponential. RETURN the result. === Interpretation === A higher inception score is interpreted as "better", as it means that p g e n {\displaystyle p_{gen}} is a "sharp and distinct" collection of pictures. ln I S ( p g e n , p d i s ) ∈ [ 0 , ln N ] {\displaystyle \ln IS(p_{gen},p_{dis})\in [0,\ln N]} , where N {\displaystyle N} is the total number of possible labels. ln I S ( p g e n , p d i s ) = 0 {\displaystyle \ln IS(p_{gen},p_{dis})=0} iff for almost all x ∼ p g e n {\displaystyle x\sim p_{gen}} p d i s ( ⋅ | x ) = ∫ p d i s ( ⋅ | x ) p g e n ( x ) d x {\displaystyle p_{dis}(\cdot |x)=\int p_{dis}(\cdot |x)p_{gen}(x)dx} That means p g e n {\displaystyle p_{gen}} is completely "indistinct". That is, for any image x {\displaystyle x} sampled from p g e n {\displaystyle p_{gen}} , discriminator returns exactly the same label predictions p d i s ( ⋅ | x ) {\displaystyle p_{dis}(\cdot |x)} . The highest inception score N {\displaystyle N} is achieved if and only if the two conditions are both true: For almost all x ∼ p g e n {\displaystyle x\sim p_{gen}} , the distribution p d i s ( y | x ) {\displaystyle p_{dis}(y|x)} is concentrated on one label. That is, H y [ p d i s ( y | x ) ] = 0 {\displaystyle H_{y}[p_{dis}(y|x)]=0} . That is, every image sampled from p g e n {\displaystyle p_{gen}} is exactly classified by the discriminator. For every label y {\displaystyle y} , the proportion of generated images labelled as y {\displaystyle y} is exactly E x ∼ p g e n [ p d i s ( y | x ) ] = 1 N {\displaystyle \mathbb {E} _{x\sim p_{gen}}[p_{dis}(y|x)]={\frac {1}{N}}} . That is, the generated images are equally distributed over all labels.