Clustering illusion

Clustering illusion

The clustering illusion is the tendency to erroneously consider the inevitable "streaks" or "clusters" arising in small samples from random distributions to be non-random. The illusion is caused by a human tendency to underpredict the amount of variability likely to appear in a small sample of random or pseudorandom data. Thomas Gilovich, an early author on the subject, argued that the effect occurs for different types of random dispersions. Some might perceive patterns in stock market price fluctuations over time, or clusters in two-dimensional data such as the locations of impact of World War II V-1 flying bombs on maps of London. Although Londoners developed specific theories about the pattern of impacts within London, a statistical analysis by R. D. Clarke originally published in 1946 showed that the impacts of V-2 rockets on London were a close fit to a random distribution. == Similar biases == Using this cognitive bias in causal reasoning may result in the Texas sharpshooter fallacy, in which differences in data are ignored and similarities are overemphasized. More general forms of erroneous pattern recognition are pareidolia and apophenia. Related biases are the illusion of control which the clustering illusion could contribute to, and insensitivity to sample size in which people don't expect greater variation in smaller samples. A different cognitive bias involving misunderstanding of chance streams is the gambler's fallacy. == Possible causes == Daniel Kahneman and Amos Tversky explained this kind of misprediction as being caused by the representativeness heuristic (which itself they also first proposed).

Drop shadow

In graphic design and computer graphics, a drop shadow is a visual effect consisting of a drawing element which looks like the shadow of an object, giving the impression that the object is raised above the objects behind it. The drop shadow is often used for elements of a graphical user interface such as windows or menus, and for simple text. The text label for icons on desktops in many desktop environments has a drop shadow, as this effect effectively distinguishes the text from any colored background it may be in front of. A simple way of drawing a drop shadow of a rectangular object is to draw a gray or black area underneath and offset from the object. In general, a drop shadow is a copy in black or gray of the object, drawn in a slightly different position. Realism may be increased by: Darkening the colors of the pixels where the shadow casts instead of making them gray. This can be done with alpha blending the shadow with the area it is cast on. Softening the edges of the shadow. This can be done by adding Gaussian blur to the shadow's alpha channel before blending. Inset drop shadows are a type which draws the shadows inside the element. This allows the interface element to appear as if it is sunken into the interface. == Photo editing == In photo editing or photography post-production, a drop shadow may be added right beneath a model or product in the image. It is used to create contrast between the background and the subject. To add a drop shadow, retouchers use graphic editing tools like Adobe Photoshop. Drop shadows are often used as a visual effect in e-commerce. This is done to improve the presentation of product images and create depth in the image. == Use == Generally, window managers which are capable of compositing allow drop shadow effects, whereas incapable window managers do not. In some operating systems like macOS, drop shadow is used to differentiate between active and inactive windows. Websites are able to use drop shadow effects through the CSS properties box-shadow, text-shadow, and drop-shadow() filter function in filter. The first two are used for elements and text respectively, while the filter applies to the element's content, letting it support oddly shaped elements or transparent images.

Corinna Cortes

Corinna Cortes (born 31 March 1961) is a Danish computer scientist known for her contributions to machine learning. She is a Vice President at Google Research in New York City. Cortes is an ACM Fellow and a recipient of the Paris Kanellakis Award for her work on theoretical foundations of support vector machines. == Early life and education == Corinna Cortes was born in 1961 in Denmark. Cortes received her Master of Science degree in physics from University of Copenhagen in 1989. She received her PhD in computer science from the University of Rochester in 1993 for research supervised by Randal C. Nelson. == Career and research == Cortes joined AT&T Bell Labs as a researcher in 1993. Since 2003, she has served as Vice President of Google Research, New York City, and since 2011, as adjunct professor at the UCPH Department of Computer Science. She is serves as an editorial board member of the journal Machine Learning. Cortes' research covers a wide range of topics in machine learning, including support vector machines (SVM) and data mining. SVM is one of the most frequently used algorithms in machine learning, which is used in many practical applications, including medical diagnosis and weather forecasting. At AT&T, Cortes was a contributor to the design of Hancock programming language. === Awards and honours === In 2008, she jointly with Vladimir Vapnik received the Paris Kanellakis Award for the development of a highly effective algorithm for supervised learning known as support vector machines (SVM). She was named an ACM Fellow in 2023 for theoretical and practical contributions to machine learning, industrial leadership and service to the field. == Personal life == Corinna has two children and is also a competitive runner.

Magnetic ink character recognition

Magnetic ink character recognition code, known in short as MICR code, is a character recognition technology used mainly by the banking industry to streamline the processing and clearance of cheques and other documents. MICR encoding, called the MICR line, is at the bottom of cheques and other vouchers and typically includes the document-type indicator, bank code, bank account number, cheque number, cheque amount (usually added after a cheque is presented for payment), and a control indicator. The format for the bank code and bank account number is country-specific. The technology allows MICR readers to scan and read the information directly into a data-collection device. Unlike barcode and similar technologies, MICR characters can be read easily by humans. MICR encoded documents can be processed much faster and more accurately than conventional OCR encoded documents. == Pre-Unicode standard representation == The ISO standard ISO 2033:1983, and the corresponding Japanese Industrial Standard JIS X 9010:1984 (originally JIS C 6229–1984), define character encodings for OCR-A, OCR-B and E-13B. == International spread == There are two major MICR fonts in use: E-13B and CMC-7. There is no particular international agreement on which countries use which font. In practice, this does not create particular problems as cheques and other vouchers do not usually flow out of a particular jurisdiction. The E-13B font has been adopted as an international standard in ISO 1004-1:2013, and is the standard in Australia, Canada, the United Kingdom, the United States, as well as Central America and much of Asia, besides other countries. The CMC-7 font has been adopted as an international standard in ISO 1004-2:2013, and is widely used in Europe, including France and Italy, Mexico, and South America, including Argentina, Brazil, Chile, besides other countries. Israel is the only country that can use both fonts simultaneously, though the practice makes the system significantly less efficient. This situation is the product of the Israelis adopting CMC-7, while the Palestinians opted for E-13B. == Fonts == === E-13B === E-13B is a 14-character set, comprising the 10 decimal digits, and the following symbols: ⑆ (transit: used to delimit a bank code); ⑈ (on-us: used to delimit a customer account number); ⑇ (amount: used to delimit a transaction amount); ⑉ (dash: used to delimit parts of numbers—e.g., routing numbers or account numbers). In the check printing and banking industries the E-13B MICR line is also commonly referred to as the TOAD line. This reference comes from the 4 characters: Transit, On-us, Amount, and Dash. Compared to CMC-7, some pairs of E-13B characters (notably 2 and 5) can produce relatively similar results when magnetically scanned; however, as a fallback if magnetic reading fails, E-13B also performs well under optical character recognition. The E-13B repertoire can be represented in Unicode (see below). The official Unicode names contain misnomers. For example, the ⑈ on-us symbol is official titled "OCR Dash". Prior to Unicode, it could be encoded according to ISO 2033:1983, which encodes digits in their usual ASCII locations, transit as 0x3A, on-us as 0x3C, amount as 0x3B, and dash as 0x3D. For EBCDIC, IBM code page 1001 encodes digits in their usual EBCDIC locations, transit as 0xDB, on-us as 0xEB, amount as 0xCB, and dash as 0xFB. IBM code page 1032 extends code page 1001 by adding alternative encodings for transit at 0x5C, 0x7A and 0xC1, on-us at 0x4C, 0x61 and 0xC3, amount at 0x5B, 0x5E and 0xC2 and dash at 0x60, 0x7E and 0xC4, in addition to a zero-width space at 0x5A. These alternative representations were added for interoperability with Siemens and Océ printers. === CMC-7 === CMC-7 includes 10 numeric digits, 26 capital letters, and 5 control characters: S I (internal), S II (terminator), S III (amount), S IV (an unused character), and S V (routing). CMC-7 has a barcode format, with every character having two distinct large gaps in different places, as well as distinct patterns in between, to minimize any chance for character confusion while reading magnetically; however, these bars are too close and narrow to be reliably recognised at a typical scan resolution if falling back to optical scanning. CMC-7 can also produce superficially successful, but incorrect, scans of upside-down MICR lines. Unicode does not include support for the CMC-7 control symbols. IBM code page 1033 encodes: Digits and capitals in their usual EBCDIC locations S I (internal) as 0x5E, 0x61 or 0xCB; S II (terminator) as 0x4C, 0x5B or 0xEB; S III (amount) as 0x60, 0x7E or 0xFB; S IV as 0x50, 0x7A or 0xDB; S V (routing) as 0x5C, 0x6E or 0xBB. == MICR reader == MICR characters are printed on documents in one of the two MICR fonts, using magnetizable (commonly known as magnetic) ink or toner, usually containing iron oxide. In scanning, the document is passed through a MICR reader, which performs two functions: magnetization of the ink, and detection of the characters. The characters are read by a MICR reader head, a device similar to the playback head of a tape recorder. As each character passes over the head, it produces a unique waveform that can be easily identified by the system. MICR readers are the primary tool for cheque sorting and are used across the cheque distribution network at multiple stages. For example, a merchant will use a MICR reader to sort cheques by bank and send the sorted cheques to a clearing house for redistribution to those banks. Upon receipt, the banks perform another MICR sort to determine which customer's account is charged and to which branch the cheque should be sent on its way back to the customer. However, many banks no longer offer this last step of returning the cheque to the customer. Instead, cheques are scanned and stored digitally. Sorting of cheques is done as per the geographical coverage of banks in a nation. == Unicode == OCR and MICR characters have been included in the Unicode Standard since at least version 1.1 (June 1993). Since the Unicode Character Database only tracks characters starting with version 1.1, they may also have been present in Unicode 1.0 or 1.0.1. The Unicode block that includes OCR and MICR characters is called Optical Character Recognition and covers U+2440–U+245F. Of the characters in this block, four are from the MICR E-13B font: U+2446 ⑆ OCR BRANCH BANK IDENTIFICATION U+2447 ⑇ OCR AMOUNT OF CHECK U+2448 ⑈ OCR DASH (corrected alias MICR ON US SYMBOL) U+2449 ⑉ OCR CUSTOMER ACCOUNT NUMBER (corrected alias MICR DASH SYMBOL) The names of the latter two characters were inadvertently switched when they were named in ISO/IEC 10646:1993, and they have been assigned accurate names as formal aliases. Per the Unicode Stability Policy, the existing names remain, allowing their use as stable identifiers. Additionally, all four characters have informative (non-formal) aliases in the Unicode charts: "transit", "amount", "on-us", and "dash" respectively. Prior to Unicode, these symbols had been encoded by the ISO-IR-98 encoding defined by ISO 2033:1983, in which they were simply named SYMBOL ONE through SYMBOL FOUR. They were encoded immediately following the digits, which were encoded at their ASCII locations. Although ISO 2033 also specifies encoding for OCR-A and OCR-B, its encoding for E-13B is known simply as ISO_2033-1983 by the IANA. == History == Before the mid-1940s, cheques were processed manually using the Sort-A-Matic or Top Tab Key method. The processing and cheque clearing was very time-consuming and was a significant cost in cheque clearance and bank operations. As the number of cheques increased, ways were sought for automating the process. Standards were developed to ensure uniformity in financial institutions. By the mid-1950s, the Stanford Research Institute and General Electric Computer Laboratory had developed the first automated system to process cheques using MICR. The same team also developed the E-13B MICR font. "E" refers to the font being the fifth considered, and "B" to the fact that it was the second version. The "13" refers to the 0.013-inch character grid. The trial of MICR E-13B font was shown to the American Bankers Association (ABA) in July 1956, which adopted it in 1958 as the MICR standard for negotiable documents in the United States. ABA adopted MICR as its standard because machines could read MICR accurately, and MICR could be printed using existing technology. In addition, MICR remained machine readable, even through overstamping, marking, mutilation and more. The first cheques using MICR were printed by the end of 1959. Although compliance with MICR standards was voluntary in the United States, it had been almost universally adopted in the United States by 1963. In 1963, ANSI adopted the ABA's E-13B font as the American standard for MICR printing, and E-13B was also standardized as ISO 1004:1995. Other countries set their own standards, though the MICR readers and m

Jürgen Schmidhuber

Jürgen Schmidhuber (born 17 January 1963) is a German computer scientist noted for his work in the field of artificial intelligence, specifically artificial neural networks. He has been described by media outlets as a leading pioneer of modern artificial intelligence. He is a scientific director of the Dalle Molle Institute for Artificial Intelligence Research in Switzerland. He is also director of the Artificial Intelligence Initiative and professor of the Computer Science program in the Computer, Electrical, and Mathematical Sciences and Engineering (CEMSE) division at the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. He is best known for his work on long short-term memory (LSTM), a type of neural network architecture which was the dominant technique for various natural language processing tasks in research and commercial applications in the 2010s. He also introduced principles of dynamic neural networks, meta-learning, generative adversarial networks and linear transformers, all of which are widespread in modern AI. == Career == Schmidhuber completed his undergraduate (1987) and PhD (1991) studies at the Technical University of Munich in Munich, Germany. His PhD advisors were Wilfried Brauer and Klaus Schulten. He taught there from 2004 until 2009. From 2009 to 2021, he was a professor of artificial intelligence at the Università della Svizzera Italiana in Lugano, Switzerland. He has served as the director of Dalle Molle Institute for Artificial Intelligence Research (IDSIA), a Swiss AI lab, since 1995. Since 2021, he has also been the director of the AI Initiative at the King Abdullah University of Science and Technology (KAUST). In 2014, Schmidhuber formed a company, NNAISENSE, to work on commercial applications of artificial intelligence in fields such as finance, heavy industry and self-driving cars. Sepp Hochreiter, Jaan Tallinn, and Marcus Hutter are advisers to the company. Sales were under US$11 million in 2016; however, Schmidhuber states that the current emphasis is on research and not revenue. NNAISENSE raised its first round of capital funding in January 2017. Schmidhuber's overall goal is to create an all-purpose AI by training a single AI in sequence on a variety of narrow tasks, but as of 2026 he has said that the focus of NNAISENSE has shifted from artificial general intelligence to asset management. == Research == In the 1980s, backpropagation did not work well for deep learning with long credit assignment paths in artificial neural networks. To overcome this problem, Schmidhuber (1991) proposed a hierarchy of recurrent neural networks (RNNs) pre-trained one level at a time by self-supervised learning. It uses predictive coding to learn internal representations at multiple self-organizing time scales, facilitating downstream deep learning. The RNN hierarchy can be collapsed into a single RNN, by distilling a higher level chunker network into a lower level automatizer network. In 1993, a chunker solved a deep learning task whose depth exceeded 1000. In 1991, Schmidhuber published adversarial neural networks that contest with each other in the form of a zero-sum game, where one network's gain is the other network's loss. The first network is a generative model that models a probability distribution over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. This was called "artificial curiosity". In 2014, this principle was used in the creation of the generative adversarial network, which Schmidhuber describes as a special case of artificial curiosity where the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set. Schmidhuber supervised the 1991 diploma thesis of his student Sepp Hochreiter which he considered "one of the most important documents in the history of machine learning". It studied the neural history compressor and analyzed and overcame the vanishing gradient problem. This led to the creation of long short-term memory (LSTM), a type of recurrent neural network. The name LSTM was introduced in a tech report in 1995, leading to the most cited LSTM publication, published in 1997 and co-authored by Hochreiter and Schmidhuber. The standard LSTM architecture was introduced in 2000 by Felix Gers, Schmidhuber, and Fred Cummins. Today's "vanilla LSTM" using backpropagation through time was published with his student Alex Graves in 2005, and its connectionist temporal classification (CTC) training algorithm in 2006. CTC was applied to end-to-end speech recognition with LSTM. In 2014, the state of the art was training “very deep neural network” with 20 to 30 layers. Stacking too many layers led to a steep reduction in training accuracy, known as the "degradation" problem. In May 2015, Rupesh Kumar Srivastava, Klaus Greff, and Schmidhuber used LSTM principles to create the highway network, a feedforward neural network with hundreds of layers, much deeper than previous networks. In Dec 2015, the residual neural network (ResNet) was published, which is a variant of the highway network. In 1992, Schmidhuber published fast weights programmer, an alternative to recurrent neural networks. It has a slow feedforward neural network that learns by gradient descent to control the fast weights of another neural network through outer products of self-generated activation patterns, and the fast weights network itself operates over inputs. This was later shown to be equivalent to the unnormalized linear transformer. In 2011, Schmidhuber's team at IDSIA with his postdoc Dan Ciresan also achieved dramatic speedups of convolutional neural networks (CNNs) using graphics processing units (GPUs), based on CNN designs introduced much earlier by Kunihiko Fukushima. An earlier CNN on GPU by Chellapilla et al. (2006) was 4 times faster than an equivalent implementation on CPU. The deep CNN of Dan Ciresan et al. (2011) at IDSIA was 60 times faster and achieved the first superhuman performance in a computer vision contest in August 2011. Between 15 May 2011 and 10 September 2012, these CNNs won four more image competitions and improved the state of the art on multiple image benchmarks. The approach has become central to the field of computer vision. == Credit disputes == Schmidhuber has controversially argued that he and other researchers have been denied adequate recognition for their contribution to the field of deep learning, in favour of Geoffrey Hinton, Yoshua Bengio and Yann LeCun, who shared the 2018 Turing Award for their work in deep learning. He wrote a "scathing" 2015 article arguing that Hinton, Bengio and LeCun "heavily cite each other" but "fail to credit the pioneers of the field". In a statement to the New York Times, Yann LeCun wrote that "Jürgen is manically obsessed with recognition and keeps claiming credit he doesn't deserve for many, many things... It causes him to systematically stand up at the end of every talk and claim credit for what was just presented, generally not in a justified manner." Schmidhuber replied that LeCun did this "without any justification, without providing a single example", and published details of numerous priority disputes with Hinton, Bengio and LeCun. The term "schmidhubered" has been jokingly used in the AI community to describe Schmidhuber's habit of publicly challenging the originality of other researchers' work, a practice seen by some in the AI community as a "rite of passage" for young researchers. Some suggest that Schmidhuber's significant accomplishments have been underappreciated due to his confrontational personality. == Recognition == Schmidhuber received the Helmholtz Award of the International Neural Network Society in 2013, and the Neural Networks Pioneer Award of the IEEE Computational Intelligence Society in 2016 for "pioneering contributions to deep learning and neural networks." He is a member of the European Academy of Sciences and Arts. He has been referred to as the "father of modern AI", the "father of generative AI", and the "father of deep learning". Schmidhuber himself, however, has called Alexey Grigorevich Ivakhnenko the "father of deep learning", and gives credit to many even earlier AI pioneers. The New York Times ran a profile under the headline "When A.I. Matures, It May Call Jürgen Schmidhuber 'Dad'", highlighting his early work on deep learning and his long‑term vision for self‑improving AI. == Views == Schmidhuber is a proponent of open source AI, and believes that they will become competitive against commercial closed-source AI. Since the 1970s, Schmidhuber wanted to create "intelligent machines that could learn and improve on their own and become smarter than him within his lifetime." He differentiates between two types of AIs: tool AI, such as those for improving healthcare, and autonomous AIs that set their own goals, perform their own research, and explore the universe. He has worked on both types for de

Machine-learned interatomic potential

Machine-learned interatomic potentials (MLIPs), or simply machine learning potentials (MLPs), are interatomic potentials constructed using machine learning. Beginning in the 1990s, researchers have employed such programs to construct interatomic potentials by mapping atomic structures to their potential energies. These potentials are referred to as MLIPs or MLPs. Such machine learning potentials promised to fill the gap between density functional theory, a highly accurate but computationally intensive modelling method, and empirically derived or intuitively-approximated potentials, which were far lighter computationally but substantially less accurate. Improvements in artificial intelligence technology heightened the accuracy of MLPs while lowering their computational cost, increasing the role of machine learning in fitting potentials. Machine learning potentials began by using neural networks to tackle low-dimensional systems. While promising, these models could not systematically account for interatomic energy interactions; they could be applied to small molecules in a vacuum, or molecules interacting with frozen surfaces, but not much else – and even in these applications, the models often relied on force fields or potentials derived empirically or with simulations. These models thus remained confined to academia. Modern neural networks construct highly accurate and computationally light potentials, as theoretical understanding of materials science was increasingly built into their architectures and preprocessing. Almost all are local, accounting for all interactions between an atom and its neighbor up to some cutoff radius. There exist some nonlocal models, but these have been experimental for almost a decade. For most systems, reasonable cutoff radii enable highly accurate results. Almost all neural networks intake atomic coordinates and output potential energies. For some, these atomic coordinates are converted into atom-centered symmetry functions. From this data, a separate atomic neural network is trained for each element; each atomic network is evaluated whenever that element occurs in the given structure, and then the results are pooled together at the end. This process – in particular, the atom-centered symmetry functions which convey translational, rotational, and permutational invariances – has greatly improved machine learning potentials by significantly constraining the neural network search space. Other models use a similar process but emphasize bonds over atoms, using pair symmetry functions and training one network per atom pair. Other models to learn their own descriptors rather than using predetermined symmetry-dictating functions. These models, called message-passing neural networks (MPNNs), are graph neural networks. Treating molecules as three-dimensional graphs (where atoms are nodes and bonds are edges), the model takes feature vectors describing the atoms as input, and iteratively updates these vectors as information about neighboring atoms is processed through message functions and convolutions. These feature vectors are then used to predict the final potentials. The flexibility of this method often results in stronger, more generalizable models. In 2017, the first-ever MPNN model (a deep tensor neural network) was used to calculate the properties of small organic molecules. == Gaussian Approximation Potential (GAP) == One popular class of machine-learned interatomic potential is the Gaussian Approximation Potential (GAP), which combines compact descriptors of local atomic environments with Gaussian process regression to machine learn the potential energy surface of a given system. To date, the GAP framework has been used to successfully develop a number of MLIPs for various systems, including for elemental systems such as carbon, silicon, phosphorus, and tungsten, as well as for multicomponent systems such as Ge2Sb2Te5 and austenitic stainless steel, Fe7Cr2Ni. == Equivariant graph neural networks == A significant limitation of early MPNNs was that they were not inherently equivariant to rotations and reflections of atomic structures — meaning predictions could change depending on how a molecule was oriented in space. Beginning around 2021, a new class of models addressed this by incorporating equivariance directly into the message-passing layers using spherical harmonics and irreducible representations. Notable examples include NequIP (2021), MACE (2022), and GemNet-OC (2022). These equivariant architectures proved substantially more data-efficient and accurate than their predecessors, and became the dominant paradigm for high-accuracy MLIPs. == Universal MLIPs and large-scale datasets == Early MLIPs were system-specific, trained on a few thousand structures of a single material. A major shift occurred with the creation of large, chemically diverse datasets enabling models that generalize across many elements, bonding environments, and application domains — so-called universal MLIPs. A key driver was the Open Catalyst Project (OC20, OC22), a collaboration between Meta AI (FAIR) and Carnegie Mellon University launched in 2020. OC20 comprises approximately 1.3 million DFT relaxations across 82 elements, designed to accelerate the discovery of catalysts for renewable energy applications. It was among the first datasets large enough to train GNNs that generalize across diverse chemical systems, and established a widely-used benchmark for the field. A subsequent dataset, Open Direct Air Capture (OpenDAC 2023 and OpenDAC 2025), applied the same approach to carbon capture, providing a large computational database of metal-organic frameworks and sorbent candidates evaluated for CO₂ capture, generated using nearly 400 million CPU hours of quantum chemistry calculations in collaboration with Georgia Tech. These datasets revealed a new challenge: the GNN architectures most effective for atomic simulations were memory-intensive, as they model higher-order interactions between triplets or quadruplets of atoms, making it difficult to scale model size. Graph Parallelism, introduced by Sriram et al. (ICLR 2022), addressed this by distributing a single input graph across multiple GPUs — a distinct strategy from data parallelism (which distributes training examples) or model parallelism (which distributes layers). This enabled training GNNs with hundreds of millions to billions of parameters for the first time. Building on these foundations, Meta FAIR released the Universal Model for Atoms (UMA) in 2025, trained on approximately 500 million unique 3D atomic structures spanning molecules, materials, and catalysts — the largest training run to date for an MLIP. UMA introduced a Mixture of Linear Experts (MoLE) architecture, enabling one model to learn from datasets generated by different DFT codes and settings without significant inference overhead. It matches or surpasses specialized models across catalysis, materials, and molecular benchmarks without task-specific fine-tuning, and has been described as marking a "pre/post-UMA" divide in the field. == Applications == Catalyst discovery: MLIPs have significantly accelerated the computational screening of heterogeneous catalysts by replacing expensive DFT relaxations with fast neural network surrogates. The Open Catalyst Project explicitly targets this application, aiming to identify new catalysts for green hydrogen production and other renewable energy reactions. Carbon capture: The OpenDAC project applies universal MLIPs to screening sorbent materials for direct air capture of CO₂, a key technology for climate change mitigation. AI-accelerated screening allows evaluation of orders of magnitude more candidate materials than traditional DFT workflows. Drug discovery and molecular design: MLIPs are increasingly used in pharmaceutical research to model molecular conformations and binding energies. The Open Molecules 2025 (OMol25) dataset, released by Meta FAIR in 2025, provides high-accuracy calculations for a large set of molecular systems to support this use case. Materials discovery: Universal MLIPs enable high-throughput screening of novel inorganic materials, including battery electrolytes, semiconductors, and superconductors, by rapidly estimating stability and properties across large chemical spaces.

HFST

Helsinki Finite-State Technology (HFST) is a computer programming library and set of utilities for natural language processing with finite-state automata and finite-state transducers. It is free and open-source software, released under a mix of the GNU General Public License version 3 (GPLv3) and the Apache License. == Features == The library functions as an interchanging interface to multiple backends, such as OpenFST, foma and SFST. The utilities comprise various compilers, such as hfst-twolc (a compiler for morphological two-level rules), hfst-lexc (a compiler for lexicon definitions) and hfst-regexp2fst (a regular expression compiler). Functions from Xerox's proprietary scripting language xfst is duplicated in hfst-xfst, and the pattern matching utility pmatch in hfst-pmatch, which goes beyond the finite-state formalism in having recursive transition networks (RTNs). The library and utilities are written in C++, with an interface to the library in Python and a utility for looking up results from transducers ported to Java and Python. Transducers in HFST may incorporate weights depending on the backend. For performing FST operations, this is currently only possible via the OpenFST backend. HFST provides two native backends, one designed for fast lookup (hfst-optimized-lookup), the other for format interchange. Both of them can be weighted. == Uses == HFST has been used for writing various linguistic tools, such as spell-checkers, hyphenators, and morphologies. Morphological dictionaries written in other formalisms have also been converted to HFST's formats.