AI Articles

AI Articles — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Biomedical data science

    Biomedical data science

    Biomedical data science is a multidisciplinary field which leverages large volumes of data to promote biomedical innovation and discovery. Biomedical data science draws from various fields including Biostatistics, Biomedical informatics, and machine learning, with the goal of understanding biological and medical data. It can be viewed as the study and application of data science to solve biomedical problems. Modern biomedical datasets often have specific features which make their analyses difficult, including: Large numbers of feature (sometimes billions), typically far larger than the number of samples (typically tens or hundreds) Noisy and missing data Privacy concerns (e.g., electronic health record confidentiality) Requirement of interpretability from decision makers and regulatory bodies Many biomedical data science projects apply machine learning to such datasets. These characteristics, while also present in many data science applications more generally, make biomedical data science a specific field. Examples of biomedical data science research include: Computational genomics Computational imaging Electronic health records data mining Biomedical network science Clinical Natural Language Processing (NLP) == Computational Imaging and Deep Learning == Computational imaging is a cornerstone of biomedical data science, focusing on the development of algorithms to enhance, analyze, and interpret medical imagery. In recent years, the field has been transformed by the integration of deep learning, particularly through the use of Convolutional Neural Networks. Deep learning started from researchers manually defining characteristics like edge detection or texture representation learning. In a more modern approach of computational imaging, models automatically learn a hierarchy of features directly from raw pixel data. This overlap between data science and deep learning is applied across several key tasks: Classification: Identifying the presence of specific diseases, such as distinguishing between benign and malignant tumors in histopathology slides or detecting pneumonia in chest X-rays. Segmentation: The precise delineation of anatomical structures or lesions. A notable example is the U-Net architecture, which is widely used for biomedical image segmentation to help clinicians quantify organ volume or track tumor growth. Detection: Automating the localization of small objects, such as identifying microcalcifications in mammograms or polyps during colonoscopies. Registration: The process of aligning multiple images to provide a comprehensive view of the patient's anatomy. Even with all of these enhancements, the application of deep learning in medical imaging requires accomplishing vigorous challenges. An example of these changes is building large, annotated datasets and creating the imperative for model interpretability in clinical decision-making. == Electronic Health Records == Electronic Health Records (EHRs) are a digital alternative to patient paper charts, usually including individual records or population health information. EHRs can be used in a wide variety of applications, including research and analysation as they often include demographics, diagnoses, medications, test results, and personal statistics. === History === ==== 1960s ==== The earliest precursor is considered Dr. Lawrence Weed's problem-oriented medical record (POMR) published in the 1968 which sorts and groups medical records by medical diagnoses and symptoms. The POMR was the first system to organize based off of patient information rather than the source (doctors, nurses, attendings, etc.). In 1969, the Regenstrief Institute developed and published the Regenstrief Medical Record System which established electronic writing, storage, and retrieval of records which served as the basis for modern EHR systems. ==== 2000s ==== In 2009, the Health Information Technology for Economic and Clinical Health Act (HITECH Act) was passed in the United States. This act standardized privacy and distribution of EHRs and increased the acceptance and utilization of EHRs within medical and academic settings. == Artificial Intelligence and Machine Learning Applications == Machine Learning and Artificial Intelligence have become central tools in biomedical data science. Recent advances in large language models (LLMs) have expanded their role beyond text, with models trained directly on genomic sequences enabling tasks such as gene function prediction, variant effect analysis, and drug discovery. In clinical settings, Natural Language Processing (NLP) models are applied to electronic health records to extract structured insights from unstructured clinical notes and data, supporting diagnosis and treatment planning. Beyond genomics, AI models have been applied to protein structure prediction. AlphaFold, developed by Google DeepMind, uses deep learning to predict three-dimensional protein structures from amino acid sequences with high accuracy. These predictions have been used to support drug target identification and the study of disease mechanisms. == Knowledge Graphs == Knowledge graphs (KGs) are widely used in biomedical data science to represent and analyze complex relationships among biological and medical entities. By structuring data as nodes (e.g., genes, diseases, drugs) and edges (relationships), KGs enable computational methods to extract insights and support decision-making. These biomedical relationships can be efficiently modeled and queried using technologies such as Neo4j. === Biomedical Research Applications === KGs provide biomedical researchers with a way to model complex biological systems. They have been used to identify the relationships between diseases and biomolecules, support drug repurposing, and to uncover new biological insights. Additional applications include: Identification of novel antibiotic resistance genes through graph-based link prediction. Finding associations between miRNA and diseases. Prediction of protein-protein interactions. === Clinical Applications === In clinical settings, KGs can be used to make visual representations of a patient's electronic health records. The data obtained from these graphs can assist healthcare providers in improving patient diagnoses and prescribing more effective drugs. Additionally, embeddings derived from resources like the Unified Medical Language System (UMLS) enable natural language processing of clinical text and similarity analysis between medical concepts. === Limitations === Despite their advantages, knowledge graphs face several challenges. Some of these include: High algorithmic complexity and large biological datasets make the process computationally expensive. KG construction can be a time-consuming process that requires careful attention to assign appropriate node types and vocabularies. Using data from a wide range of datasets in one KG requires them to be effectively integrated. == Privacy == A primary challenge in biomedical data science is maintaining medical privacy. Conducting research requires that data be collected on a number of people for training and testing purposes and is stored within biomedical datasets. This poses a risk for violating patient confidentiality and may dissuade people from participating in studies. The main sources of health statistics are surveys administrative and medical records health care claims data, vital records surveillance disease registries grey literature and peer-reviewed literature. Large data collection is a useful tool for researching various medical conditions. Researchers use these large datasets of information to identify factors that may make people more susceptible to certain diseases. Large amounts of collected data can help researchers identify patterns for disease probabilities. The findings can show a person is more likely for a condition, or identify environmental, social, and personal habits that may lead to adverse health issues. Institutions researching using personal medical information come with a moral and legal responsibility to protect the use of that information. Protection of the collected information has become a big concern. Sophisticated and coordinated attacks on certain medical systems happen more frequently. Medical companies, medical insurance and private businesses have invested a great deal into the protection of personal data. Despite this, data breaches continue to be documented. The chart below shows the top healthcare breaches in 2025. For these reasons, many people have reservations about giving up their personal data. Aside from the legitimate use of personal data there have been instances where companies have found methods to profit from brokering medical information. Concerns exist regarding unauthorized use of sensitive information within these data companies. If a person is identified within a dataset, then sensitive data can be used to discriminate against them. For example, insurance companies may charge a hi

    Read more →
  • Markov chain central limit theorem

    Markov chain central limit theorem

    In the mathematical theory of random processes, the Markov chain central limit theorem has a conclusion somewhat similar in form to that of the classic central limit theorem (CLT) of probability theory, but the quantity in the role taken by the variance in the classic CLT has a more complicated definition. See also the general form of Bienaymé's identity. == Statement == Suppose that: the sequence X 1 , X 2 , X 3 , … {\textstyle X_{1},X_{2},X_{3},\ldots } of random elements of some set is a Markov chain that has a stationary probability distribution; and the initial distribution of the process, i.e. the distribution of X 1 {\textstyle X_{1}} , is the stationary distribution, so that X 1 , X 2 , X 3 , … {\textstyle X_{1},X_{2},X_{3},\ldots } are identically distributed. In the classic central limit theorem these random variables would be assumed to be independent, but here we have only the weaker assumption that the process has the Markov property; and g {\textstyle g} is some (measurable) real-valued function for which var ⁡ ( g ( X 1 ) ) < + ∞ . {\textstyle \operatorname {var} (g(X_{1}))<+\infty .} Now let μ = E ⁡ ( g ( X 1 ) ) , μ ^ n = 1 n ∑ k = 1 n g ( X k ) σ 2 := lim n → ∞ var ⁡ ( n μ ^ n ) = lim n → ∞ n var ⁡ ( μ ^ n ) = var ⁡ ( g ( X 1 ) ) + 2 ∑ k = 1 ∞ cov ⁡ ( g ( X 1 ) , g ( X 1 + k ) ) . {\displaystyle {\begin{aligned}\mu &=\operatorname {E} (g(X_{1})),\\{\widehat {\mu }}_{n}&={\frac {1}{n}}\sum _{k=1}^{n}g(X_{k})\\\sigma ^{2}&:=\lim _{n\to \infty }\operatorname {var} ({\sqrt {n}}{\widehat {\mu }}_{n})=\lim _{n\to \infty }n\operatorname {var} ({\widehat {\mu }}_{n})=\operatorname {var} (g(X_{1}))+2\sum _{k=1}^{\infty }\operatorname {cov} (g(X_{1}),g(X_{1+k})).\end{aligned}}} Then as n → ∞ , {\textstyle n\to \infty ,} we have n ( μ ^ n − μ ) → D Normal ( 0 , σ 2 ) , {\displaystyle {\sqrt {n}}({\hat {\mu }}_{n}-\mu )\ {\xrightarrow {\mathcal {D}}}\ {\text{Normal}}(0,\sigma ^{2}),} where the decorated arrow indicates convergence in distribution. == Monte Carlo Setting == The Markov chain central limit theorem can be guaranteed for functionals of general state space Markov chains under certain conditions. In particular, this can be done with a focus on Monte Carlo settings. An example of the application in a MCMC (Markov Chain Monte Carlo) setting is the following: Consider a simple hard spheres model on a grid. Suppose X = { 1 , … , n 1 } × { 1 , … , n 2 } ⊆ Z 2 {\displaystyle X=\{1,\ldots ,n_{1}\}\times \{1,\ldots ,n_{2}\}\subseteq Z^{2}} . A proper configuration on X {\displaystyle X} consists of coloring each point either black or white in such a way that no two adjacent points are white. Let χ {\displaystyle \chi } denote the set of all proper configurations on X {\displaystyle X} , N χ ( n 1 , n 2 ) {\displaystyle N_{\chi }(n_{1},n_{2})} be the total number of proper configurations and π be the uniform distribution on χ {\displaystyle \chi } so that each proper configuration is equally likely. Suppose our goal is to calculate the typical number of white points in a proper configuration; that is, if W ( x ) {\displaystyle W(x)} is the number of white points in x ∈ χ {\displaystyle x\in \chi } then we want the value of E π W = ∑ x ∈ χ W ( x ) N χ ( n 1 , n 2 ) {\displaystyle E_{\pi }W=\sum _{x\in \chi }{\frac {W(x)}{N_{\chi }{\bigl (}n_{1},n_{2}{\bigr )}}}} If n 1 {\displaystyle n_{1}} and n 2 {\displaystyle n_{2}} are even moderately large then we will have to resort to an approximation to E π W {\displaystyle E_{\pi }W} . Consider the following Markov chain on χ {\displaystyle \chi } . Fix p ∈ ( 0 , 1 ) {\displaystyle p\in (0,1)} and set X 1 = x 1 {\displaystyle X_{1}=x_{1}} where x 1 ∈ χ {\displaystyle x_{1}\in \chi } is an arbitrary proper configuration. Randomly choose a point ( x , y ) ∈ X {\displaystyle (x,y)\in X} and independently draw U ∼ U n i f o r m ( 0 , 1 ) {\displaystyle U\sim \mathrm {Uniform} (0,1)} . If u ≤ p {\displaystyle u\leq p} and all of the adjacent points are black then color ( x , y ) {\displaystyle (x,y)} white leaving all other points alone. Otherwise, color ( x , y ) {\displaystyle (x,y)} black and leave all other points alone. Call the resulting configuration X 1 {\displaystyle X_{1}} . Continuing in this fashion yields a Harris ergodic Markov chain { X 1 , X 2 , X 3 , … } {\displaystyle \{X_{1},X_{2},X_{3},\ldots \}} having π {\displaystyle \pi } as its invariant distribution. It is now a simple matter to estimate E π W {\displaystyle E_{\pi }W} with w n ¯ = ∑ i = 1 n W ( X i ) / n {\displaystyle {\overline {w_{n}}}=\sum _{i=1}^{n}W(X_{i})/n} . Also, since χ {\displaystyle \chi } is finite (albeit potentially large) it is well known that X {\displaystyle X} will converge exponentially fast to π {\displaystyle \pi } which implies that a CLT holds for w n ¯ {\displaystyle {\overline {w_{n}}}} . == Implications == Not taking into account the additional terms in the variance which stem from correlations (e.g. serial correlations in markov chain monte carlo simulations) can result in the problem of pseudoreplication when computing e.g. the confidence intervals for the sample mean.

    Read more →
  • AI Clip Makers: Free vs Paid (2026)

    AI Clip Makers: Free vs Paid (2026)

    Shopping for the best AI clip maker? An AI clip maker is software that uses machine learning to help you get more done — it keeps getting smarter as the underlying models improve. Pricing, accuracy, and the size of the model behind the tool are the three factors that most affect daily usefulness. Whether you are a beginner or a pro, the right AI clip maker slots into your workflow and pays for itself fast. Below we compare features, pricing, and real output so you can choose with confidence.

    Read more →
  • Yun Sing Koh

    Yun Sing Koh

    Yun Sing Koh (born 1978) is a New Zealand computer science academic, and is a full professor at the University of Auckland, specialising in machine learning and artificial intelligence. She is a co-director of the Centre of Machine Learning for Social Good, and the Advanced Machine Learning and Data Analytics Research (MARS) Lab at Auckland. == Academic career == Koh earned a Bachelor of Science with Honours and a Master of Software Engineering at the University of Malaya. She then completed a PhD titled Generating sporadic association rules at the University of Otago in 2007. Koh joined the faculty of the University of Auckland in 2010, rising to full professor. As of 2024, she is director of the Centre of Machine Learning for Social Good at Auckland, alongside Gillian Dobbie and Daniel Wilson, and is director of the Master of AI course at the university. Koh also co-directs the Advanced Machine Learning and Data Analytics Research (MARS) Lab. Koh's research covers machine learning and artificial intelligence. She is especially interested in designing machine learning algorithms for data streams, and has led research using AI systems to identify individual stoats for pest population research. In 2018 she was awarded a Marsden grant for a research project "An Adaptive Predictive System for Life-long Learning on Data Streams", and has been part of three MBIE projects. In 2025 the stoat identification project Koh co-leads with Daniel Wilson was awarded $1 million per annum by the MBIE Smart Ideas fund. Koh was a finalist in the AI in Climate section of the Women in AI Australia and New Zealand Awards in 2022. She was a 2023 Fellow at the United States National Science Foundation-funded Convergence Research (CORE) Institute. Koh has chaired a number of sessions at international conferences on data mining. In March 2026 it was announced that Koh would be a member of the New Zealand Human Rights Commission's Expert Advisory Group on Artificial Intelligence, Emerging Digital Technologies and Human Rights. == Selected works == Philippe Fournier-Viger; Jerry Chun-Wei Lin; Rage Uday Kiran; Yun Sing Koh; Rincy Thomas (2017). "A Survey of Sequential Pattern Mining". Data Science and Pattern Recognition. 1 (1): 54–77. Wikidata Q138719481. Yun Sing Koh; Nathan Rountree; Richard O’Keefe (1 April 2006). "Finding Non-Coincidental Sporadic Rules Using Apriori-Inverse". International Journal of Data Warehousing and Mining (in Ndonga). 2 (2): 38–54. doi:10.4018/JDWM.2006040102. ISSN 1548-3924. Wikidata Q125185222. Russel Pears; Sripirakas Sakthithasan; Yun Sing Koh (11 January 2014). "Detecting concept change in dynamic data streams". Machine Learning. 97 (3): 259–293. doi:10.1007/S10994-013-5433-9. ISSN 1573-0565. Zbl 1319.68186. Wikidata Q125185156. David Tse Jung Huang; Yun Sing Koh; Gillian Dobbie; Russel Pears (December 2014), Detecting Volatility Shift in Data Streams, Institute of Electrical and Electronics Engineers, doi:10.1109/ICDM.2014.50, Wikidata Q125185151 Sidney Tsang; Yun Sing Koh; Gillian Dobbie (2011). "RP-Tree: Rare Pattern Tree Mining". Lecture Notes in Computer Science: 277–288. doi:10.1007/978-3-642-23544-3_21. ISSN 0302-9743. Wikidata Q125185206. Yun Sing Koh; Sri Devi Ravana (24 May 2016). "Unsupervised Rare Pattern Mining". ACM Transactions on Knowledge Discovery from Data. 10 (4): 1–29. doi:10.1145/2898359. ISSN 1556-4681. Wikidata Q125185136. Jack Julian; Yun Sing Koh; Albert Bifet (1 October 2025), Building adaptive knowledge bases for evolving continual learning models (PDF), vol. 1, doi:10.1038/S44387-025-00028-4, Wikidata Q138719496

    Read more →
  • Automatic taxonomy construction

    Automatic taxonomy construction

    Automatic taxonomy construction (ATC) is the use of software programs to generate taxonomical classifications from a body of texts called a corpus. ATC is a branch of natural language processing, which in turn is a branch of artificial intelligence. A taxonomy (or taxonomical classification) is a scheme of classification, especially, a hierarchical classification, in which things are organized into groups or types. Among other things, a taxonomy can be used to organize and index knowledge (stored as documents, articles, videos, etc.), such as in the form of a library classification system, or a search engine taxonomy, so that users can more easily find the information they are searching for. Many taxonomies are hierarchies (and thus, have an intrinsic tree structure), but not all are. Manually developing and maintaining a taxonomy is a labor-intensive task requiring significant time and resources, including familiarity of or expertise in the taxonomy's domain (scope, subject, or field), which drives the costs and limits the scope of such projects. Also, domain modelers have their own points of view which inevitably, even if unintentionally, work their way into the taxonomy. ATC uses artificial intelligence techniques to quickly automatically generate a taxonomy for a domain in order to avoid these problems and remove limitations. == Approaches == There are several approaches to ATC. One approach is to use rules to detect patterns in the corpus and use those patterns to infer relations such as hyponymy. Other approaches use machine learning techniques such as Bayesian inferencing and Artificial Neural Networks. === Keyword extraction === One approach to building a taxonomy is to automatically gather the keywords from a domain using keyword extraction, then analyze the relationships between them (see Hyponymy, below), and then arrange them as a taxonomy based on those relationships. === Hyponymy and "is-a" relations === In ATC programs, one of the most important tasks is the discovery of hypernym and hyponym relations among words. One way to do that from a body of text is to search for certain phrases like "is a" and "such as". In linguistics, is-a relations are called hyponymy. Words that describe categories are called hypernyms and words that are examples of categories are hyponyms. For example, dog is a hypernym and Fido is one of its hyponyms. A word can be both a hyponym and a hypernym. So, dog is a hyponym of mammal and also a hypernym of Fido. Taxonomies are often represented as is-a hierarchies where each level is more specific than (in mathematical language "a subset of") the level above it. For example, a basic biology taxonomy would have concepts such as mammal, which is a subset of animal, and dogs and cats, which are subsets of mammal. This kind of taxonomy is called an is-a model because the specific objects are considered instances of a concept. For example, Fido is-a instance of the concept dog and Fluffy is-a cat. == Applications == ATC can be used to build taxonomies for search engines, to improve search results. ATC systems are a key component of ontology learning (also known as automatic ontology construction), and have been used to automatically generate large ontologies for domains such as insurance and finance. They have also been used to enhance existing large networks such as Wordnet to make them more complete and consistent. == ATC software == == Other names == Other names for automatic taxonomy construction include: Automated outline building Automated outline construction Automated outline creation Automated outline extraction Automated outline generation Automated outline induction Automated outline learning Automated outlining Automated taxonomy building Automated taxonomy construction Automated taxonomy creation Automated taxonomy extraction Automated taxonomy generation Automated taxonomy induction Automated taxonomy learning Automatic outline building Automatic outline construction Automatic outline creation Automatic outline extraction Automatic outline generation Automatic outline induction Automatic outline learning Automatic taxonomy building Automatic taxonomy creation Automatic taxonomy extraction Automatic taxonomy generation Automatic taxonomy induction Automatic taxonomy learning Outline automation Outline building Outline construction Outline creation Outline extraction Outline generation Outline induction Outline learning Semantic taxonomy building Semantic taxonomy construction Semantic taxonomy creation Semantic taxonomy extraction Semantic taxonomy generation Semantic taxonomy induction Semantic taxonomy learning Taxonomy automation Taxonomy building Taxonomy construction Taxonomy creation Taxonomy extraction Taxonomy generation Taxonomy induction Taxonomy learning

    Read more →
  • Internettolken

    Internettolken

    Internettolken (or InternetPreter) is a web-based machine translating tool. As the first Swedish online translating service, it was started in 2002 and included the English and Swedish languages. Today, there are 14 languages with more than 120 possible combinations. The service is free up to 150 words per day, and as a 2,000-word free testing account. It is available both on its website, and as a gadget on iGoogle. The interface is either English or Swedish. Being a dictionary-based tool, with its own translation software, it can sometimes offer a more accurate translation than Google Translate and others, although the grammar will be incorrect. == Languages currently available ==

    Read more →
  • Brainware

    Brainware

    Brainware was an American software company that marketed Automatic identification and data capture and data extraction products. The company was acquired by Hyland Software in 2017. Brainware originally spun out of Dulles, Virginia-based SER Solutions Inc. in February 2006 when SER was acquired by The Gores Group LLC. From February 2006 to March 2012, Brainware's majority owner was San Francisco-based private equity firm Vista Equity Partners. == History == On March 5, 2012, Lexmark International announced it had acquired the company for a cash price of approximately $148 million. The company was added to Lexmark's Perceptive Software division. On July 10, 2017, Hyland Software finalized its acquisition of the Perceptive Business Unit of Lexmark International, Inc. All enterprise software business assets in the Perceptive business unit, including Perceptive Content (formerly ImageNow), Perceptive Intelligent Capture (formerly Brainware), Acuo VNA, PACSGEAR, Claron, Nolij, Saperion, Pallas Athena, ISYS and Twistage, now operate under Hyland's portfolio of products. Brainware was headquartered in Ashburn, Virginia, USA, with sales, support, professional services and R&D offices in London, UK; Kirchzarten, Germany; and Neuchâtel, Switzerland. The company had partnerships with most major enterprise software providers, including Oracle, SAP and Microsoft, and said its software integrated with most available enterprise content management platforms. Brainware also partnered with a number of hardware providers, including Hewlett-Packard, Fujitsu and OPEX. Brainware's core solution, Distiller, "disrupted the data capture industry by using contextual document data to deliver higher automated processing than earlier technology" said Henry Ijams, Managing Director and Founder, PayStream Advisors. Brainware was awarded a Technology Excellence Award by PayStream Advisors and their Advisory Board to honor those providers who are delivering industry leading solutions. Brainware said its software "could relieve a company of 60 percent to 80 percent of the work of manually keying in information from unstructured documents," and serviced companies such as NEC, Mayo Clinic, Bechtel, Royal Dutch Shell, and Rabobank. In a 2011 comparison report, Real Story Group classifies Brainware as a "Capture Solutions" vendor, competing directly with Kofax and ReadSoft. Brainware and its customers were profiled in publications including Profit Online, Business Finance, imageSource, Managing Automation, Industryweek, Treasury & Risk and others. The company's enterprise search technology has been profiled by InfoWorld.

    Read more →
  • Suffix automaton

    Suffix automaton

    In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string S {\displaystyle S} is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string. In terms of automata theory, a suffix automaton is the minimal partial deterministic finite automaton that recognizes the set of suffixes of a given string S = s 1 s 2 … s n {\displaystyle S=s_{1}s_{2}\dots s_{n}} . The state graph of a suffix automaton is called a directed acyclic word graph (DAWG), a term that is also sometimes used for any deterministic acyclic finite state automaton. Suffix automata were introduced in 1983 by a group of scientists from the University of Denver and the University of Colorado Boulder. They suggested a linear time online algorithm for its construction and showed that the suffix automaton of a string S {\displaystyle S} having length at least two characters has at most 2 | S | − 1 {\textstyle 2|S|-1} states and at most 3 | S | − 4 {\textstyle 3|S|-4} transitions. Further works have shown a close connection between suffix automata and suffix trees, and have outlined several generalizations of suffix automata, such as compacted suffix automaton obtained by compression of nodes with a single outgoing arc. Suffix automata provide efficient solutions to problems such as substring search and computation of the largest common substring of two and more strings. == History == The concept of suffix automaton was introduced in 1983 by a group of scientists from University of Denver and University of Colorado Boulder consisting of Anselm Blumer, Janet Blumer, Andrzej Ehrenfeucht, David Haussler and Ross McConnell, although similar concepts had earlier been studied alongside suffix trees in the works of Peter Weiner, Vaughan Pratt and Anatol Slissenko. In their initial work, Blumer et al. showed a suffix automaton built for the string S {\displaystyle S} of length greater than 1 {\displaystyle 1} has at most 2 | S | − 1 {\displaystyle 2|S|-1} states and at most 3 | S | − 4 {\displaystyle 3|S|-4} transitions, and suggested a linear algorithm for automaton construction. In 1983, Mu-Tian Chen and Joel Seiferas independently showed that Weiner's 1973 suffix-tree construction algorithm while building a suffix tree of the string S {\displaystyle S} constructs a suffix automaton of the reversed string S R {\textstyle S^{R}} as an auxiliary structure. In 1987, Blumer et al. applied the compressing technique used in suffix trees to a suffix automaton and invented the compacted suffix automaton, which is also called the compacted directed acyclic word graph (CDAWG). In 1997, Maxime Crochemore and Renaud Vérin developed a linear algorithm for direct CDAWG construction. In 2001, Shunsuke Inenaga et al. developed an algorithm for construction of CDAWG for a set of words given by a trie. == Definitions == Usually when speaking about suffix automata and related concepts, some notions from formal language theory and automata theory are used, in particular: "Alphabet" is a finite set Σ {\displaystyle \Sigma } that is used to construct words. Its elements are called "characters"; "Word" is a finite sequence of characters ω = ω 1 ω 2 … ω n {\displaystyle \omega =\omega _{1}\omega _{2}\dots \omega _{n}} . "Length" of the word ω {\displaystyle \omega } is denoted as | ω | = n {\displaystyle |\omega |=n} ; "Formal language" is a set of words over given alphabet; "Language of all words" is denoted as Σ ∗ {\displaystyle \Sigma ^{}} (where the "" character stands for Kleene star), "empty word" (the word of zero length) is denoted by the character ε {\displaystyle \varepsilon } ; "Concatenation of words" α = α 1 α 2 … α n {\displaystyle \alpha =\alpha _{1}\alpha _{2}\dots \alpha _{n}} and β = β 1 β 2 … β m {\displaystyle \beta =\beta _{1}\beta _{2}\dots \beta _{m}} is denoted as α ⋅ β {\displaystyle \alpha \cdot \beta } or α β {\displaystyle \alpha \beta } and corresponds to the word obtained by writing β {\displaystyle \beta } to the right of α {\displaystyle \alpha } , that is, α β = α 1 α 2 … α n β 1 β 2 … β m {\displaystyle \alpha \beta =\alpha _{1}\alpha _{2}\dots \alpha _{n}\beta _{1}\beta _{2}\dots \beta _{m}} ; "Concatenation of languages" A {\displaystyle A} and B {\displaystyle B} is denoted as A ⋅ B {\displaystyle A\cdot B} or A B {\displaystyle AB} and corresponds to the set of pairwise concatenations A B = { α β : α ∈ A , β ∈ B } {\displaystyle AB=\{\alpha \beta :\alpha \in A,\beta \in B\}} ; If the word ω ∈ Σ ∗ {\displaystyle \omega \in \Sigma ^{}} may be represented as ω = α γ β {\displaystyle \omega =\alpha \gamma \beta } , where α , β , γ ∈ Σ ∗ {\displaystyle \alpha ,\beta ,\gamma \in \Sigma ^{}} , then words α {\displaystyle \alpha } , β {\displaystyle \beta } and γ {\displaystyle \gamma } are called "prefix", "suffix" and "subword" (substring) of the word ω {\displaystyle \omega } correspondingly; If T = T 1 … T n {\displaystyle T=T_{1}\dots T_{n}} and T l T l + 1 … T r = S {\displaystyle T_{l}T_{l+1}\dots T_{r}=S} (with 1 ≤ l ≤ r ≤ n {\displaystyle 1\leq l\leq r\leq n} ) then S {\displaystyle S} is said to "occur" in T {\displaystyle T} as a subword. Here l {\displaystyle l} and r {\displaystyle r} are called left and right positions of occurrence of S {\displaystyle S} in T {\displaystyle T} correspondingly. == Automaton structure == Formally, deterministic finite automaton is determined by 5-tuple A = ( Σ , Q , q 0 , F , δ ) {\displaystyle {\mathcal {A}}=(\Sigma ,Q,q_{0},F,\delta )} , where: Σ {\displaystyle \Sigma } is an "alphabet" that is used to construct words, Q {\displaystyle Q} is a set of automaton "states", q 0 ∈ Q {\displaystyle q_{0}\in Q} is an "initial" state of automaton, F ⊂ Q {\displaystyle F\subset Q} is a set of "final" states of automaton, δ : Q × Σ ↦ Q {\displaystyle \delta :Q\times \Sigma \mapsto Q} is a partial "transition" function of automaton, such that δ ( q , σ ) {\displaystyle \delta (q,\sigma )} for q ∈ Q {\displaystyle q\in Q} and σ ∈ Σ {\displaystyle \sigma \in \Sigma } is either undefined or defines a transition from q {\displaystyle q} over character σ {\displaystyle \sigma } . Most commonly, deterministic finite automaton is represented as a directed graph ("diagram") such that: Set of graph vertices corresponds to the state of states Q {\displaystyle Q} , Graph has a specific marked vertex corresponding to initial state q 0 {\displaystyle q_{0}} , Graph has several marked vertices corresponding to the set of final states F {\displaystyle F} , Set of graph arcs corresponds to the set of transitions δ {\displaystyle \delta } , Specifically, every transition δ ( q 1 , σ ) = q 2 {\textstyle \delta (q_{1},\sigma )=q_{2}} is represented by an arc from q 1 {\displaystyle q_{1}} to q 2 {\displaystyle q_{2}} marked with the character σ {\displaystyle \sigma } . This transition also may be denoted as q 1 σ ⟶ q 2 {\textstyle q_{1}{\begin{smallmatrix}{\sigma }\\[-5pt]{\longrightarrow }\end{smallmatrix}}q_{2}} . In terms of its diagram, the automaton recognizes the word ω = ω 1 ω 2 … ω m {\displaystyle \omega =\omega _{1}\omega _{2}\dots \omega _{m}} only if there is a path from the initial vertex q 0 {\displaystyle q_{0}} to some final vertex q ∈ F {\displaystyle q\in F} such that concatenation of characters on this path forms ω {\displaystyle \omega } . The set of words recognized by an automaton forms a language that is set to be recognized by the automaton. In these terms, the language recognized by a suffix automaton of S {\displaystyle S} is the language of its (possibly empty) suffixes. === Automaton states === "Right context" of the word ω {\displaystyle \omega } with respect to language L {\displaystyle L} is a set [ ω ] R = { α : ω α ∈ L } {\displaystyle [\omega ]_{R}=\{\alpha :\omega \alpha \in L\}} that is a set of words α {\displaystyle \alpha } such that their concatenation with ω {\displaystyle \omega } forms a word from L {\displaystyle L} . Right contexts induce a natural equivalence relation [ α ] R = [ β ] R {\displaystyle [\alpha ]_{R}=[\beta ]_{R}} on the set of all words. If language L {\displaystyle L} is recognized by some deterministic finite automaton, there exists unique up to isomorphism automaton that recognizes the same language and has the minimum possible number of states. Such an automaton is called a minimal automaton for the given language L {\displaystyle L} . Myhill–Nerode theorem allows it to define it explicitly in terms of right contexts: In these terms, a "suffix automaton" is the minimal deterministic finite automaton recognizing the language of suffixes of the word S = s 1 s 2 … s n {\displaystyle S=s_{1}s_{2}\dots s_{n}} . The right context of the word ω {\displaystyle \omeg

    Read more →
  • Apache Drill

    Apache Drill

    Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system. Drill is an Apache top-level project. Drill supports a variety of NoSQL databases and file systems, including Alluxio, HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. Drill's datastore-aware optimizer automatically restructures a query plan to leverage the datastore's internal processing capabilities. In addition, Drill supports data locality, if Drill and the datastore are on the same nodes. Tom Shiran is the founder of the Apache Drill Project. It was designated an Apache Software Foundation top-level project in December 2016. == Features == One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. Schema-free JSON document model similar to MongoDB and Elasticsearch, without requiring a formal schema to be declared Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs Extremely user and developer friendly Pluggable architecture enables connectivity to multiple datastores Version 1.9 added dynamic user-defined functions Version 1.11 added cryptographic-related functions and PCAP file format support == Back-end support == Drill is primarily focused on non-relational datastores, including Apache Hadoop text files, NoSQL, and cloud storage. A notable feature also includes in situ querying of local JSON and Apache Parquet files. Some additional datastores that it supports include: All Hadoop distributions (HDFS API 2.3+), including Apache Hadoop, MapR, CDH and Amazon EMR NoSQL: MongoDB, Apache HBase, Apache Cassandra Online Analytical Processing: Apache Kudu, Apache Druid, OpenTSDB Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage, Swift, IBM Cloud Object Storage Diverse data formats, including Apache Avro, Apache Parquet and JSON RDBMs storage plugins (Using JDBC to connect to MySQL, PostgreSQL, and others) A new datastore can be added by developing a storage plugin. Drill's "schema-free" JSON data model enables it to query non-relational datastores in-situ . == Front-end support == Drill itself can be queried via JDBC, ODBC, or REST through a variety of methods and languages including Python and Java. The default install includes a web interface allowing end-users to execute ANSI SQL directly and export data tables as CSV files without any programming. The dashboard library, Apache Superset, is particularly well suited for visualization of data queried with Drill.

    Read more →
  • Isabelle Guyon

    Isabelle Guyon

    Isabelle Guyon (French pronunciation: [izabɛl ɡɥijɔ̃]; born August 15, 1961) is a French-born researcher in machine learning known for her work on support-vector machines, artificial neural networks and bioinformatics. She is a Chair Professor at the University of Paris-Saclay. Guyon serves as the Director of Research at Google DeepMind since October 2022. She is considered to be a pioneer in the field, with her contribution to the support-vector machines with Vladimir Vapnik and Bernhard Boser. == Biography == After graduating from the French engineering school ESPCI Paris in 1985, she joined the group of Gerard Dreyfus at the Université Pierre-et-Marie-Curie to do a PhD on neural networks architectures and training. Guyon defended her thesis in 1988 and was hired the year after at AT&T Bell Laboratories, first as a post-doc, then as a group leader. She worked at Bell Labs for six years, where she explored several research areas, from neural networks to pattern recognition and computational learning theory, with application to handwriting recognition. She collaborated with Yann LeCun, Léon Bottou, Vladimir Vapnik, Corinna Cortes, Yoshua Bengio, Patrice Simard, and met her future husband, Bernhard Boser. In 1996, Guyon left Bell Labs and raised her children at Berkeley, California. In Berkeley, she created her own machine learning consulting company, Clopinet. She became interested in medical applications, and used her previous work to classify the genes responsible for different types of cancers. Since 2003, Guyon has organized many challenges in data science, in order to stimulate research in this field. She founded ChaLearn in 2011, a non-profit organization aimed at creating machine learning challenges open to everyone. She was Program Chair of NeurIPS 2016 and became General Chair of NeurIPS in 2017. She is also Action Editor for the Journal of Machine Learning Research and Series Editor for Series: Challenges in Machine Learning. She is a member of the European Laboratory for Learning and Intelligent Systems. In 2016, Guyon came back to France to take the Chair Professorship in Big data between the University of Paris-Saclay and INRIA. She works in TAU (TAckling the Underspecified), a research collaboration of the Laboratoire de recherche en informatique. Together with Bernhard Schölkopf and Vladimir Vapnik, she received in 2020 the BBVA Foundation Frontiers of Knowledge Awards for her work in machine learning. == Scientific work == Guyon has worked in many subfields of machine learning, including neural networks, support-vector machines, feature selection and applications of machine learning to biology. === Support-vector machines === Among her most notable contributions, Guyon co-invented support-vector machines (SVM) in 1992, with Bernhard Boser and Vladimir Vapnik. SVM is a supervised machine learning algorithm, comparable to neural networks or decision trees, which has quickly become a classical technique in machine learning. SVMs have especially contributed to the popularization of kernel methods. === Neural networks === During her years at Bell Labs, Guyon took part of numerous projects involving neural networks. In particular, she wrote some of the first papers on the use of neural network for handwriting recognition using the MNIST database. She is also a co-inventor of the siamese neural networks, a neural network architecture used to learn similarities, with applications to signature, face or object recognition. === Machine learning for biology === Guyon is the author of many publications at the intersection of biology (cancer research and genomics) and artificial intelligence. She has notably introduced the use of support-vector machines to detect cancer using genes. === Machine learning challenges === Through her non-profit organization ChaLearn, Guyon has organized and directed challenges open to everyone in order to solve open problems in machine learning, including computer vision, neurosciences, particle physics, feature selection, causality and automated machine learning. Most of the challenges organized by ChaLearn have resulted in publications. Among the most cited ones are: Guyon et al., Result analysis of the NIPS 2003 feature selection challenge, Advances in neural information processing systems, 2005, link Escalera et al., ChaLearn Looking at People Challenge 2014: Dataset and Results, Computer Vision - ECCV 2014 Workshops, Springer International Publishing, 2014, link Guyon et al., A brief Review of the ChaLearn AutoML Challenge, JMLR: Workshop and Conference Proceedings 64:21-30, 2016, link Adam-Bourdario et al., The Higgs boson machine learning challenge, JMLR: Workshop and Conference Proceedings 42:19-55, 2015, link == Private life == She is married to Bernhard Boser, a professor at UC Berkeley. She has twins and one daughter, all three of whom have completed a science degree. Guyon has three citizenships: French by birth, Swiss by marriage and American by naturalization. == Awards and honors == Nomination at the French Academy of technologies (2024) Recipient of the BBVA Foundation Frontiers of Knowledge Awards (2020) American Medical Informatics Association Fellow (2011) == Publications == Bernhard Boser, Isabelle Guyon and Vladmir Vapnik, A training algorithm for optimal margin classifiers, Proceedings of the fifth annual workshop on Computational learning theory, 1992, doi:10.1145/130385.130401 Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger and Roopak Shah, Signature verification using a" siamese" time delay neural network, Advances in Neural Information Processing Systems, 1994. Isabelle Guyon and André Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research, 2003. Isabelle Guyon, Jason Weston, Stephen Barnhill and Vladimir Vapnik, Gene selection for cancer classification using support vector machines, Machine Learning, Kluwer Academic Publishers, 2002, doi:10.1023/A:1012487302797

    Read more →
  • Extended affix grammar

    Extended affix grammar

    In computer science, extended affix grammars (EAGs) are a formal grammar formalism for describing the context free and context sensitive syntax of language, both natural language and programming languages. EAGs are a member of the family of two-level grammars; more specifically, a restriction of Van Wijngaarden grammars with the specific purpose of making parsing feasible. Like Van Wijngaarden grammars, EAGs have hyperrules that form a context-free grammar except in that their nonterminals may have arguments, known as affixes, the possible values of which are supplied by another context-free grammar, the metarules. EAGs were introduced and studied by D.A. Watt in 1974; recognizers were developed at the University of Nijmegen between 1985 and 1995. The EAG compiler developed there will generate either a recogniser, a transducer, a translator, or a syntax directed editor for a language described in the EAG formalism. The formalism is quite similar to Prolog, to the extent that it borrowed its cut operator. EAGs have been used to write grammars of natural languages such as English, Spanish, and Hungarian. The aim was to verify the grammars by making them parse corpora of text (corpus linguistics); hence, parsing had to be sufficiently practical. However, the parse tree explosion problem that ambiguities in natural language tend to produce in this type of approach is worsened for EAGs because each choice of affix value may produce a separate parse, even when several different values are equivalent. The remedy proposed was to switch to the much simpler Affix Grammar over a Finite Lattice (AGFL) instead, in which metagrammars can only produce simple finite languages.

    Read more →
  • How to Choose an AI Bug Finder

    How to Choose an AI Bug Finder

    Comparing the best AI bug finder? An AI bug finder is software that uses machine learning to help you get more done — it lowers the barrier so anyone can produce professional output. Privacy matters too: check whether your data trains the model and whether a no-log or enterprise tier is available. Whether you are a beginner or a pro, the right AI bug finder slots into your workflow and pays for itself fast. Below we compare features, pricing, and real output so you can choose with confidence.

    Read more →
  • The Triple Revolution

    The Triple Revolution

    "The Triple Revolution" was an open memorandum sent to U.S. President Lyndon B. Johnson and other government figures on March 22, 1964. It concerned three megatrends of the time: increasing use of automation, the nuclear arms race, and advancements in human rights. Drafted under the auspices of the Center for the Study of Democratic Institutions, it was signed by an array of noted social activists, professors, and technologists who identified themselves as the Ad Hoc Committee on the Triple Revolution. The chief initiator of the proposal was W. H. "Ping" Ferry, at that time a vice-president of CSDI, basing it in large part on the ideas of the futurist Robert Theobald. == Overview == The statement identified three revolutions underway in the world: the cybernation revolution of increasing automation; the weaponry revolution of mutually assured destruction; and the human rights revolution. It discussed primarily the cybernation revolution. The committee claimed that machines would usher in "a system of almost unlimited productive capacity" while continually reducing the number of manual laborers needed, and increasing the skill needed to work, thereby producing increasing levels of unemployment. It proposed that the government should ease this transformation through large-scale public works, low-cost housing, public transit, electrical power development, income redistribution, union representation for the unemployed, and government restraint on technology deployment. == Legacy == Martin Luther King Jr.'s final Sunday sermon, delivered six days before his April 1968 assassination, explicitly references the thesis of "The Triple Revolution": There can be no gainsaying of the fact that a great revolution is taking place in the world today. In a sense it is a triple revolution: that is, a technological revolution, with the impact of automation and cybernation; then there is a revolution in weaponry, with the emergence of atomic and nuclear weapons of warfare; then there is a human rights revolution, with the freedom explosion that is taking place all over the world. Yes, we do live in a period where changes are taking place. And there is still the voice crying through the vista of time saying, "Behold, I make all things new; former things are passed away." In Harlan Ellison's 1967 anthology Dangerous Visions, Philip José Farmer's story "Riders of the Purple Wage" uses the Triple Revolution document as the premise of a future society, in which the "purple wage" of the title is a guaranteed income dole on which most of the population lives. At the 1968 World Science Fiction Convention in San Francisco, Farmer delivered a lengthy Guest of Honor speech in which he called for the founding of a grassroots activist organization called REAP which would work for implementation of the Ad Hoc Committee's recommendations. Looking back on the proposal in his 2008 book, Daniel Bell wrote: "the cybernetic revolution quickly proved to be illusory. There were no spectacular jumps in productivity. ... Cybernation had proved to be one more instance of the penchant for overdramatizing a momentary innovation and blowing it up far out of proportion to its actuality. ... The image of a completely automated production economy—with an endless capacity to turn out goods—was simply a social-science fiction of the early 1960s. Paradoxically, the vision of Utopia was suddenly replaced by the spectre of Doomsday. In place of the early-sixties theme of endless plenty, the picture by the end of the decade was one of a fragile planet of limited resources whose finite stocks were being rapidly depleted, and whose wastes from soaring industrial production were polluting the air and waters." In his 2015 book Rise of the Robots, Martin Ford claims The Triple Revolution's predictions of steady decline in future employment were not wrong, but rather premature. He cites "Seven Deadly Trends" that began in the 1970s-1980s and by the mid-2010s appeared set to continue: Stagnation in real wages Decline in labor's share of national income in many countries (breakdown of Bowley's law), while corporate profits increased Declining labor force participation Diminishing job creation, lengthening jobless recoveries, and soaring long-term unemployment Rising inequality Declining incomes, and underemployment for recent college graduates Polarization and part-time jobs (middle-class jobs are disappearing, to be replaced by a small number of high-paying jobs and large number of low-paying jobs) According to Ford, the 1960s were part of what in retrospect seems like a golden age for labor in the United States, when productivity and wages rose together in near lockstep, and unemployment was low. But after about 1980, wages began stagnating while productivity continued to rise. Labor's share of the economic output began to decline. Ford describes the role that automation and information technology play in these trends, and how new technologies including narrow AI threaten to destroy jobs faster than displaced workers can be retrained for new jobs, before automation takes the new jobs as well. This includes many job categories, such as in transportation, that were never threatened by automation before. According to a 2013 study, about 47% of US jobs are susceptible to automation. == Signatories ==

    Read more →
  • Karen Spärck Jones

    Karen Spärck Jones

    Karen Ida Boalth Spärck Jones (26 August 1935 – 4 April 2007) was a self-taught programmer and a pioneering British computer and information scientist responsible for the concept of inverse document frequency (IDF), a technology that underlies most modern search engines. She was an advocate for women in computer science, her slogan being, "Computing is too important to be left to men." In 2019, The New York Times published her belated obituary in its series Overlooked, calling her "a pioneer of computer science for work combining statistics and linguistics, and an advocate for women in the field." From 2008, to recognise her achievements in the fields of information retrieval (IR) and natural language processing (NLP), the Karen Spärck Jones Award is awarded annually to a recipient for outstanding research in one or both of her fields. == Early life and education == Karen Ida Boalth Spärck Jones was born in Huddersfield, Yorkshire, England. Her parents were Alfred Owen Jones, a chemistry lecturer, and Ida Spärck, a Norwegian who worked for the Norwegian government while in exile in London during World War II. Spärck Jones was educated at a grammar school in Huddersfield and then from 1953 to 1956 at Girton College, Cambridge, studying history, with an additional final year in Moral Sciences (philosophy). While at Cambridge, Spärck Jones joined the organisation known as the Cambridge Language Research Unit (CLRU) and met the head of CLRU Margaret Masterman, who would inspire her to go into computer science. While working at the CLRU, Spärck Jones began pursuing her PhD. At the time of submission, her PhD thesis was cast aside as uninspired and lacking original thought, but was later published in its entirety as a book. She briefly became a school teacher before moving into computer science. Spärck Jones married fellow Cambridge computer scientist Roger Needham in 1958. Spärck Jones's mother, Ida Spärck, had fled Norway on one of the last boats out after the German invasion in April 1940, going on to serve the Norwegian government in exile in London throughout the war. This background of displacement and resilience shaped the household in which Spärck Jones grew up. She later kept her mother's Norwegian surname professionally after marrying, stating that "it maintains a permanent existence of your own." Spärck Jones described her entry into computing as almost accidental. She had been working as a schoolteacher when she began visiting the CLRU out of curiosity about her husband's work. It was Margaret Masterman — whom she later described as "a very strange and interesting woman" — who offered her a research position and drew her fully into the field. == Career == Spärck Jones worked at the Cambridge Language Research Unit from the late 1950s, then at Cambridge University Computer Laboratory from 1974 until her retirement in 2002. From 1999, she held the post of Professor of Computers and Information. She had been given a permanent position only in 1993, and earlier in her career had been employed on a series of short-term contracts. She continued to work in the Computer Laboratory until shortly before her death. Her publications include nine books and numerous papers. A full list of her publications is available from the Cambridge Computer Laboratory. Spärck Jones' main research interests, since the late 1950s, were natural language processing and information retrieval. In 1964, Spärck Jones published "Synonymy and Semantic Classification", which is now seen as a foundational paper in the field of natural language processing. One of her most important contributions was the concept of inverse document frequency (IDF) weighting in information retrieval, which she introduced in a 1972 paper. IDF is used in most search engines today, usually as part of the term frequency–inverse document frequency (TF–IDF) weighting scheme. In the 1980s, Spärck Jones began her work on early speech recognition systems. In 1982 she became involved in the Alvey Programme which was an initiative to motivate more computer science research across the country. == Significance of inverse document frequency == At the time Spärck Jones was working, most computer scientists were focused on making people adapt to machines — learning precise codes and commands to retrieve information. Spärck Jones was working in the opposite direction: teaching computers to understand human language as it is actually used. Her 1972 paper introduced the concept of inverse document frequency (IDF) by observing that not all words carry equal informational value. A word like "the" appears in virtually every document and tells a retrieval system almost nothing about what any specific document is about. A rare word like "photosynthesis," by contrast, is highly specific and informative. IDF assigns each word a statistical weight based on how rarely it occurs across a document collection — the rarer the word, the higher its weight. When combined with term frequency (TF), which measures how often a word appears within a single document, the resulting TF–IDF score gives every word a relevance rating that can be used to rank documents in response to a search query. By 2007, Spärck Jones noted that "pretty much every web engine uses those principles." Her colleague John Tait remarked that "a lot of the stuff she was working on until five or ten years ago seemed like mad nonsense, and now we take it for granted." The 1972 paper remains among the most cited works in information retrieval research, with over 4,500 citations recorded in Google Scholar at the time of her death. The conceptual foundation of TF–IDF — that word meaning is statistical and contextual — has also informed later developments in machine learning and natural language processing, including transformer-based language models such as BERT. == Impact on artificial intelligence == Even though Spärck Jones' views on artificial intelligence (AI) were rather pessimistic in regard to the perceived limitations of AI in information retrieval, her work in natural language processing, information retrieval, and introducing the concept of inverse document frequency (IDF) contributed to the future technological development of AI. Her statistical and ranking methods shifted the direction of the development of AI towards being more expandable and led by data. Her work had a more indirect and conceptual impact on AI, compared to the current and direct impact it has had on search engines. == Gender and advocacy == Spärck Jones spent the majority of her career at Cambridge on short-term contracts without permanent employment, a situation she attributed directly to gender. In her 2001 IEEE oral history interview she stated that Cambridge was "in many ways not user-friendly, in the sense of women-friendly." She was frequently the only woman present in professional meetings throughout her career. She channelled this experience into active advocacy. She was a founding member of the women@cl network at Cambridge's Computer Laboratory, worked on outreach programmes aimed at encouraging girls into computing, and became widely known for her slogan: "Computing is too important to be left to men." She was the first woman ever to receive the BCS Lovelace Medal. === Honours and awards === These include: Gerard Salton Award (1988) Elected a Fellow of Association for the Advancement of Artificial Intelligence (AAAI) in 1993 President of the Association for Computational Linguistics (ACL) in 1994 Honorary degree of Doctor of Science from The City University in 1997. Elected a Fellow of the British Academy (FBA), where she also served as Vice-President in 2000–2002 Fellow of European Association for Artificial Intelligence (ECCAI) Association for Information Science and Technology (ASIS&T) Award of Merit (2002) Association for Computational Linguistics (ACL) Lifetime Achievement Award (2004) ACM - AAAI Allen Newell Award (2006) BCS Lovelace Medal (2007) Association for Computing Machinery (ACM) Women's Group Athena Award (2007) == Death and legacy == Spärck Jones died on 4 April 2007, due to cancer at the age of 71. In 2008, the BCS Information Retrieval Specialist Group (BCS IRSG) in conjunction with the British Computer Society established an annual Karen Spärck Jones Award in her honour, to encourage and promote research that advances understanding of Natural Language Processing or Information Retrieval. The Karen Spärck Jones lecture sponsored by BCS recognises the contribution that women have made to computing. In August 2017, the University of Huddersfield renamed one of its campus buildings in her honour. Formerly known as Canalside West, the Spärck Jones building houses the University's School of Computing and Engineering. When Spärck Jones died in 2007, The Times did not publish an obituary for her, despite having published one for her husband Roger Needham in 2003. In 2019, The New York Times included her in its Overlooked series under the title "Ove

    Read more →
  • The Best Free AI Presentation Maker for Beginners

    The Best Free AI Presentation Maker for Beginners

    Shopping for the best AI presentation maker? An AI presentation maker is software that uses machine learning to help you get more done — it keeps getting smarter as the underlying models improve. Pricing, accuracy, and the size of the model behind the tool are the three factors that most affect daily usefulness. Whether you are a beginner or a pro, the right AI presentation maker slots into your workflow and pays for itself fast. Below we compare features, pricing, and real output so you can choose with confidence.

    Read more →