AI Data Quality Tools

AI Data Quality Tools — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Wrike

    Wrike

    Wrike, Inc. is an American project management application service provider based in San Jose, California. Wrike also has offices in India, Dallas, Tallinn, Nicosia, Dublin, Tokyo, Melbourne, and Prague. == History == Wrike was founded in 2006 by Andrew Filev. Currently CEO at Wrike is Thomas Scott. Filev initially self-funded the company before later obtaining investor funding. Wrike released the beta version of its software (also called Wrike) in December 2006. The company then launched a new "Enterprise" platform in December 2013. In June 2015, Wrike announced the opening of an office in Dublin, Ireland and in 2016, Wrike launched a datacenter there to host data in compliance with local privacy regulations. In July 2016, Wrike announced the launch of Wrike for Marketers. That same year, Wrike's headquarters moved from Mountain View to San Jose, California. In January 2021, Citrix Systems announced its intention to acquire Wrike for $2.25 billion. The acquisition closed in March 2021. On January 31, 2022, it was announced that Citrix had been acquired in a $16.5 billion deal by affiliates of Vista Equity Partners and Evergreen Coast Capital. Citrix would merge with TIBCO Software, a Vista portfolio company to form Cloud Software Group (CSG). In September 2022, Wrike separated from Citrix Systems. In July 2023, Vista transferred ownership to Symphony Technology Group. == Investments == Wrike received $1 million in Angel funding in 2012 from TMT Investments. In October, 2013, Wrike secured $10 million in investment funding from Bain Capital. In May 2015, the company secured $15 million in a new round of funding. Investors included Scale Venture Partners, DCM Ventures, and Bain Capital. At that time, Wrike had 8,000 customers, 200 employees, and 30,000 new users each month. On November 29, 2018, Wrike signed a definitive agreement to receive a majority investment by Vista Equity Partners (“Vista”), a firm focused on software, data and technology-enabled businesses. == Software == The Wrike project management software is a Software-as-a-Service (SaaS) product with tools for managing projects, deadlines, schedules, and workflow processes. It includes collaboration features. The application is available in English, French, Spanish, German, Portuguese, Italian, Japanese and Russian. Wrike has triggers for task automation in workflow management. === Features === Wrike features a multi-pane UI and consists of features in two categories: project management, and team collaboration. According to Wrike, project management features are designed to help teams track dates and dependencies associated with projects, manage assignments and resources, and track time. These include an interactive Gantt chart, a workload view, and a sortable table that can be customized to store project data. The software includes a co-editing tool, discussion threads on tasks, and tools for attaching documents, editing them, and tracking their changes. Wrike uses an "inbox" feature and browser notifications to alert users of updates from their colleagues and dashboards for quick overviews of pending tasks. These updates are also available in Wrike's mobile apps on iOS and Android. Wrike has an optional feature set called "Wrike for Marketers" which has several tools for managing marketing workflows. In May 2012, Wrike announced the launch of a freemium version of its software for teams of up to 5 users. That year also saw the integration of a live text coeditor into its workspace to unify collaboration and task management. In late 2013 Wrike released a new feature set called Wrike Enterprise which included advanced analytics and other tools targeted at large business customers. Since then it has released several major updates to Wrike Enterprise, including a customizable spreadsheet called "Dynamic Platform" in late 2014 and custom workflows for teams in 2015. In July 2016, Wrike was updated with a set of add-on features under the name "Wrike for Marketers," which includes integrations with Adobe Photoshop, a tool for submitting requests, and proofing and approval tools for creative assets like videos and images. Wrike is available as native Android and iOS apps. Mobile apps include an interactive Gantt chart that syncs across devices. The apps are available offline, and sync when connection is restored. === Criticism === Critics said new users may have a learning curve with complex features. Wrike has 2,710 customers for an estimated 0.04% market share. Competitors include Google Workspace, Slack (software), and Quip (software).

    Read more →
  • Artificial intelligence industry in China

    Artificial intelligence industry in China

    The roots of the development of artificial intelligence in the People's Republic of China started in the late 1970s following Deng Xiaoping's reform and opening up emphasizing science and technology as the country's primary productive force. The initial stages of China's AI development were slow and encountered significant challenges due to lack of resources and talent. At the beginning China was behind most Western countries in terms of AI development. A majority of the research was led by scientists who had received higher education abroad. Since 2006, the Chinese government has steadily developed a national agenda for artificial intelligence development and emerged as one of the leading nations in artificial intelligence research and development. In 2016, the Chinese Communist Party (CCP) released its 13th Five-Year Plan in which it aimed to become a global AI leader by 2030. As of 2025, China is considered to be a world leader in AI technology along with the United States. The State Council has a list of "national AI teams" including fifteen China-based companies, including Baidu, Tencent, Alibaba, SenseTime, and iFlytek. Each company should lead the development of a designated specialized AI sector in China, such as facial recognition, software/hardware, and speech recognition. China's rapid AI development has significantly impacted Chinese society in many areas, including the socio-economic, military, intelligence, and political spheres. Agriculture, transportation, accommodation and food services, and manufacturing are the top industries that would be the most impacted by further AI deployment. The private sector, university laboratories, and the military are working collaboratively in many aspects as there are few current existing boundaries. In 2021, China published the Data Security Law of the People's Republic of China, its first national law addressing AI-related ethical concerns. In October 2022, the United States federal government announced a series of export controls and trade restrictions intended to restrict China's access to advanced computer chips for AI applications. In 2023, the Cyberspace Administration of China issued guidelines requiring that AI content upholds the ideology of the CCP including Core Socialist Values, avoids discrimination, respects intellectual property rights, and safeguards user data. In 2025, the Chinese government issued a document regarding training data, requiring companies to use as little as data deemed "unsafe" as possible, as well as requiring companies to test models regularly. Concerns have been raised about the effects of the Chinese government's censorship regime on the development of generative artificial intelligence and long-term talent acquisition with state of the country's demographics. Others have noted that official notions of AI safety require following the priorities of the CCP and are antithetical to standards in democratic societies and raised concerns about the extension of China's system of mass surveillance and censorship abroad. == History == The Chinese term for artificial intelligence (réngōngzhìnéng 人工智能) connotes "humanmade" intelligence. The term developed as mid-20th century localisation of the Japanese term jinko chino. The research and development of artificial intelligence in China started in the 1980s, with the announcement by Deng Xiaoping of the importance of science and technology for China's economic growth. === Late 1970s to early 2010s === Chinese artificial intelligence research and development began in late 1970s after Deng Xiaoping's reform and opening up. China's first national conference on AI occurred in 1979. Academic journals in the late 1970s began publishing literature reviews of Western research on AI topics. In the 1980s, a group of Chinese scientists launched AI research led by Qian Xuesen and Wu Wenjun. However, during the time, China's society still had a generally conservative view towards AI. In the early 1980s, Science Press published translated versions of Western textbooks such as Patrick Winston's Artificial Intelligence and Nils John Nilsson's Principles of Artificial Intelligence. In 1980, a journal of the Chinese Academy of Sciences convened its first annual National Symposium on Artificial Intelligence, which included national and international scholars like Herbert A. Simon. The Chinese Association for Artificial Intelligence (CAAI) was founded in September 1981 and was authorized by the Ministry of Civil Affairs. CAAI has continued to be the largest AI association in China as of 2025. In 1982, CAAI began publishing the Artificial Intelligence Journal, which published early AI research by Chinese academics. In the 1980s, Chinese research on AI was influenced by the field of cybernetics, particularly the work of Norbert Weiner and his text Cybernetics: Or Control and Communication in the Animal and the Machine. Chinese researchers at the time sought to situate AI as part of a broader "Intelligence Science" field which would include disciplines like mathematics, computer science, cognitive science, social sciences, and philosophy. In 1987, Tsinghua University began a research publication on AI. Beginning in 1993, smart automation and intelligence have been part of China's national technology plan. Since the 2000s, the Chinese government has further expanded its research and development funds for AI and the number of government-sponsored research projects has dramatically increased. In 2006, China announced a policy priority for the development of artificial intelligence, which was included in the National Medium and Long Term Plan for the Development of Science and Technology (2006–2020), released by the State Council. In the same year, artificial intelligence was also mentioned in the 11th Five-Year Plan. In 2011, the Association for the Advancement of Artificial Intelligence (AAAI) established a branch in Beijing, China. At same year, the Wu Wenjun Artificial Intelligence Science and Technology Award was founded in honor of Chinese mathematician Wu Wenjun, and it became the highest award for Chinese achievements in the field of artificial intelligence. The first award ceremony was held on May 14, 2012. In 2013, the International Joint Conferences on Artificial Intelligence (IJCAI) was held in Beijing, marking the first time the conference was held in China. This event coincided with the Chinese government's announcement of the "Chinese Intelligence Year," a significant milestone in China's development of artificial intelligence. === Late 2010s to early 2020s === AI became a major issue of commercial, public, and political focus in China in the latter half of the 2010s. Various interpretations of the primary cause for this increased focus exist, with some analyses focusing on the 2016 Go match between Google's AlphaGo and Lee Sedol, others emphasising the U.S. increasing trade restrictions on China's technology industries and the desire to achieve national technological self-sufficiency. The State Council of China issued "A Next Generation Artificial Intelligence Development Plan" (State Council Document [2017] No. 35) on 20 July 2017. In the document, the CCP Central Committee and the State Council urged governing bodies in China to promote the development of artificial intelligence. Specifically, the plan described AI as a strategic technology that has become a "focus of international competition".:2 The document urged significant investment in a number of strategic areas related to AI and called for close cooperation between the state and private sectors. It set the goal of China becoming the preeminent country for AI research and application by 2030. During the general secretaryship of Xi Jinping, artificial intelligence has been a focus of the CCP's military-civil fusion efforts. On the occasion of Xi's speech at the first plenary meeting of the Central Military-Civil Fusion Development Committee (CMCFDC), scholars from the National Defense University wrote in the PLA Daily that the "transferability of social resources" between economic and military ends is an essential component to being a great power. During the Two Sessions 2017,"artificial intelligence plus" was proposed to be elevated to a strategic level. The same year witnessed the emergence of multiple application-level usages in the medical field according to reports. In 2018, Xinhua News Agency, in partnership with Tencent's subsidiary Sogou, launched its first artificial intelligence-generated news anchor. In 2018, the State Council budgeted $2.1 billion for an AI industrial park in Mentougou district. In order to achieve this the State Council stated the need for massive talent acquisition, theoretical and practical developments, as well as public and private investments. Some of the stated motivations that the State Council gave for pursuing its AI strategy include the potential of artificial intelligence for industrial transformation, better social

    Read more →
  • Sequential algorithm

    Sequential algorithm

    In computer science, a sequential algorithm or serial algorithm is an algorithm that is executed sequentially – once through, from start to finish, without other processing executing – as opposed to concurrently or in parallel. The term is primarily used to contrast with concurrent algorithm or parallel algorithm; most standard computer algorithms are sequential algorithms, and not specifically identified as such, as sequentialness is a background assumption. Concurrency and parallelism are in general distinct concepts, but they often overlap – many distributed algorithms are both concurrent and parallel – and thus "sequential" is used to contrast with both, without distinguishing which one. If these need to be distinguished, the opposing pairs sequential/concurrent and serial/parallel may be used. "Sequential algorithm" may also refer specifically to an algorithm for decoding a convolutional code.

    Read more →
  • Adaptive algorithm

    Adaptive algorithm

    An adaptive algorithm is an algorithm that changes its behavior at the time it is run, based on information available and on a priori defined reward mechanism (or criterion). Such information could be the story of recently received data, information on the available computational resources, or other run-time acquired (or a priori known) information related to the environment in which it operates. Among the most used adaptive algorithms is the Widrow-Hoff’s least mean squares (LMS), which represents a class of stochastic gradient-descent algorithms used in adaptive filtering and machine learning. In adaptive filtering the LMS is used to mimic a desired filter by finding the filter coefficients that relate to producing the least mean square of the error signal (difference between the desired and the actual signal). For example, stable partition, using no additional memory is O(n lg n) but given O(n) memory, it can be O(n) in time. As implemented by the C++ Standard Library, stable_partition is adaptive and so it acquires as much memory as it can get (up to what it would need at most) and applies the algorithm using that available memory. Another example is adaptive sort, whose behavior changes upon the presortedness of its input. An example of an adaptive algorithm in radar systems is the constant false alarm rate (CFAR) detector. In machine learning and optimization, many algorithms are adaptive or have adaptive variants, which usually means that the algorithm parameters such as learning rate are automatically adjusted according to statistics about the optimisation thus far (e.g. the rate of convergence). Examples include adaptive simulated annealing, adaptive coordinate descent, adaptive quadrature, AdaBoost, Adagrad, Adadelta, RMSprop, and Adam. In data compression, adaptive coding algorithms such as Adaptive Huffman coding or Prediction by partial matching can take a stream of data as input, and adapt their compression technique based on the symbols that they have already encountered. In signal processing, the Adaptive Transform Acoustic Coding (ATRAC) codec used in MiniDisc recorders is called "adaptive" because the window length (the size of an audio "chunk") can change according to the nature of the sound being compressed, to try to achieve the best-sounding compression strategy.

    Read more →
  • Direct voice input

    Direct voice input

    Direct voice input (DVI), sometimes called voice input control (VIC), is a style of human–machine interaction "HMI" in which the user makes voice commands to issue instructions to the machine through speech recognition. In the field of military aviation, DVI has been introduced into the cockpits of several modern military aircraft, such as the Eurofighter Typhoon, the Lockheed Martin F-35 Lightning II, the Dassault Rafale, the KF-21 Boramae and the Saab JAS 39 Gripen. Such systems have also been used for various other purposes, including industry control systems and speech recognition assistance for impaired individuals. == Overview == DVI systems can be divided into two major categories of functionality: "user-dependent" or "user-independent". A user-dependent system requires that a personal voice template to be generated for a specific person; the template for this individual has to be loaded onto their assigned machine prior to use of the DVI system for it to function properly. In contrast, a user-independent system does not require any personal voice template, being intended to respond correctly to the voice of any user. They can also be categorised between "discrete recognition" and "continuous recognition". Users of a discrete recognition system must pause between each word so that the DVI system can identify the separations between each word, while a continuous speech recognition system is capable of understanding a normal rate of speech. During the mid-2000s, researchers at the National Aerospace Laboratory in the Netherlands examined the use of DVI in the "GRACE" simulator; a total of twelve pilots participated in the ensuing experiment. The tests performed reportedly revealed that, while the hardware itself functioned well, several improvements were desirable prior to real-world deployment on aircraft since DVI operations actually consumed more time in comparison to traditional existing methods. Recommendations for improvements included the adoption of simpler syntax, the achievement of a greater recognition rate, and a decrease in response times; all of the issues encountered were determined to be of a technological nature, and were deemed feasible to resolve. The researchers concluded that in cockpits, especially during emergencies where pilots have to operate entirely on their own, a DVI system could be highly relevant, but that it was not of crucial importance during most other conceivable scenarios. Around the same time, evaluations of DVI systems for civil aviation purposes were conducted within the framework of Project SafeSound, coordinated by the European Union. It involved the observation of pilot workloads in real-world cockpits and contrasting them against pilot activity in flight simulators using both conventional systems and DVI assistance. The project aimed to enhance aviation safety and to decrease the workload in both ground and flight operations via the application of enhanced audio functions. == Applications == === Aviation === Prior to its widespread deployment, a handful of conventional military aircraft were converted to trial DVI systems; examples include the Harrier AV-8B and F-16 VISTA. In another case, a General Dynamics F-16 Fighting Falcon simulator was modified with DVI for a voice control study that was undertaken by the Royal Netherlands Air Force. DVI trials have also been conducted on helicopters, including the Boeing AH-64 Apache, showing the potential to improve flight safety and mission effectiveness. Numerous modern fighter aircraft have been outfitted with DVI systems, often in combination with various other man-machine interface schemes, such as HOTAS-compliant controls and other advanced control technologies. The combination of Voice and HOTAS control schemes has sometimes been referred to as the "V-TAS" concept. A prominent fighter aircraft to be furnished with a V-TAS cockpit is the Eurofighter Typhoon. The Lockheed Martin F-35 Lightning II also features a DVI system, which was developed by Adacel. Other examples includes the Dassault Rafale and the Saab JAS 39 Gripen. Numerous aircraft have been planned to use DVI. At one stage, the United States Air Force had sought to integrate DVI upon the Lockheed Martin F-22 Raptor; however, the technology was eventually judged to pose too many technical risks at that point in time, and thus such efforts were abandoned. === Personal === By 1990, working prototypes of speech recognition systems were being demonstrated; these were being promoted for the purpose of providing an effective man-machine interface for individuals with impaired speech. Techniques employed included time-encoded digital speech and automatic token set selection. Investigations of these early DVI systems reportedly included the use of automatic diagnostic routines and limited-scale trials using volunteers. During the 2010s, various companies were offering voice recognition systems to the general public in the form of personal digital assistants. One example is the Google Voice service, which allows users to pose questions via a DVI package installed on either a personal computer, tablet, or mobile phone. Numerous digital assistants have been developed, such as Amazon Echo, Siri, and Cortana, that use DVI to interact with users. === Commercial === DVI technology has enabled automated telephone systems to be widely deployed. Many companies commonly use centralised phone systems that route callers to the correct department via such methods. Various car manufacturers have also furnished their road vehicles with DVI systems; these typically allow drivers to control infotainment systems and interact with mobile phones with more convenience than legacy methods. During the late 1980s, investigations into the use of DVI systems for controlling CNC machines and other manufacturing apparatus were underway. During the 2010s, such systems were being used for logistics and warehouse management purposes.

    Read more →
  • Sedona Canada Principles

    Sedona Canada Principles

    The Sedona Canada Principles are a set of authoritative guidelines published by The Sedona Conference to aid members of the Canadian legal community involved in the identification, collection, preservation, review and production of electronically stored information (ESI). The principles were drafted by a small group of lawyers, judges and technologists called the Sedona Working Group 7 or Sedona Canada. Sedona Canada is an offshoot of The Sedona Conference which is an American "non-profit ... research and educational institute dedicated to the advanced study of law and policy in the areas of antitrust law, complex litigation, and intellectual property rights". == Background == Civil procedure in Canada is jurisdictional with each province following its own rules of civil procedure. However, each province must address the fact that due to the advancement of technology the discovery process enshrined in the rules of civil procedure can be potentially derailed due to the sheer volume of electronically stored information (ESI). When dealing with litigation matters that involve electronically stored information (ESI), the discovery process is commonly called e-discovery. The problems associated with e-discovery in Canada led to the creation of the Sedona Canada Principles. Rule 29.1.03(4) of the wikibooks:Ontario Rules of Civil Procedure specifically refers to the Sedona Canada Principles in referencing Principles re Electronic Discovery although it has been reported that this rule has been largely ignored in practice. == Summary == The Sedona Canada Principles largely refer to the processes found in the Electronic Discovery Reference Model. The principles urge proportionality due to the potentially enormous volumes of documents that may be discoverable when dealing with ESI. They also encourage good faith in the document preservation stage and regular meetings between parties to discuss the scope of the litigation. Parties are urged to be aware of the potential costs involved in producing relevant ESI but are advised that only reasonably accessible ESI need be produced. The principles stipulate that parties should not be required to search for or collect deleted material unless there is an agreement or court order related to those terms. The use of electronic tools and processes such as data sampling and web harvesting are acceptable practices. Parties are encouraged to agree early in the litigation process on production format required for the exchange of relevant documents as part of the discovery process (native files, pdf, tiff, metadata requirements etc.). Agreements or direction should be sought, if necessary, with respect to privilege or other confidential information related to production of electronic documents and data. Parties should be aware that legal precedents can be formed as a result of e-discovery practices and sanctions can be considered for a party's failure to meet their discovery obligations unless it can be demonstrated that the failure was not intentional. All parties must bear the “reasonable” costs associated with e-discovery but other arrangements can be agreed upon by the parties or by court order. == Caselaw == In Warman v. National Post Company proportionality was at issue in a case where the plaintiff was suing the defendant for libel. A motion was brought by the defendant to have the plaintiff provide a mirror image of his hard drive in an effort to prove an internet article was indeed authored by the plaintiff. Issues of proportionality and the work of the Sedona Conference and Sedona Canada Principles were factored in to the decision to grant the defendant only limited access to the hard drive. In Innovative Health Group Inc. v. Calgary Health Region the plaintiff's legal obligation to produce imaged hard drives is in question. Justice Conrad refers to the advice of Sedona Canada on proportionality and problems associated with time and expense related to the difficulties associated with electronically stored information. In York University v. Michael Markicevic Justice Brown specifically refers to the need for the parties to agree upon a formal e-discovery plan to be drafted in consultation with Sedona Canada Principles. In Friends of Lansdowne v. Ottawa Master MacLeod refers to the need for Sedona Canada principles and states “This is particularly true in the current information age when e-mail is ubiquitous and multiple copies or variants of messages may be held on various kinds of data storage devices including individual hard drives, e-mail and Blackberry servers. Even documents that ultimately exist in paper form normally begin their life on computers and negotiations frequently involve exchanges of electronic drafts. To find every scrap of paper and every electronic trace of relevant information has become a nightmarish task that threatens to render any kind of litigation extravagantly expensive.” == Criticism == Critics of the Sedona Canada Principles believe they should address system integrity and that the true history of any file preserved cannot be identified without proof of the integrity of the electronic record systems management it comes from. Other criticism is more directed to the Sedona Canada working group and complaints that it is insular and irrelevant.

    Read more →
  • Archival bond

    Archival bond

    The archival bond is a concept in archival theory referring to the relationship that each archival record has with the other records produced as part of the same transaction or activity and located within the same grouping. These bonds are a core component of each individual record and are necessary for transforming a document into a record, as a document will only acquire meaning (and become a record) through its interrelationships with other records. == Description == The concept of the archival bond is primarily associated with the work of Luciana Duranti along with Heather MacNeil, as part of research into the integrity of electronic records. Duranti resumed and extended the concept of vincolo archivistico (archival bond), first expressed in 1937 by archivist Giorgio Cencetti of the Italian archival school. This bond emerges from the fact that electronic records are not physically arranged like traditional records. For traditional, analog records, their bond is implicit in their arrangement. But for electronic records, this bond must be made explicit due to the lack of a single sequential order of records in a digital environment. The archival bond was one of the core concepts of the subsequent International Research on Permanent Authentic Records in Electronic Systems (InterPARES) project and can be found in the InterPARES glossary. As Duranti notes, the archival bond is not to be confused with the broader term "context" as context exists independently of a record, while "the archival bond is an essential part of the record, which would not exist without it."

    Read more →
  • Communication-avoiding algorithm

    Communication-avoiding algorithm

    Communication-avoiding algorithms minimize movement of data within a memory hierarchy for improving its running-time and energy consumption. These minimize the total of two costs (in terms of time and energy): arithmetic and communication. Communication, in this context refers to moving data, either between levels of memory or between multiple processors over a network. It is much more expensive than arithmetic. == Formal theory == === Two-level memory model === A common computational model in analyzing communication-avoiding algorithms is the two-level memory model: There is one processor and two levels of memory. Level 1 memory is infinitely large. Level 0 memory ("cache") has size M {\displaystyle M} . In the beginning, input resides in level 1. In the end, the output resides in level 1. Processor can only operate on data in cache. The goal is to minimize data transfers between the two levels of memory. === Matrix multiplication === Corollary 6.2: More general results for other numerical linear algebra operations can be found in. The following proof is from. == Motivation == Consider the following running-time model: Measure of computation = Time per FLOP = γ Measure of communication = No. of words of data moved = β ⇒ Total running time = γ·(no. of FLOPs) + β·(no. of words) From the fact that β >> γ as measured in time and energy, communication cost dominates computation cost. Technological trends indicate that the relative cost of communication is increasing on a variety of platforms, from cloud computing to supercomputers to mobile devices. The report also predicts that gap between DRAM access time and FLOPs will increase 100× over coming decade to balance power usage between processors and DRAM. Energy consumption increases by orders of magnitude as we go higher in the memory hierarchy. United States president Barack Obama cited communication-avoiding algorithms in the FY 2012 Department of Energy budget request to Congress: New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating-point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems. == Objectives == Communication-avoiding algorithms are designed with the following objectives: Reorganize algorithms to reduce communication across all memory hierarchies. Attain the lower-bound on communication when possible. The following simple example demonstrates how these are achieved. === Matrix multiplication example === Let A, B and C be square matrices of order n × n. The following naive algorithm implements C = C + A B: for i = 1 to n for j = 1 to n for k = 1 to n C(i,j) = C(i,j) + A(i,k) B(k,j) Arithmetic cost (time-complexity): n2(2n − 1) for sufficiently large n or O(n3). Rewriting this algorithm with communication cost labelled at each step for i = 1 to n {read row i of A into fast memory} - n2 reads for j = 1 to n {read C(i,j) into fast memory} - n2 reads {read column j of B into fast memory} - n3 reads for k = 1 to n C(i,j) = C(i,j) + A(i,k) B(k,j) {write C(i,j) back to slow memory} - n2 writes Fast memory may be defined as the local processor memory (CPU cache) of size M and slow memory may be defined as the DRAM. Communication cost (reads/writes): n3 + 3n2 or O(n3) Since total running time = γ·O(n3) + β·O(n3) and β >> γ the communication cost is dominant. The blocked (tiled) matrix multiplication algorithm reduces this dominant term: ==== Blocked (tiled) matrix multiplication ==== Consider A, B and C to be n/b-by-n/b matrices of b-by-b sub-blocks where b is called the block size; assume three b-by-b blocks fit in fast memory. for i = 1 to n/b for j = 1 to n/b {read block C(i,j) into fast memory} - b2 × (n/b)2 = n2 reads for k = 1 to n/b {read block A(i,k) into fast memory} - b2 × (n/b)3 = n3/b reads {read block B(k,j) into fast memory} - b2 × (n/b)3 = n3/b reads C(i,j) = C(i,j) + A(i,k) B(k,j) - {do a matrix multiply on blocks} {write block C(i,j) back to slow memory} - b2 × (n/b)2 = n2 writes Communication cost: 2n3/b + 2n2 reads/writes << 2n3 arithmetic cost Making b as large possible: 3b2 ≤ M we achieve the following communication lower bound: 31/2n3/M1/2 + 2n2 or Ω (no. of FLOPs / M1/2) == Previous approaches for reducing communication == Most of the approaches investigated in the past to address this problem rely on scheduling or tuning techniques that aim at overlapping communication with computation. However, this approach can lead to an improvement of at most a factor of two. Ghosting is a different technique for reducing communication, in which a processor stores and computes redundantly data from neighboring processors for future computations. Cache-oblivious algorithms represent a different approach introduced in 1999 for fast Fourier transforms, and then extended to graph algorithms, dynamic programming, etc. They were also applied to several operations in linear algebra as dense LU and QR factorizations. The design of architecture specific algorithms is another approach that can be used for reducing the communication in parallel algorithms, and there are many examples in the literature of algorithms that are adapted to a given communication topology.

    Read more →
  • Computational semantics

    Computational semantics

    Computational semantics is a subfield of computational linguistics. Its goal is to elucidate the cognitive mechanisms supporting the generation and interpretation of meaning in humans. It usually involves the creation of computational models that simulate particular semantic phenomena, and the evaluation of those models against data from human participants. While computational semantics is a scientific field, it has many applications in real-world settings and substantially overlaps with Artificial Intelligence. Broadly speaking, the discipline can be subdivided into areas that mirror the internal organization of linguistics. For example, lexical semantics and frame semantics have active research communities within computational linguistics. Some popular methodologies are also strongly inspired by traditional linguistics. Most prominently, the area of distributional semantics, which underpins investigations into embeddings and the internals of Large Language Models, has roots in the work of Zellig Harris. Some traditional topics of interest in computational semantics are: construction of meaning representations, semantic underspecification, anaphora resolution, presupposition projection, and quantifier scope resolution. Methods employed usually draw from formal semantics or statistical semantics. Computational semantics has points of contact with the areas of lexical semantics (word-sense disambiguation and semantic role labeling), discourse semantics, knowledge representation and automated reasoning (in particular, automated theorem proving). Since 1999 there has been an ACL special interest group on computational semantics, SIGSEM.

    Read more →
  • Basic Formal Ontology

    Basic Formal Ontology

    Basic Formal Ontology (BFO) is a top-level ontology developed by Barry Smith and colleagues to promote interoperability among domain ontologies. The BFO methodology accomplishes this through a process of downward population. BFO is a formal ontology. The structure of BFO is based on a division of entities into two disjoint categories of continuant and occurrent, the former consists of objects and spatial regions, the latter contains processes conceived as extended through (or spanning) time. BFO thereby seeks to consolidate both time and space within a single framework A guide to building BFO-conformant domain ontologies was published by MIT Press in 2015. In 2021, the standard ISO/IEC 21838-2:2021 Information Technology — Top-level Ontologies (TLO) — Part 2: Basic Formal Ontology (BFO) was published by the Joint Technical Committee of the International Standards Organization and the International Electrotechnical Commission. ISO/IEC 21838 is a multi-part standard. Part 1 of the standard specifies the requirements that must be met if an ontology is to be classified as a top-level ontology by the standard. == History == BFO arose against the background of research in ontologies in the domain of geospatial information science by David Mark, Pierre Grenon, Achille Varzi and others, with a special role for the study of vagueness and of the ways sharp boundaries in the geospatial and other domains are created by fiat. BFO has passed through four major releases. 2001: release of BFO 1 2007: release of BFO 1.1 2015: release of BFO 2.0 2020: release of BFO 2020 2021: release of BFO 2020 as an ISO/IEC Standard The current revision was released in 2020, and this forms the basis of the standard ISO/IEC 21838-2, which was released by the Joint Committee of the International Standards Organization and International Electrotechnical Commission in 2021. == Applications == BFO has been adopted as a foundational ontology by over 650 ontology projects, principally in the areas of biomedical ontology, security and defense (intelligence) ontology, and industry ontologies. Example applications of BFO can be seen in the Ontology for Biomedical Investigations (OBI). In January 2024, BFO and the Common Core Ontologies (CCO), a suite of BFO-extension ontologies, were adopted as the "baseline standards for formal DOD and IC ontology" development work in the DOD and Intelligence Community. A memorandum to this effect was signed by the chief data officers of the DOD, the Office of the Director of National Intelligence and the Chief Digital and Artificial Intelligence Office.

    Read more →
  • Information

    Information

    Information is an abstract concept that refers to something which has the power to inform. At the most fundamental level, it pertains to the interpretation (perhaps formally) of that which may be sensed, or their abstractions. Any natural process that is not completely random and any observable pattern in any medium can be said to convey some amount of information. Whereas digital signals and other data use discrete signs to convey information, other phenomena and artifacts such as analogue signals, poems, pictures, music or other sounds, and currents convey information in a more continuous form. Information is not knowledge itself, but the meaning that may be derived from a representation through interpretation. The concept of information is relevant to and connected with various concepts, including constraint, communication, control, data, form, education, knowledge, meaning, understanding, mental stimuli, pattern, perception, proposition, representation, and entropy. Information is often processed iteratively: Data available at one step are processed into information to be interpreted and processed at the next step. For example, in written text each symbol or letter conveys information relevant to the word it is part of, each word conveys information relevant to the phrase it is part of, each phrase conveys information relevant to the sentence it is part of, and so on until at the final step information is interpreted and becomes knowledge in a given domain. In a digital signal, bits may be interpreted into the symbols, letters, numbers, or structures that convey the information available at the next level up. The key characteristic of information is that it is subject to interpretation and processing. The derivation of information from a signal or message may be thought of as the resolution of ambiguity or uncertainty that arises during the interpretation of patterns within the signal or message. Information may be structured as data. Redundant data can be compressed up to an optimal size, which is the theoretical limit of compression. The information available through a collection of data may be derived by analysis. For example, a restaurant collects data from every customer order. That information may be analyzed to produce knowledge that is put to use when the business subsequently wants to identify the most popular or least popular dish. Information can be transmitted in time, via data storage, and space, via communication and telecommunication. Information is expressed either as the content of a message or through direct or indirect observation. That which is perceived can be construed as a message in its own right, and in that sense, all information is always conveyed as the content of a message. Information can be encoded into various forms for transmission and interpretation (for example, information may be encoded into a sequence of signs, or transmitted via a signal). It can also be encrypted for safe storage and communication. The uncertainty of an event is measured by its probability of occurrence. Uncertainty is proportional to the negative logarithm of the probability of occurrence. Information theory takes advantage of this by concluding that more uncertain events require more information to resolve their uncertainty. The bit is the standard unit of information. It is 'that which reduces uncertainty by half'. Other units such as the nat may be used. For example, the information encoded in one "fair" coin flip is log2(2/1) = 1 bit, and in two fair coin flips is log2(4/1) = 2 bits. A 2011 Science article estimates that 97% of technologically stored information was already in digital bits in 2007 and that the year 2002 was the beginning of the digital age for information storage (with digital storage capacity bypassing analogue for the first time). == Etymology and history of the concept == The English word "information" comes from Middle French enformacion/informacion/information 'a criminal investigation' and its etymon, Latin informatiō(n) 'conception, teaching, creation'. In English, "information" is an uncountable mass noun. References on "formation or molding of the mind or character, training, instruction, teaching" date from the 14th century in both English (according to Oxford English Dictionary) and other European languages. In the transition from Middle Ages to Modernity the use of the concept of information reflected a fundamental turn in epistemological basis – from "giving a (substantial) form to matter" to "communicating something to someone". Peters (1988, pp. 12–13) concludes: Information was readily deployed in empiricist psychology (though it played a less important role than other words such as impression or idea) because it seemed to describe the mechanics of sensation: objects in the world inform the senses. But sensation is entirely different from "form" – the one is sensual, the other intellectual; the one is subjective, the other objective. My sensation of things is fleeting, elusive, and idiosyncratic. For Hume, especially, sensory experience is a swirl of impressions cut off from any sure link to the real world... In any case, the empiricist problematic was how the mind is informed by sensations of the world. At first informed meant shaped by; later it came to mean received reports from. As its site of action drifted from cosmos to consciousness, the term's sense shifted from unities (Aristotle's forms) to units (of sensation). Information came less and less to refer to internal ordering or formation, since empiricism allowed for no preexisting intellectual forms outside of sensation itself. Instead, information came to refer to the fragmentary, fluctuating, haphazard stuff of sense. Information, like the early modern worldview in general, shifted from a divinely ordered cosmos to a system governed by the motion of corpuscles. Under the tutelage of empiricism, information gradually moved from structure to stuff, from form to substance, from intellectual order to sensory impulses. In the modern era, the most important influence on the concept of information is derived from the Information theory developed by Claude Shannon and others. This theory, however, reflects a fundamental contradiction. Northrup (1993) wrote: Thus, actually two conflicting metaphors are being used: The well-known metaphor of information as a quantity, like water in the water-pipe, is at work, but so is a second metaphor, that of information as a choice, a choice made by :an information provider, and a forced choice made by an :information receiver. Actually, the second metaphor implies that the information sent isn't necessarily equal to the information received, because any choice implies a comparison with a list of possibilities, i.e., a list of possible meanings. Here, meaning is involved, thus spoiling the idea of information as a pure "Ding an sich." Thus, much of the confusion regarding the concept of information seems to be related to the basic confusion of metaphors in Shannon's theory: is information an autonomous quantity, or is information always per SE information to an observer? Actually, I don't think that Shannon himself chose one of the two definitions. Logically speaking, his theory implied information as a subjective phenomenon. But this had so wide-ranging epistemological impacts that Shannon didn't seem to fully realize this logical fact. Consequently, he continued to use metaphors about information as if it were an objective substance. This is the basic, inherent contradiction in Shannon's information theory." (Northrup, 1993, p. 5). In their seminal book The Study of Information: Interdisciplinary Messages, Almach and Mansfield (1983) collected key views on the interdisciplinary controversy in computer science, artificial intelligence, library and information science, linguistics, psychology, and physics, as well as in the social sciences. Almach (1983, p. 660) himself disagrees with the use of the concept of information in the context of signal transmission, the basic senses of information in his view all referring "to telling something or to the something that is being told. Information is addressed to human minds and is received by human minds." All other senses, including its use with regard to nonhuman organisms as well to society as a whole, are, according to Machlup, metaphoric and, as in the case of cybernetics, anthropomorphic. Hjørland (2007) describes the fundamental difference between objective and subjective views of information and argues that the subjective view has been supported by, among others, Bateson, Yovits, Span-Hansen, Brier, Buckland, Goguen, and Hjørland. Hjørland provided the following example: A stone on a field could contain different information for different people (or from one situation to another). It is not possible for information systems to map all the stone's possible information for every individual. Nor is any one mapping the one "true" mapping. But peop

    Read more →
  • Kullback–Leibler Upper Confidence Bound

    Kullback–Leibler Upper Confidence Bound

    In multi-armed bandit problems, KL-UCB (for Kullback–Leibler Upper Confidence Bound) is a UCB-type algorithm that is asymptotically optimal, in the sense that its regret matches the problem-dependent Lai-Robbins lower bound. == Multi-armed bandit problem == The Multi-armed bandit problem is a sequential game where one player has to choose at each turn between K {\displaystyle K} actions (arms). Behind every arm a {\displaystyle a} there is an unknown distribution ν a {\displaystyle \nu _{a}} that lies in a set D {\displaystyle {\mathcal {D}}} known by the player (for example, D {\displaystyle {\mathcal {D}}} can be the set of Gaussian distributions or Bernoulli distributions). At each turn t {\displaystyle t} the player chooses (pulls) an arm a t {\displaystyle a_{t}} , he then gets an observation X t {\displaystyle X_{t}} of the distribution ν a t {\displaystyle \nu _{a_{t}}} . === Regret minimization === The goal is to minimize the regret at time T {\displaystyle T} that is defined as R T := ∑ a = 1 K Δ a E [ N a ( T ) ] {\displaystyle R_{T}:=\sum _{a=1}^{K}\Delta _{a}\mathbb {E} [N_{a}(T)]} where μ a := E [ ν a ] {\displaystyle \mu _{a}:=\mathbb {E} [\nu _{a}]} is the mean of arm a {\displaystyle a} μ ∗ := max a μ a {\displaystyle \mu ^{}:=\max _{a}\mu _{a}} is the highest mean Δ a := μ ∗ − μ a {\displaystyle \Delta _{a}:=\mu ^{}-\mu _{a}} N a ( t ) {\displaystyle N_{a}(t)} is the number of pulls of arm a {\displaystyle a} up to turn t {\displaystyle t} The player has to find an algorithm that chooses at each turn t {\displaystyle t} which arm to pull based on the previous actions and observations ( a s , X s ) s < t {\displaystyle (a_{s},X_{s})_{s μ } {\displaystyle {\mathcal {K}}_{inf}(\nu ,\mu ,{\mathcal {D}}):=\inf \left\{\mathrm {KL} (\nu ,{\tilde {\nu }})\ |\ {\tilde {\nu }}\in {\mathcal {D}},\ \mathbb {E} [{\tilde {\nu }}]>\mu \right\}} K L {\displaystyle \mathrm {KL} } is the Kullback–Leibler divergence ν ^ a ( t ) {\displaystyle {\hat {\nu }}_{a}(t)} is the empirical distribution of arm a {\displaystyle a} at turn t {\displaystyle t} δ t {\displaystyle \delta _{t}} is a well-chosen sequence of positive numbers, often equal to ln ⁡ t + c ln ⁡ ln ⁡ t {\displaystyle \ln t+c\ln \ln t} with c > 0 {\displaystyle c>0} . Then we choose the arm a t {\displaystyle a_{t}} with the highest index: a t := arg ⁡ max a U a ( t ) {\displaystyle a_{t}:=\arg \max _{a}U_{a}(t)} We note that the algorithm does not require knowledge of T {\displaystyle T} . === Example === In the special case of Gaussian distribution with fixed variance σ 2 {\displaystyle \sigma ^{2}} , we have: U a ( t ) = μ ^ a ( t ) + 2 σ 2 δ t N a ( t ) {\displaystyle U_{a}(t)={\hat {\mu }}_{a}(t)+{\sqrt {\frac {2\sigma ^{2}\delta _{t}}{N_{a}(t)}}}} with μ ^ a ( t ) {\displaystyle {\hat {\mu }}_{a}(t)} being the empirical mean of arm a {\displaystyle a} at turn t {\displaystyle t} . === Pseudocode === The player gets the set D for each arm i do: n[i] ← 1; nu[i] ← None; d ← ln(K) for t from 1 to K do: select arm t observe reward r n[t] ← n[t] + 1 nu[t] ← update empirical distribution for t from K+1 to T do: for each arm i do: index[i] ← compute_index(n[i], nu[i], D, d) select arm a with highest index[a] observe reward r n[a] ← n[a] + 1 nu[a] ← update empirical distribution d ← ln(t+1) == Theoretical results == In the multi-armed bandit problem we have the Lai–Robbins asymptotic lower bound on regret. The algorithm KL-UCB matches this lower bound for one-dimensional exponential families with δ t := ln ⁡ t + 3 ln ⁡ ln ⁡ t {\displaystyle \delta _{t}:=\ln t+3\ln \ln t} and for distributions bounded in [ 0 , 1 ] {\displaystyle [0,1]} with δ t := ln ⁡ t + ln ⁡ ln ⁡ t {\displaystyle \delta _{t}:=\ln t+\ln \ln t} . === Lai–Robbins lower bound === In 1985 Lai and Robbins proved an asymptotic, problem-dependent lower bound on regret. It states that for every consistent algorithm on the set D {\displaystyle {\mathcal {D}}} — that is, an algorithm for which, for every ( ν 1 , … , ν K ) ∈ D K {\displaystyle (\nu _{1},\dots ,\nu _{K})\in {\mathcal {D}}^{K}} , the regret R T {\displaystyle R_{T}} is subpolynomial (i.e. R T = o T → + ∞ ( T α ) {\displaystyle R_{T}=o_{T\to +\infty }(T^{\alpha })} for all α > 0 {\displaystyle \alpha >0} ) — we have: R T ≥ ( ∑ a : μ a < μ ∗ Δ a K inf ( ν a , μ ∗ , D ) ) ln ⁡ T + o T → + ∞ ( ln ⁡ T ) . {\displaystyle R_{T}\geq \left(\sum _{a:\mu _{a}<\mu ^{}}{\frac {\Delta _{a}}{{\mathcal {K}}_{\inf }(\nu _{a},\mu ^{},{\mathcal {D}})}}\right)\ln T+o_{T\to +\infty }(\ln T).} This bound is asymptotic (as T → + ∞ {\displaystyle T\to +\infty } ) and gives a first-order lower bound of order ln ⁡ T {\displaystyle \ln T} with the optimal constant in front of it. === Regret bound for KL-UCB === The algorithm matches the Lai–Robbins lower bound for one-dimensional exponential-family distributions and for distributions bounded in [ 0 , 1 ] {\displaystyle [0,1]} . ==== One-dimensional exponential family ==== For D {\displaystyle {\mathcal {D}}} being the set of one-dimensional exponential families, with δ t := ln ⁡ t + 3 ln ⁡ ln ⁡ t {\displaystyle \delta _{t}:=\ln t+3\ln \ln t} we have the following upper bound on the regret of KL-UCB: R T ≤ ( ∑ a : μ a < μ ∗ Δ a K inf ( ν a , μ ∗ , D ) ) ln ⁡ T + O T ( ln ⁡ T ) . {\displaystyle R_{T}\leq \left(\sum _{a:\mu _{a}<\mu ^{}}{\frac {\Delta _{a}}{{\mathcal {K}}_{\inf }(\nu _{a},\mu ^{},{\mathcal {D}})}}\right)\ln T+O_{T}({\sqrt {\ln T}}).} ==== Bounded distributions in [0,1] ==== For D = P ( [ 0 , 1 ] ) {\displaystyle {\mathcal {D}}={\mathcal {P}}([0,1])} (the set of distributions supported on [ 0 , 1 ] {\displaystyle [0,1]} ), and for δ t := ln ⁡ t + ln ⁡ ln ⁡ t {\displaystyle \delta _{t}:=\ln t+\ln \ln t} , we have the following upper bound on the regret of KL-UCB: R T ≤ ( ∑ a : μ a < μ ∗ Δ a K inf ( ν a , μ ∗ , D ) ) ln ⁡ T + O T ( ( ln ⁡ T ) 4 / 5 ln ⁡ ln ⁡ T ) . {\displaystyle R_{T}\leq \left(\sum _{a:\mu _{a}<\mu ^{}}{\frac {\Delta _{a}}{{\mathcal {K}}_{\inf }(\nu _{a},\mu ^{},{\mathcal {D}})}}\right)\ln T+O_{T}{\big (}(\ln T)^{4/5}\ln \ln T{\big )}.} === Runtime === For D = P ( [ 0 , 1 ] ) {\displaystyle {\mathcal {D}}={\mathcal {P}}([0,1])} , the runtime needed per step and for an arm k {\displaystyle k} with n {\displaystyle n} observations is O ( n ( ln ⁡ n ) 2 ) {\displaystyle {\mathcal {O}}{\big (}n(\ln n)^{2}{\big )}} . This is higher than that of other optimal algorithms, such as NPTS with O ( n ) {\displaystyle {\mathcal {O}}(n)} . MED with O ( n ln ⁡ n ) {\displaystyle {\mathcal {O}}(n\ln n)} . and IMED with O ( n ln ⁡ n ) {\displaystyle {\mathcal {O}}(n\ln n)} . The high runtime of KL-UCB is due to a two-level optimisation: for each arm and candidate mean μ {\displaystyle \mu } , the algorithm evaluates K inf ( ν ^ a ( t ) , μ , D ) {\displaystyle {\mathcal {K}}_{\inf }({\hat {\nu }}_{a}(t),\mu ,{\mathcal {D}})} and then maximises μ {\displaystyle \mu } subject to N a ( t ) K inf ( ν ^ a ( t ) , μ , D ) ≤ δ t {\displaystyle N_{a}(t)\,{\mathcal {K}}_{\inf }({\hat {\nu }}_{a}(t),\mu ,{\mathcal {D}})\leq \delta _{t}} . For distributions bounded in [ 0 , 1 ] {\displaystyle [0,1]} the inner problem has no closed form and must be solved numerically, which increases the per-step cost.

    Read more →
  • Data Science and Predictive Analytics

    Data Science and Predictive Analytics

    The first edition of the textbook Data Science and Predictive Analytics: Biomedical and Health Applications using R, authored by Ivo D. Dinov, was published in August 2018 by Springer. The second edition of the book was printed in 2023. This textbook covers some of the core mathematical foundations, computational techniques, and artificial intelligence approaches used in data science research and applications. By using the statistical computing platform R and a broad range of biomedical case-studies, the 23 chapters of the book first edition provide explicit examples of importing, exporting, processing, modeling, visualizing, and interpreting large, multivariate, incomplete, heterogeneous, longitudinal, and incomplete datasets (big data). == Structure == === First edition table of contents === The first edition of the Data Science and Predictive Analytics (DSPA) textbook is divided into the following 23 chapters, each progressively building on the previous content. === Second edition table of contents === The significantly reorganized revised edition of the book (2023) expands and modernizes the presented mathematical principles, computational methods, data science techniques, model-based machine learning and model-free artificial intelligence algorithms. The 14 chapters of the new edition start with an introduction and progressively build foundational skills to naturally reach biomedical applications of deep learning. Introduction Basic Visualization and Exploratory Data Analytics Linear Algebra, Matrix Computing, and Regression Modeling Linear and Nonlinear Dimensionality Reduction Supervised Classification Black Box Machine Learning Methods Qualitative Learning Methods—Text Mining, Natural Language Processing, and Apriori Association Rules Learning Unsupervised Clustering Model Performance Assessment, Validation, and Improvement Specialized Machine Learning Topics Variable Importance and Feature Selection Big Longitudinal Data Analysis Function Optimization Deep Learning, Neural Networks == Reception == The materials in the Data Science and Predictive Analytics (DSPA) textbook have been peer-reviewed in the Journal of the American Statistical Association, International Statistical Institute’s ISI Review Journal, and the Journal of the American Library Association. Many scholarly publications reference the DSPA textbook. As of January 17, 2021, the electronic version of the book first edition (ISBN 978-3-319-72347-1) is freely available on SpringerLink and has been downloaded over 6 million times. The textbook is globally available in print (hardcover and softcover) and electronic formats (PDF and EPub) in many college and university libraries and has been used for data science, computational statistics, and analytics classes at various institutions.

    Read more →
  • Bibliographic database

    Bibliographic database

    A bibliographic database is a database of bibliographic records. This is an organised online collection of references to published written works like journal and newspaper articles, conference proceedings, reports, government and legal publications, patents and books. In contrast to library catalogue entries, a majority of the records in bibliographic databases describe articles and conference papers rather than complete monographs, and they generally contain very rich subject descriptions in the form of keywords, subject classification terms, or abstracts. A bibliographic database may cover a wide range of topics or one academic field like computer science. A significant number of bibliographic databases are marketed under a trade name by licensing agreement from vendors, or directly from their makers: the indexing and abstracting services. Many bibliographic databases have evolved into digital libraries, providing the full text of the organised contents:for instance CORE also organises and mirrors scholarly articles and OurResearch develops a search engine for open access content in Unpaywall. Others merge with non-bibliographic and scholarly databases to create more complete disciplinary search engine systems, such as Chemical Abstracts or Entrez. == History == Prior to the mid-20th century, individuals searching for published literature had to rely on printed bibliographic indexes, generated manually from index cards. During the early 1960s computers were used to digitize text for the first time; the purpose was to reduce the cost and time required to publish two American abstracting journals, the Index Medicus of the National Library of Medicine and the Scientific and Technical Aerospace Reports of the National Aeronautics and Space Administration (NASA). By the late 1960s, such bodies of digitized alphanumeric information, known as bibliographic and numeric databases, constituted a new type of information resource. Online interactive retrieval became commercially viable in the early 1970s over private telecommunications networks. The first services offered a few databases of indexes and abstracts of scholarly literature. These databases contained bibliographic descriptions of journal articles that were searchable by keywords in author and title, and sometimes by journal name or subject heading. The user interfaces were crude, the access was expensive, and searching was done by librarians on behalf of "end users".

    Read more →
  • Token-based replay

    Token-based replay

    Token-based replay technique is a conformance checking algorithm that checks how well a process conforms with its model by replaying each trace on the model (in Petri net notation ). Using the four counters produced tokens, consumed tokens, missing tokens, and remaining tokens, it records the situations where a transition is forced to fire and the remaining tokens after the replay ends. Based on the count at each counter, we can compute the fitness value between the trace and the model. == The algorithm == Source: The token-replay technique uses four counters to keep track of a trace during the replaying: p: Produced tokens c: Consumed tokens m: Missing tokens (consumed while not there) r: Remaining tokens (produced but not consumed) Invariants: At any time: p + m ≥ c ≥ m {\displaystyle p+m\geq c\geq m} At the end: r = p + m − c {\displaystyle r=p+m-c} At the beginning, a token is produced for the source place (p = 1) and at the end, a token is consumed from the sink place (c' = c + 1). When the replay ends, the fitness value can be computed as follows: 1 2 ( 1 − m c ) + 1 2 ( 1 − r p ) {\displaystyle {\frac {1}{2}}(1-{\frac {m}{c}})+{\frac {1}{2}}(1-{\frac {r}{p}})} == Example == Suppose there is a process model in Petri net notation as follows: === Example 1: Replay the trace (a, b, c, d) on the model M === Step 1: A token is initiated. There is one produced token ( p = 1 {\displaystyle p=1} ). Step 2: The activity a {\displaystyle \mathbf {a} } consumes 1 token to be fired and produces 2 tokens ( p = 1 + 2 = 3 {\displaystyle p=1+2=3} and c = 1 {\displaystyle c=1} ). Step 3: The activity b {\displaystyle \mathbf {b} } consumes 1 token and produces 1 token ( p = 3 + 1 = 4 {\displaystyle p=3+1=4} and c = 1 + 1 = 2 {\displaystyle c=1+1=2} ). Step 4: The activity c {\displaystyle \mathbf {c} } consumes 1 token and produces 1 token ( p = 4 + 1 = 5 {\displaystyle p=4+1=5} and c = 2 + 1 = 3 {\displaystyle c=2+1=3} ). Step 5: The activity d {\displaystyle \mathbf {d} } consumes 2 tokens and produces 1 token ( p = 5 + 1 = 6 {\displaystyle p=5+1=6} , c = 3 + 2 = 5 {\displaystyle c=3+2=5} ). Step 6: The token at the end place is consumed ( c = 5 + 1 = 6 {\displaystyle c=5+1=6} ). The trace is complete. The fitness of the trace ( a , b , c , d {\displaystyle \mathbf {a,b,c,d} } ) on the model M {\displaystyle \mathbf {M} } is: 1 2 ( 1 − m c ) + 1 2 ( 1 − r p ) = 1 2 ( 1 − 0 6 ) + 1 2 ( 1 − 0 6 ) = 1 {\displaystyle {\frac {1}{2}}(1-{\frac {m}{c}})+{\frac {1}{2}}(1-{\frac {r}{p}})={\frac {1}{2}}(1-{\frac {0}{6}})+{\frac {1}{2}}(1-{\frac {0}{6}})=1} === Example 2: Replay the trace (a, b, d) on the model M === Step 1: A token is initiated. There is one produced token ( p = 1 {\displaystyle p=1} ). Step 2: The activity a {\displaystyle \mathbf {a} } consumes 1 token to be fired and produces 2 tokens ( p = 1 + 2 = 3 {\displaystyle p=1+2=3} and c = 1 {\displaystyle c=1} ). Step 3: The activity b {\displaystyle \mathbf {b} } consumes 1 token and produces 1 token ( p = 3 + 1 = 4 {\displaystyle p=3+1=4} and c = 1 + 1 = 2 {\displaystyle c=1+1=2} ). Step 4: The activity d {\displaystyle \mathbf {d} } needs to be fired but there are not enough tokens. One artificial token was produced and the missing token counter is increased by one ( m = 1 {\displaystyle m=1} ). The artificial token and the token at place [ b , d ] {\displaystyle [\mathbf {b,d} ]} are consumed ( c = 2 + 2 = 4 {\displaystyle c=2+2=4} ) and one token is produced at place end ( p = 4 + 1 = 5 {\displaystyle p=4+1=5} ). Step 5: The token in the end place is consumed ( c = 4 + 1 = 5 {\displaystyle c=4+1=5} ). The trace is complete. There is one remaining token at place [ a , c ] {\displaystyle [\mathbf {a,c} ]} ( r = 1 {\displaystyle r=1} ). The fitness of the trace ( a , b , d {\displaystyle \mathbf {a,b,d} } ) on the model M {\displaystyle \mathbf {M} } is: 1 2 ( 1 − m c ) + 1 2 ( 1 − r p ) = 1 2 ( 1 − 1 5 ) + 1 2 ( 1 − 1 5 ) = 0.8 {\displaystyle {\frac {1}{2}}(1-{\frac {m}{c}})+{\frac {1}{2}}(1-{\frac {r}{p}})={\frac {1}{2}}(1-{\frac {1}{5}})+{\frac {1}{2}}(1-{\frac {1}{5}})=0.8}

    Read more →