AI Chatbot Interface

AI Chatbot Interface — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Controlled natural language

    Controlled natural language

    Controlled natural languages (CNLs) are subsets of natural languages that are obtained by restricting the grammar and vocabulary in order to reduce or eliminate ambiguity and complexity. Traditionally, controlled languages fall into two major types: those that improve readability for human readers (e.g. non-native speakers), and those that enable reliable automatic semantic analysis of the language. The first type of languages (often called "simplified" or "technical" languages), for example ASD Simplified Technical English, Caterpillar Technical English, IBM's Easy English, are used in the industry to increase the quality of technical documentation, and possibly simplify the semi-automatic translation of the documentation. These languages restrict the writer by general rules such as "Keep sentences short", "Avoid the use of pronouns", "Only use dictionary-approved words", and "Use only the active voice". The second type of languages have a formal syntax and formal semantics, and can be mapped to an existing formal language, such as first-order logic. Thus, those languages can be used as knowledge representation languages, and writing of those languages is supported by fully automatic consistency and redundancy checks, query answering, etc. == Languages == Existing controlled natural languages include: == Encoding == IETF has reserved simple as a BCP 47 variant subtag for simplified versions of languages.

    Read more →
  • Recommender system

    Recommender system

    A recommender system, also called a recommendation algorithm, recommendation engine, or recommendation platform, is a type of information filtering system that suggests items most relevant to a particular user. The value of these systems becomes particularly evident in scenarios where users must select from a large number of options, such as products, media, or content. Major social media platforms and streaming services rely on recommender systems that employ machine learning to analyze user behavior and preferences, thereby enabling personalized content feeds. Typically, the suggestions refer to a variety decision-making processes, including the selection of a product, musical selection, or online news source to read. The implementation of recommender systems is pervasive, with commonly recognised examples including the generation of playlist for video and music services, the provision of product recommendations for e-commerce platforms, and the recommendation of content on social media platforms and the open web. These systems can operate using a single type of input, such as music, or multiple inputs from diverse platforms, including news, books and search queries. Additionally, popular recommender systems have been developed for specific topics, such as restaurants and online dating services. Recommender systems have also been developed to explore research articles and experts, collaborators, and financial services. A content discovery platform is a software recommendation platform that employs recommender system tools. It utilizes user metadata in order to identify and suggest relevant content, whilst reducing ongoing maintenance and development costs. A content discovery platform delivers personalized content to websites, mobile devices, and set-top boxes. A large range of content discovery platforms currently exist for various forms of content ranging from news articles and academic journal articles to television. As operators compete to serve as the gateway to home entertainment, personalized television emerges as a key service differentiator. Academic content discovery has recently become another area of interest, the emergence of numerous companies dedicated to assisting academic researchers in keeping up to date with relevant academic content and facilitating serendipitous discovery of new content. == Overview == Recommender systems usually make use of either or both collaborative filtering and content-based filtering, as well as other systems such as knowledge-based systems. Collaborative filtering approaches build a model from a user's past behavior (e.g., items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete, pre-tagged characteristics of an item in order to recommend additional items with similar properties. === Example === The differences between collaborative and content-based filtering can be demonstrated by comparing two early music recommender systems, Last.fm and Pandora Radio. We can also look at how these methods are applied in e-commerce, for example, on platforms like Amazon. Last.fm creates a "station" of recommended songs by observing what bands and individual tracks the user has listened to on a regular basis and comparing those against the listening behavior of other users. Last.fm will play tracks that do not appear in the user's library, but are often played by other users with similar interests. As this approach leverages the behavior of users, it is an example of a collaborative filtering technique. Pandora uses the properties of a song or artist (a subset of the 450 attributes provided by the Music Genome Project) to seed a "station" that plays music with similar properties. User feedback is used to refine the station's results, deemphasizing certain attributes when a user "dislikes" a particular song and emphasizing other attributes when a user "likes" a song. This is an example of a content-based approach. In e-commerce, Amazon's well-known "customers who bought X also bought Y" feature is a prime example of collaborative filtering. It also uses content-based filtering when it recommends a book by the same author you've previously read or a pair of shoes in a similar style to ones you've viewed. Each type of system has its strengths and weaknesses. In the above example, Last.fm requires a large amount of information about a user to make accurate recommendations. This is an example of the cold start problem, and is common in collaborative filtering systems. Whereas Pandora needs very little information to start, it is far more limited in scope (for example, it can only make recommendations that are similar to the original seed). === Alternative implementations === Recommender systems are a useful alternative to search algorithms since they help users discover items they might not have found otherwise. Of note, recommender systems are often implemented using search engines indexing non-traditional data. In some cases, like in the Gonzalez v. Google Supreme Court case, may argue that search and recommendation algorithms are different technologies. Recommender systems have been the focus of several granted patents, and there are more than 50 software libraries that support the development of recommender systems including LensKit, RecBole, ReChorus and RecPack. == History == Elaine Rich created the first recommender system in 1979, called Grundy. She looked for a way to recommend users books they might like. Her idea was to create a system that asks users specific questions and classifies them into classes of preferences, or "stereotypes", depending on their answers. Depending on users' stereotype membership, they would then get recommendations for books they might like. Another early recommender system, called a "digital bookshelf", was described in a 1990 technical report by Jussi Karlgren at Columbia University, and implemented at scale and worked through in technical reports and publications from 1994 onwards by Jussi Karlgren, then at SICS, and research groups led by Pattie Maes at MIT, Will Hill at Bellcore, and Paul Resnick, also at MIT, whose work with GroupLens was awarded the 2010 ACM Software Systems Award. Montaner provided the first overview of recommender systems from an intelligent agent perspective. Adomavicius provided a new, alternate overview of recommender systems. Herlocker provides an additional overview of evaluation techniques for recommender systems, and Beel et al. discussed the problems of offline evaluations. Beel et al. have also provided literature surveys on available research paper recommender systems and existing challenges. == Approaches == === Collaborative filtering === One approach to the design of recommender systems that has wide use is collaborative filtering. Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past. The system generates recommendations using only information about rating profiles for different users or items. By locating peer users/items with a rating history similar to the current user or item, they generate recommendations using this neighborhood. This approach is a cornerstone for e-commerce sites that analyze the purchasing patterns of thousands of users to suggest what you might like. Collaborative filtering methods are classified as memory-based and model-based. A well-known example of memory-based approaches is the user-based algorithm, while that of model-based approaches is matrix factorization (recommender systems). A key advantage of the collaborative filtering approach is that it does not rely on machine analyzable content and therefore it is capable of accurately recommending complex items such as movies without requiring an "understanding" of the item itself. Many algorithms have been used in measuring user similarity or item similarity in recommender systems. For example, the k-nearest neighbor (k-NN) approach and the Pearson Correlation as first implemented by Allen. When building a model from a user's behavior, a distinction is often made between explicit and implicit forms of data collection. Examples of explicit data collection include the following: Asking a user to rate an item on a sliding scale. Asking a user to search. Asking a user to rank a collection of items from favorite to least favorite. Presenting two items to a user and asking him/her to choose the better one of them. Asking a user to create a list of items that he/she likes (see Rocchio classification or other similar techniques). Examples of implicit data collection include the following: Observing the items that a user views in an online store, media library, or other repository of med

    Read more →
  • Informationist

    Informationist

    An informationist (or information specialist in context) provides research and knowledge management services in the context of clinical care or biomedical research. Although there is no one educational pathway or formalized set of skills or knowledge for informationists, one way to think of the informationist is as one who possesses the knowledge and skill of a medical librarian with extensive research specialization and some formal clinical or public health education that goes beyond on-the-job osmosis. Medical librarians and other biomedical professional organizations have been exploring the possibilities for evaluating how informationists are being used and whether their activities supplement or replace medical library activity. More generally, an informationist is a professional who works with information within a particular business, analytic or scientific context to drive toward outcomes based on evidence, analysis, prediction and execution. For example, an extension of the term is increasingly emerging in financial services, life sciences and health care industries. Though still nascently in use, its adoption applies to individuals with extensive industry expertise, acute familiarity with organizational structures and processes, deep domain level information mastery and information systems technical savvy. Informationists in this context support transformational initiatives within and across functional areas of an enterprise as architects, governance experts, continuous improvement advocates and strategists. == Background == The term was proposed in 2000 by Davidoff & Florance. Their editorial suggested that physicians should be delegating their information needs to informationists, just as they currently order CT scans from radiologists or cardiac catheterizations from cardiologists. They conceived of an information professional who was embedded in (and indeed, supported by) the clinical departments. Supporters of the concept see it as a means for librarians to reinvigorate connections with the faculty/clinicians, as well as provide superior service by dint of informationists' biomedical training. Critics complained that the idea is nothing new; librarians already provide in-depth, high quality information services and clinical medical librarians have been working alongside physicians, nurses and other clinicians for years. Large informationist programs in the U.S. exist at the National Institutes of Health and at Vanderbilt University. Welch Medical Library at Johns Hopkins University (JHU) is developing an informationist service model in which its 10 clinical and public health librarians are moving from serving as liaison librarians for assigned departments toward becoming embedded informationists within their departments. To prepare for the embedded informationist role, librarians are undertaking education as needed to supplement their backgrounds. For example, librarians bring experience in clinical behavior counseling, public health, nursing, and more. Informationist training can then focus upon filling gaps in research methods knowledge more so than on gaining additional knowledge in the librarian's area of expertise. Courses, seminars and workshops being undertaken include those covering systematic reviews, evidence-based medicine, critical appraisal, medical language, anatomy and physiology, biostatistics, and clinical research. The term informationist is related to that of informatician—also informaticist—and many informationists do possess skills in clinical topics, bioinformatics, and biomedical informatics. Harvard University, the University of Pittsburgh, and Washington University in St. Louis are examples of institutional libraries which have hired PhD-level scientists (who may or may not have library degrees) to provide informatics support for biomedical research.

    Read more →
  • National Data Repository

    National Data Repository

    A National Data Repository (NDR) is a data bank that seeks to preserve and promote a country's natural resources data, particularly data related to the petroleum exploration and production (E&P) sector. A National Data Repository is normally established by an entity that governs, controls and supports the exchange, capture, transference and distribution of E&P information, with the final target to provide the State with the tools and information to assure the growth, govern-ability, control, independence and sovereignty of the industry. The two fundamental reasons for a country to establish an NDR are to preserve data generated inside the country by the industry, and to promote investments in the country by utilizing data to reduce the exploration, production, and transportation business risks. Countries take different approaches towards preserving and promoting their natural resources data. The approach varies according to a country's natural resources policies, level of openness, and its attitude towards foreign investment. == Data types == NDRs store a vast array of data related to a country's natural resources. This includes wells, well log data, well reports, core samples, seismic surveys, post-stack seismic, field data/tapes, seismic (acquisition/processing) reports, production data, geological maps and reports, license data and geological models. == Funding models == Some NDRs are financed entirely by a country's government. Others are industry-funded. Still some are hybrid systems, funded in part by industry and government. NDRs typically charge fees for data requests and for data loading. The cost differs significantly between countries. In some cases an annual membership is charged to oil companies to store and access the data in the NDR. == Standards body == Energistics is the global energy standards resource center for the upstream oil and gas industry. Energistics National Data Repository Work Group: The standards body is Energistics. === Energistics-standards-directory === Global regulators of upstream oil and natural gas information, including seismic, drilling, production and reservoir data, formed the National Data Repository (NDR) Work Group in 2008 to collaborate on the development of data management standards and to assist emerging nations with hydrocarbon reserves to better collect, maintain and deliver oil and gas data to the public and to the industry. Ten countries, led by the Netherlands, Norway and the United Kingdom, formed NDR to share best practices and to formalize the development and deployment of data management standards for regulatory agencies. The other countries involved in the NDR Work Group's formation are Australia, Canada, India, Kenya, New Zealand, South Africa and the United States. Annual NDR Conference: Approximately every 18 months Energistics organizes a National Data Repository Conference. The purpose is to provide government and regulatory agencies from around the world an opportunity to attend a series of workshops dedicated to developing data exchange standards, improving communications with the oil and gas industry and learning data management techniques for natural resources information. === Society of Exploration Geophysicists and The International Oil and Gas Producers Association === The SEG is the custodian of the SEG standards which are used for the exchange, retention and release of seismic data. They are commonly used by National Data Repositories with the SEGD and SEGY being the field and processed exchange standards respectively. == NDRs around the world == Click here to see a map of the NDRs around the world

    Read more →
  • Keka HR

    Keka HR

    Keka HR is a software company that provides cloud-based human resource management and payroll automation software. Keka HR specializes in providing business services in the field of HR technology, payroll automation, recruiting, leave, attendance and performance management. The company was founded by Vijay Yalamanchili on July 21, 2014. The company is headquartered in Hyderabad, with operations in Singapore and the United States. == History == Keka HR was established in 2014 in Hyderabad, Telangana, India. In 2015, the company entered the Indian HR market and received the HYSEA Startup Award. By 2019, Keka HR had surpassed $1 million in annual recurring revenue (ARR). During the COVID-19 pandemic in 2020, the company reported a sevenfold increase in sales. By 2021, the company had raised $1.6 million through Recur Club. In 2022, Keka HR secured $57 million in Series A funding from West Bridge Capital. The company's headquarters are located in Gachibowli, Hyderabad, with offices in Singapore and Seattle, Washington.

    Read more →
  • Applications of artificial intelligence

    Applications of artificial intelligence

    Artificial intelligence is the capability of computational systems to perform tasks that are typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. Artificial intelligence has been used in applications throughout industry and academia. Within the field of Artificial Intelligence, there are multiple subfields. The subfield of machine learning has been used for various scientific and commercial purposes, including language translation, image recognition, decision-making, credit scoring, and e-commerce. In recent years, massive advancements have been made in the field of generative artificial intelligence, which uses generative models to generate text, images, videos, and other forms of data. This article describes applications of AI in different sectors. == Agriculture == In agriculture, AI has been proposed as a way for farmers to identify areas that need irrigation, fertilization, or pesticide treatments to increase yields, thereby improving efficiency. AI has been used to attempt to classify livestock pig call emotions, automate greenhouses, detect diseases and pests, and optimize irrigation. == AI-assisted software develoment == == Architecture and design == == Business == A 2023 study found that generative AI increased productivity by 15% in contact centers. Another 2023 study found it increased productivity by up to 40% in writing tasks. An August 2025 review by MIT found that of surveyed companies, 95% did not report any improvement in revenue from the use of AI. A September 2025 article by the Harvard Business Review describes how increased use of AI does not automatically lead to increases in revenue or actual productivity. Referring to "AI generated work content that masquerades as good work, but lacks the substance to meaningfully advance a given task" the article coins the term workslop. Per studies done in collaboration with the Stanford Social Media Lab, workslop does not improve productivity and undermines trust and collaboration among colleagues. In telehealth, agentic AI is reportedly facilitating the creation of large business models (millions in annual profit) with 1-2 employees, such as MEDVi, which as of August 2025 only had 2 employees and ~$75M in annual profit for GLP-1 weight-loss telehealth services. == Chatbots == == Computer science == === Programming assistance === ==== AI-assisted software development ==== AI can be used for real-time code completion, chat, and automated test generation. These tools are typically integrated with editors and IDEs as plugins. AI-assisted software development systems differ in functionality, quality, speed, and approach to privacy. Creating software primarily via AI is known as "vibe coding". Code created or suggested by AI can be incorrect or inefficient. The use of AI-assisted coding can potentially speed-up software development, but can also slow-down the process by creating more work when debugging and testing. The rush to prematurely adopt AI technology can also incur additional technical debt. AI also requires additional consideration and careful review for cybersecurity, since AI coding software is trained on a wide range of code of inconsistent quality and often replicates poor practices. ==== Neural network design ==== AI can be used to create other AIs. For example, around November 2017, Google's AutoML project to evolve new neural net topologies created NASNet, a system optimized for ImageNet and POCO F1. NASNet's performance exceeded all previously published performance on ImageNet. ==== Quantum computing ==== Research and development of quantum computers has been performed with machine learning algorithms. For example, there is a prototype, photonic, quantum memristive device for neuromorphic computers (NC)/artificial neural networks and NC-using quantum materials with some variety of potential neuromorphic computing-related applications. The use of quantum machine learning for quantum simulators has been proposed for solving physics and chemistry problems. === Historical contributions === AI researchers have created many tools to solve the most difficult problems in computer science. Many of their inventions have been adopted by mainstream computer science and are no longer considered AI. All of the following were originally developed in AI laboratories: Time sharing Interactive interpreters Graphical user interfaces and the computer mouse Rapid application development environments The linked list data structure Automatic storage management Symbolic programming Functional programming Dynamic programming Object-oriented programming Optical character recognition Constraint satisfaction == Customer service == === Human resources === AI programs have been used in hiring processes to screen resumes and rank candidates based on their qualifications, predict a candidate's likelihood of success in a given role, and automate repetitive communication tasks using chatbots. Studies on these programs have identified tendencies for gender bias, favoring male names and male-coded characteristics, as well as bias against disabled candidates and racial minorities. === Online and telephone customer service === AI underlies avatars (automated online assistants) on web pages. It can reduce operation and training costs. Pypestream automated customer service for its mobile application to streamline communication with customers. A Google app analyzes language and converts speech into text. The platform can identify angry customers through their language and respond appropriately. Amazon uses a chatbot for customer service that can perform tasks like checking the status of an order, cancelling orders, offering refunds and connecting the customer with a human representative. Generative AI (GenAI), such as ChatGPT, is increasingly used in business to automate tasks and enhance decision-making. === Hospitality === In the hospitality industry, AI is used to reduce repetitive tasks, analyze trends, interact with guests, and predict customer needs. AI hotel services come in the form of a chatbot, application, virtual voice assistant and service robots. == Education == In educational institutions, AI has been used to automate routine tasks such as attendance tracking, grading, and marking. AI tools have also been used to monitor student progress and analyze learning behaviors, with the goal of facilitating timely interventions for students facing academic challenges. == Energy and environment == === Energy system === The U.S. Department of Energy wrote in an April 2024 report that AI may have applications in modeling power grids, reviewing federal permits with large language models, predicting levels of renewable energy production, and improving the planning process for electrical vehicle charging networks. Other studies have suggested that machine learning can be used for energy consumption prediction and scheduling, e.g. to help with renewable energy intermittency management (see also: smart grid and climate change mitigation in the power grid). === Environmental monitoring === Autonomous ships that monitor the ocean, AI-driven satellite data analysis, passive acoustics or remote sensing and other applications of environmental monitoring make use of machine learning. For example, "Global Plastic Watch" is an AI-based satellite monitoring-platform for analysis/tracking of plastic waste sites to help prevention of plastic pollution – primarily ocean pollution – by helping identify who and where mismanages plastic waste, dumping it into oceans. === Early-warning systems === Machine learning can be used to spot early-warning signs of disasters and environmental issues, possibly including natural pandemics, earthquakes, landslides, heavy rainfall, long-term water supply vulnerability, tipping-points of ecosystem collapse, cyanobacterial bloom outbreaks, and droughts. === Economic and social challenges === The University of Southern California launched the Center for Artificial Intelligence in Society, with the goal of using AI to address problems such as homelessness. Stanford researchers use AI to analyze satellite images to identify high poverty areas. == Entertainment and media == === Media === AI applications analyze media content such as movies, TV programs, advertisement videos or user-generated content. The solutions often involve computer vision. Typical scenarios include the analysis of images using object recognition or face recognition techniques, or the analysis of video for scene recognizing scenes, objects or faces. AI-based media analysis can facilitate media search, the creation of descriptive keywords for content, content policy monitoring (such as verifying the suitability of content for a particular TV viewing time), speech to text for archival or other purposes, and the detection of logos, products or celebrity faces for ad placement. Motion interpolation Pixel-art scaling algorithms Image scaling Imag

    Read more →
  • Overcategorization

    Overcategorization

    Overcategorization or category clutter is a phenomenon during classification where too many categories or classes are assigned to a document, record, or item. Overcategorization is related to the library and information science (LIS) concepts of document classification and subject indexing. It is also related to online shopping where excessive product categories can overwhelm users with too many choices or make it more difficult for customers to find the products they need. Although these categories are intended to improve organization and ease of navigation when shipping online, too many categories can lower customer satisfaction, increase difficulty navigating the online store, and reduce future shopping intentions. In LIS, the ideal number of terms that should be assigned to classify an item are measured by the variables precision and recall. Assigning few category labels that are most closely related to the content of the item being classified will result in searches that have high precision, I.e., where a high proportion of the results are closely related to the query. Assigning more category labels to each item will reduce the precision of each search, but increase the recall, retrieving more relevant results. Related LIS concepts include exhaustivity of indexing and information overload. == Basic principles == If too many categories are assigned to a given document, the implications for users depend on how informative the links are. If the user is able to distinguish between useful and not useful links, the damage is limited: The user only wastes time selecting links. In many cases, however, the user cannot judge whether or not a given link will turn out to be fruitful. In that case he or she has to follow the link and to read or skim another document. The worst case scenario is, of course, that even after reading the new document the user is unable to decide whether or not it might be useful if its subject matter is not thoroughly investigated. Overcategorization also has another unpleasant implication: It makes the system (for example in Wikipedia) difficult to maintain in a consistent way. If the system is inconsistent, it means that when the user considers the links in a given category, he or she will not find all documents relevant to that category. Basically, the problem of overcategorization should be understood from the perspective of relevance and the traditional measures of recall and precision. If too few relevant categories are assigned to a document, recall may decrease. If too many non-relevant categories are assigned, precision becomes lower. The hard job is to say which categories are fruitful or relevant for future use of the document.

    Read more →
  • Enumeration algorithm

    Enumeration algorithm

    In computer science, an enumeration algorithm is an algorithm that enumerates the answers to a computational problem. Formally, such an algorithm applies to problems that take an input and produce a list of solutions, similarly to function problems. For each input, the enumeration algorithm must produce the list of all solutions, without duplicates, and then halt. The performance of an enumeration algorithm is measured in terms of the time required to produce the solutions, either in terms of the total time required to produce all solutions, or in terms of the maximal delay between two consecutive solutions and in terms of a preprocessing time, counted as the time before outputting the first solution. This complexity can be expressed in terms of the size of the input, the size of each individual output, or the total size of the set of all outputs, similarly to what is done with output-sensitive algorithms. == Formal definitions == An enumeration problem P {\displaystyle P} is defined as a relation R {\displaystyle R} over strings of an arbitrary alphabet Σ {\displaystyle \Sigma } : R ⊆ Σ ∗ × Σ ∗ {\displaystyle R\subseteq \Sigma ^{}\times \Sigma ^{}} An algorithm solves P {\displaystyle P} if for every input x {\displaystyle x} the algorithm produces the (possibly infinite) sequence y {\displaystyle y} such that y {\displaystyle y} has no duplicate and z ∈ y {\displaystyle z\in y} if and only if ( x , z ) ∈ R {\displaystyle (x,z)\in R} . The algorithm should halt if the sequence y {\displaystyle y} is finite. == Common complexity classes == Enumeration problems have been studied in the context of computational complexity theory, and several complexity classes have been introduced for such problems. A very general such class is EnumP, the class of problems for which the correctness of a possible output can be checked in polynomial time in the input and output. Formally, for such a problem, there must exist an algorithm A which takes as input the problem input x, the candidate output y, and solves the decision problem of whether y is a correct output for the input x, in polynomial time in x and y. For instance, this class contains all problems that amount to enumerating the witnesses of a problem in the class NP. Other classes that have been defined include the following. In the case of problems that are also in EnumP, these problems are ordered from least to most specific: Output polynomial, the class of problems whose complete output can be computed in polynomial time. Incremental polynomial time, the class of problems where, for all i, the i-th output can be produced in polynomial time in the input size and in the number i. Polynomial delay, the class of problems where the delay between two consecutive outputs is polynomial in the input (and independent from the output). Strongly polynomial delay, the class of problems where the delay before each output is polynomial in the size of this specific output (and independent from the input or from the other outputs). The preprocessing is generally assumed to be polynomial. Constant delay, the class of problems where the delay before each output is constant, i.e., independent from the input and output. The preprocessing phase is generally assumed to be polynomial in the input. == Common techniques == Backtracking: The simplest way to enumerate all solutions is by systematically exploring the space of possible results (partitioning it at each successive step). However, performing this may not give good guarantees on the delay, i.e., a backtracking algorithm may spend a long time exploring parts of the space of possible results that do not give rise to a full solution. Flashlight search: This technique improves on backtracking by exploring the space of all possible solutions but solving at each step the problem of whether the current partial solution can be extended to a partial solution. If the answer is no, then the algorithm can immediately backtrack and avoid wasting time, which makes it easier to show guarantees on the delay between any two complete solutions. In particular, this technique applies well to self-reducible problems. Closure under set operations: If we wish to enumerate the disjoint union of two sets, then we can solve the problem by enumerating the first set and then the second set. If the union is non disjoint but the sets can be enumerated in sorted order, then the enumeration can be performed in parallel on both sets while eliminating duplicates on the fly. If the union is not disjoint and both sets are not sorted then duplicates can be eliminated at the expense of a higher memory usage, e.g., using a hash table. Likewise, the cartesian product of two sets can be enumerated efficiently by enumerating one set and joining each result with all results obtained when enumerating the second step. == Examples of enumeration problems == The vertex enumeration problem, where we are given a polytope described as a system of linear inequalities and we must enumerate the vertices of the polytope. Enumerating the minimal transversals of a hypergraph. This problem is related to monotone dualization and is connected to many applications in database theory and graph theory. Enumerating the answers to a database query, for instance a conjunctive query or a query expressed in monadic second-order. There have been characterizations in database theory of which conjunctive queries could be enumerated with linear preprocessing and constant delay. The problem of enumerating maximal cliques in an input graph, e.g., with the Bron–Kerbosch algorithm Listing all elements of structures such as matroids and greedoids Several problems on graphs, e.g., enumerating independent sets, paths, cuts, etc. Enumerating the satisfying assignments of representations of Boolean functions, e.g., a Boolean formula written in conjunctive normal form or disjunctive normal form, a binary decision diagram such as an OBDD, or a Boolean circuit in restricted classes studied in knowledge compilation, e.g., NNF. == Connection to computability theory == The notion of enumeration algorithms is also used in the field of computability theory to define some high complexity classes such as RE, the class of all recursively enumerable problems. This is the class of sets for which there exist an enumeration algorithm that will produce all elements of the set: the algorithm may run forever if the set is infinite, but each solution must be produced by the algorithm after a finite time.

    Read more →
  • Language model benchmark

    Language model benchmark

    A language model benchmark is a standardized test designed to evaluate the performance of language models on various natural language processing tasks. These tests are intended for comparing different models' capabilities in areas such as language understanding, generation, and reasoning. Benchmarks generally consist of a dataset and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like answering questions, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field. In addition to accuracy, the metrics can include throughput, energy efficiency, bias, trust, and sustainability. == Overview == === Types === Benchmarks may be described by the following adjectives, not mutually exclusive: Classical: These tasks are studied in natural language processing, even before the advent of deep learning. Examples include the Penn Treebank for testing syntactic and semantic parsing, as well as bilingual translation benchmarked by BLEU scores. Question answering: These tasks have a text question and a text answer, often multiple-choice. They can be open-book or closed-book. Open-book QA resembles reading comprehension questions, with relevant passages included as annotation in the question, in which the answer appears. Closed-book QA includes no relevant passages. Closed-book QA is also called open-domain question-answering. Before the era of large language models, open-book QA was more common, and understood as testing information retrieval methods. Closed-book QA became common since GPT-2 as a method to measure knowledge stored within model parameters. Omnibus: An omnibus benchmark combines many benchmarks, often previously published. It is intended as an all-in-one benchmarking solution. Reasoning: These tasks are usually in the question-answering format, but are intended to be more difficult than standard question answering. Multimodal: These tasks require processing not only text, but also other modalities, such as images and sound. Examples include OCR and transcription. Agency: These tasks are for a language-model–based software agent that operates a computer for a user, such as editing images, browsing the web, etc. Adversarial: A benchmark is "adversarial" if the items in the benchmark are picked specifically so that certain models do badly on them. Adversarial benchmarks are often constructed after state of the art (SOTA) models have saturated (achieved 100% performance) a benchmark, to renew the benchmark. A benchmark is "adversarial" only at a certain moment in time, since what is adversarial may cease to be adversarial as newer SOTA models appear. Public/Private: A benchmark might be partly or entirely private, meaning that some or all of the questions are not publicly available. The idea is that if a question is publicly available, then it might be used for training, which would be "training on the test set" and invalidate the result of the benchmark. Usually, only the guardians of the benchmark have access to the private subsets, and to score a model on such a benchmark, one must send the model weights, or provide API access, to the guardians. The boundary between a benchmark and a dataset is not sharp. Generally, a dataset contains three "splits": training, test, and validation. Both the test and validation splits are essentially benchmarks. In general, a benchmark is distinguished from a test/validation dataset in that a benchmark is typically intended to be used to measure the performance of many different models that are not trained specifically for doing well on the benchmark, while a test/validation set is intended to be used to measure the performance of models trained specifically on the corresponding training set. In other words, a benchmark may be thought of as a test/validation set without a corresponding training set. Conversely, certain benchmarks may be used as a training set, such as the English Gigaword or the One Billion Word Benchmark, which in modern language is just the negative log-likelihood loss on a pretraining set with 1 billion words. Indeed, the distinction between benchmark and dataset in language models became sharper after the rise of the pretraining paradigm, whereby a model is first trained on massive, unlabeled datasets to learn general language patterns, syntax, and knowledge (pretraining), and the base model is then adapted to specific, downstream tasks using smaller, labeled datasets (fine-tuning). === Lifecycle === Generally, the life cycle of a benchmark consists of the following steps: Inception: A benchmark is published. It can be simply given as a demonstration of the power of a new model (implicitly) that others then picked up as a benchmark, or as a benchmark that others are encouraged to use (explicitly). Growth: More papers and models use the benchmark, and the performance on the benchmark grows. Maturity, degeneration or deprecation: A benchmark may be saturated, after which researchers move on to other benchmarks. Progress on the benchmark may also be neglected as the field moves to focus on other benchmarks. Renewal: A saturated benchmark can be upgraded to make it no longer saturated, allowing further progress. === Construction === Like datasets, benchmarks are typically constructed by several methods, individually or in combination: Web scraping: Ready-made question-answer pairs may be scraped online, such as from websites that teach mathematics and programming. Conversion: Items may be constructed programmatically from scraped web content, such as by blanking out named entities from sentences, and asking the model to fill in the blank. This was used for making the CNN/Daily Mail Reading Comprehension Task. Crowd sourcing: Items may be constructed by paying people to write them, such as on Amazon Mechanical Turk. This was used for making the MCTest. === Evaluation === Generally, benchmarks are fully automated. This limits the questions that can be asked. For example, with mathematical questions, "proving a claim" would be difficult to automatically check, while "calculate an answer with a unique integer answer" would be automatically checkable. With programming tasks, the answer can generally be checked by running unit tests, with an upper limit on runtime. The benchmark scores are of the following kinds: For multiple choice or cloze questions, common scores are accuracy (frequency of correct answer), precision, recall, F1 score, etc. pass@n: The model is given n {\displaystyle n} attempts to solve each problem. If any attempt is correct, the model earns a point. The pass@n score is the model's average score over all problems. k@n: The model makes n {\displaystyle n} attempts to solve each problem, but only k {\displaystyle k} attempts out of them are selected for submission. If any submission is correct, the model earns a point. The k@n score is the model's average score over all problems. cons@n: The model is given n {\displaystyle n} attempts to solve each problem. If the most common answer is correct, the model earns a point. The cons@n score is the model's average score over all problems. Here "cons" stands for "consensus" or "majority voting". The pass@n score can be estimated more accurately by making N > n {\displaystyle N>n} attempts, and use the unbiased estimator 1 − ( N − c n ) ( N n ) {\displaystyle 1-{\frac {\binom {N-c}{n}}{\binom {N}{n}}}} , where c {\displaystyle c} is the number of correct attempts. For less well-formed tasks, where the output can be any sentence, there are the following commonly used scores including BLEU ROUGE, METEOR, NIST, word error rate, LEPOR, CIDEr, and SPICE. === Issues === error: Some benchmark answers may be wrong. ambiguity: Some benchmark questions may be ambiguously worded. subjective: Some benchmark questions may not have an objective answer at all. This problem generally prevents creative writing benchmarks. Similarly, this prevents benchmarking writing proofs in natural language, though benchmarking proofs in a formal language is possible. open-ended: Some benchmark questions may not have a single answer of a fixed size. This problem generally prevents programming benchmarks from using more natural tasks such as "write a program for X", and instead uses tasks such as "write a function that implements specification X". inter-annotator agreement: Some benchmark questions may be not fully objective, such that even people would not agree with 100% on what the answer should be. This is common in natural language processing tasks, such as syntactic annotation. shortcut: Some benchmark questions may be easily solved by an "unintended" shortcut. For example, in the SNLI benchmark, having a negative word like "not" in the second sentence is a strong signal for the "Contradiction" category, regardless of what the se

    Read more →
  • Online public access catalog

    Online public access catalog

    The online public access catalog (OPAC), now frequently synonymous with library catalog, is an online database of materials held by a library or group of libraries. Online catalogs have largely replaced the analog card catalogs previously used in libraries. == History == === Early online === Although a handful of experimental systems existed as early as the 1960s, the first large-scale online catalogs were developed at Ohio State University in 1975 and the Dallas Public Library in 1978. These and other early online catalog systems tended to closely reflect the card catalogs that they were intended to replace. Using a dedicated terminal or telnet client, users could search a handful of pre-coordinate indexes and browse the resulting display in much the same way they had previously navigated the card catalog. Throughout the 1980s, the number and sophistication of online catalogs grew. The first commercial systems appeared, and would by the end of the decade largely replace systems built by libraries themselves. Library catalogs began providing improved search mechanisms, including Boolean and keyword searching, as well as ancillary functions, such as the ability to place holds on items that had been checked-out. At the same time, libraries began to develop applications to automate the purchase, cataloging, and circulation of books and other library materials. These applications, collectively known as an integrated library system (ILS) or library management system, included an online catalog as the public interface to the system's inventory. Most library catalogs are closely tied to their underlying ILS system. === Stagnation and dissatisfaction === The 1990s saw a relative stagnation in the development of online catalogs. Although the earlier character-based interfaces were replaced with ones for the Web, both the design and the underlying search technology of most systems did not advance much beyond that developed in the late 1980s. At the same time, organizations outside of libraries began developing more sophisticated information retrieval systems. Web search engines like Google and popular e-commerce websites such as Amazon.com provided simpler to use (yet more powerful) systems that could provide relevancy ranked search results using probabilistic and vector-based queries. Prior to the widespread use of the Internet, the online catalog was often the first information retrieval system library users ever encountered. Now accustomed to web search engines, newer generations of library users have grown increasingly dissatisfied with the complex (and often arcane) search mechanisms of older online catalog systems. This has, in turn, led to vocal criticisms of these systems within the library community itself, and in recent years to the development of newer (often termed 'next-generation') catalogs. === Next-generation catalogs === Newer generations of library catalog systems, typically called discovery systems (or a discovery layer), are distinguished from earlier OPACs by their use of more sophisticated search technologies, including relevancy ranking and faceted search, as well as features aimed at greater user interaction and participation with the system, including tagging and reviews. These new features rely heavily on existing metadata which may be poor or inconsistent, particularly for older records. Newer catalog platforms may be independent of the organization's integrated library system (ILS), instead providing drivers that allow for the synchronization of data between the two systems. While the original online catalog interfaces were almost exclusively built by ILS vendors, libraries have increasingly sought next-generation catalogs built by enterprise search companies and open-source software projects, often led by libraries themselves. == Union catalogs == Although library catalogs typically reflect the holdings of a single library, they can also contain the holdings of a group or consortium of libraries. These systems, known as union catalogs, are usually designed to aid the borrowing of books and other materials among the member institutions via interlibrary loan. Examples of this type of catalogs include COPAC, SUNCAT, NLA Trove, and WorldCat—the last catalogs the collections of libraries worldwide. == Related systems == There are a number of systems that share much in common with library catalogs, but have traditionally been distinguished from them. Libraries utilize these systems to search for items not traditionally covered by a library catalog, although these systems are sometimes integrated into a more comprehensive discovery system. Bibliographic databases—such as Medline, ERIC, PsycINFO, Scopus, Web of Science, and many others—index journal articles and other research data. There are also a number of applications aimed at managing documents, photographs, and other digitized or born-digital items such as Digital Commons and DSpace. Particularly in academic libraries, these systems (often known as digital library systems or institutional repository systems) assist with efforts to preserve documents created by faculty and students. Electronic resource management helps librarians to track selection, acquisition, and licensing of a library's electronic information resources.

    Read more →
  • Quality of Data

    Quality of Data

    Quality of Data (QoD) is a designation coined by L. Veiga, that specifies and describes the required Quality of Service of a distributed storage system from the Consistency point of view of its data. It can be used to support big data management frameworks, Workflow management, and HPC systems (mainly for data replication and consistency). It takes into account data semantics, namely the Time interval of data freshness, the Sequence of tolerable number of outstanding versions of the data read before ore refresh, and the Value divergence allowed before displaying it. Initially it was based on a model from an existing research work regarding vector-field Consistency, awarded the best-paper prize in the ACM/IFIP/Usenix Middleware Conference 2007 and later enhanced for increased scalability and fault-tolerance. This consistency model has been successfully applied and proven in big data key/value store Apache HBase, initially designed as a middleware module seating between clusters from separate data centres. The HBase-QoD coupling minimises bandwidth usage and optimises resources allocation during replication achieving the desired consistency level at a more fine-grained level. QoD is defined by the three-dimensions of vector k=(θ,σ,ν), but with a broader view of the issue, applicable also to large-scale data management techniques in regards to their timely delivery. == Other descriptions == Quality of Data should not be confused with other definitions for data quality such as completeness, validity, and accuracy.

    Read more →
  • Applied Information Science in Economics

    Applied Information Science in Economics

    The Applied Information Science in Economics (Russian: Прикладная информатика в Экономике) or Applied Computer Science in Economics is a professional qualification generally awarded in Russian Federation. The degree inherited from the U.S.S.R. education system also known as Specialist degree. The degree is awarded after five years of full-time study and includes several internships, course-works, thesis writing and defense. The degree has similarities with German Magister Artium or Diplom degree. However, due to the Bologna Process number of such degrees are declining. Degree focuses on applying mathematical methods in economics involving maximum information technology. It is very close to applied mathematics, but includes also major part of computer science. == List of specialty codes in the education system == 080801 - Applied computer science in economics 351400 - Applied computer science == Fields of activity == Organization and management; Project design; Experimental research; Marketing; Consulting; Operational and Maintenance. == Major == Information Science and Programming. High Level Methods of Information Science and Programming. Information Technologies in Economics. Computer Systems, Networks and Telecommunications Services. Operational Environments, Systems and Shells. Architecture and Design of Information Systems for Companies. Data Bases. Information security. Information Management. Imitative Simulation.

    Read more →
  • Video renderer

    Video renderer

    A video renderer is software that processes a video file and sends it sequentially to the video display controller card for display on a computer screen. An example of a video renderer, is the VMR-7 that was used by Microsoft's DirectShow. An example of a UNIX video renderer is the one container within GStreamer. Commonly used video renderers are: Enhanced Video Renderer VMR9 Renderless Haali's Video Renderer Madvr Video Renderer JRVR, a part of JRiver Media Center

    Read more →
  • Uncertain database

    Uncertain database

    An uncertain database is a kind of database studied in database theory. The goal of uncertain databases is to manage information on which there is some uncertainty. Uncertain databases make it possible to explicitly represent and manage uncertainty on the data, usually in a succinct way. == Formal definition == At the basis of uncertain databases is the notion of possible world. Specifically, a possible world of an uncertain database is a (certain) database which is one of the possible realizations of the uncertain database. A given uncertain database typically has more than one, and potentially infinitely many, possible worlds. A formalism to represent uncertain databases then explains how to succinctly represent a set of possible worlds into one uncertain database. == Types of uncertain databases == Uncertain database models differ in how they represent and quantify these possible worlds: Incomplete databases are a compact representation of the set of possible worlds – the use of NULL in SQL, arguably the most commonplace instantiation of uncertain databases, is an example of incomplete database model. Probabilistic databases are a compact representation of a probability distribution over the set of possible worlds. Fuzzy databases are a compact representation of a fuzzy set of the possible worlds. Though mostly studied in the relational setting, uncertain database models can also be defined in other relational models such as graph databases or XML databases. === Incomplete database === The most common database model is the relational model. Multiple incomplete database models have been defined over the relational model, that form extensions to the relational algebra. These have been called Imieliński–Lipski algebras: Relations with NULL values, also called Codd tables c-tables v-tables === Example === The following table is a relation of an incomplete database, described in the formalism of NULL values: There are infinitely many possible worlds for this incomplete database, obtained by replacing the "NULL" values with concrete values. For instance, the following relation is a possible world:

    Read more →
  • Kinodynamic planning

    Kinodynamic planning

    In robotics and motion planning, kinodynamic planning is a class of problems for which velocity, acceleration, and force/torque bounds must be satisfied, together with kinematic constraints such as avoiding obstacles. The term was coined by Bruce Donald, Pat Xavier, John Canny, and John Reif. Donald et al. developed the first polynomial-time approximation schemes (PTAS) for the problem. By providing a provably polynomial-time ε-approximation algorithm, they resolved a long-standing open problem in optimal control. Their first paper considered time-optimal control ("fastest path") of a point mass under Newtonian dynamics, amidst polygonal (2D) or polyhedral (3D) obstacles, subject to state bounds on position, velocity, and acceleration. Later they extended the technique to many other cases, for example, to 3D open-chain kinematic robots under full Lagrangian dynamics. == Modern approaches == Since the foundational theoretical work of the 1990s, the field has evolved significantly with new algorithmic approaches that address the computational and practical limitations of early methods. === Sampling-based methods === Many practical heuristic algorithms based on stochastic optimization and iterative sampling have been developed by a wide range of authors to address the kinodynamic planning problem. Popular approaches include extensions of RRT algorithms such as RRT for kinodynamic systems, and sampling-based methods like Model Predictive Path Integral (MPPI) control. These stochastic techniques have been shown to work well in practice and can handle complex, high-dimensional state spaces more efficiently than deterministic methods. However, all motion planning methods are subject to the PSPACE-hardnesss of classical motion planning even without dynamics, which means (assuming the usual structural complexity conjectures) they all can be worst-case exponential-time in the state-space dimension (the number of degrees of freedom). On the other hand, the deterministic methods have provable guarantees of completeness, accuracy, and complexity (for fixed dimension, they are polynomial-time not only in the geometric complexity, but also in ( 1 / ε ) {\displaystyle (1/\varepsilon )} , the closeness of the desired approximation), whereas most of the recent heuristic/stochastic methods sacrifice at least one of these criteria. === Mixed-integer optimization approaches === Recent advances in mixed-integer programming have enabled new deterministic approaches to kinodynamic planning. These methods formulate the planning problem as an optimization task that simultaneously determines the spatial path and control sequence while respecting all kinodynamic constraints. By using techniques such as McCormick envelopes to handle bilinear constraints, these approaches can provide globally optimal solutions with mathematical guarantees while achieving significant computational speedups over traditional methods. === Genetic algorithm approaches === Genetic algorithms have also been adapted for kinodynamic planning, particularly for gradient-free optimization in challenging terrain. These methods use evolutionary computation to optimize trajectories over receding horizons, with specialized mutation operators that ensure vehicle controls remain within operational limits. This approach is particularly useful when dealing with non-differentiable cost functions or when gradient information is unavailable or unreliable. === Three-dimensional terrain planning === The foundational theoretical work of the 1990s was extended to higher degrees of freedom, and even to n {\displaystyle n} -link, 3D open-chain kinematic robots under full Lagrangian dynamics. However, many of the subsequent heuristic techniques (typically employing stochastic optimization) were confined to planar environments. More recent kinodynamic planning has extended beyond these planar environments to handle complex 3D terrains represented as simplicial complexes or triangular meshes. This advancement is particularly important for applications such as autonomous vehicle navigation in off-road environments, where elevation changes and terrain geometry significantly impact vehicle dynamics. These methods must account for pitch angles, surface curvature, and the coupling between terrain geometry and vehicle kinodynamic constraints. == Performance and guarantees == The landscape of performance guarantees in kinodynamic planning has evolved considerably. While early heuristic methods could not guarantee optimality, recent mixed-integer approaches have demonstrated the ability to find globally optimal solutions with proven constraint satisfaction. Experimental comparisons have shown that modern optimization-based planners can achieve execution times several orders of magnitude faster than sampling-based methods while maintaining strict adherence to kinodynamic constraints. However, the choice of method often depends on the specific application requirements. Sampling-based methods remain valuable for their ability to quickly find feasible solutions in high-dimensional spaces and their robustness to modeling uncertainties. Optimization-based methods excel when optimality guarantees and constraint compliance are critical, particularly in safety-critical applications. == Applications == Kinodynamic planning finds applications across numerous domains including: Autonomous vehicles: Path planning for cars, trucks, and other ground vehicles that must respect acceleration, steering, and velocity limits Aerial robotics: Trajectory planning for quadrotors and other unmanned aerial vehicles with dynamic constraints Manipulation: Planning for robotic arms where joint velocities, accelerations, and torques are limited Legged locomotion: Footstep and trajectory planning for walking and running robots Space robotics: Planning under thrust and fuel constraints for spacecraft and rovers

    Read more →