AI For Kids Dale Lane

AI For Kids Dale Lane — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Teaspiller

    Teaspiller

    Teaspiller was a US-based web application for customers to find accountants and hire them to do their taxes and accounting online. In 2013 the company was acquired by Intuit, Inc and added to its TurboTax product line. The Teaspiller employees and code were all acquired and the product was renamed as "TurboTax CPA select". It enabled accountants to work remotely with clients (share files, send secure messages, schedule appointments), as well as find new clients looking for their specific skills through a complex search algorithm. This was done through extended profiles containing licensing information, professional histories, user ratings, peer endorsements, association memberships, and practice areas. The service had been called an H&R Block killer by Business Insider as it helped customers find accountants to prepare tax returns online. As of 2011 it had 20,000 US accountants listed on the site. The application was built using the Django framework. == History == Teaspiller was built by Vemdara, LLC, a web company based in New York and founded in 2009 by Amit Vemuri (a former VP at Travelocity). The web application was launched in 2010. In 2013 the company was acquired by Intuit as part of their TurboTax product line and renamed as "TurboTax CPA select".

    Read more →
  • Materials informatics

    Materials informatics

    Materials informatics is a field of study that applies the principles of informatics and data science to materials science and engineering to improve the understanding, use, selection, development, and discovery of materials. The term "materials informatics" is frequently used interchangeably with "data science", "machine learning", and "artificial intelligence" by the community. This is an emerging field, with a goal to achieve high-speed and robust acquisition, management, analysis, and dissemination of diverse materials data with the goal of greatly reducing the time and risk required to develop, produce, and deploy new materials, which generally takes longer than 20 years. This field of endeavor is not limited to some traditional understandings of the relationship between materials and information. Some more narrow interpretations include combinatorial chemistry, process modeling, materials databases, materials data management, and product life cycle management. Materials informatics is at the convergence of these concepts, but also transcends them and has the potential to achieve greater insights and deeper understanding by applying lessons learned from data gathered on one type of material to others. By gathering appropriate meta data, the value of each individual data point can be greatly expanded. == Databases == Databases are essential for any informatics research and applications. In material informatics many databases exist containing both empirical data obtained experimentally, and theoretical data obtained computationally. Big data that can be used for machine learning is particularly difficult to obtain for experimental data due to the lack of a standard for reporting data and the variability in the experimental environment. This lack of big data has led to growing effort in developing machine learning techniques that utilize data extremely data sets. On the other hand, large uniform database of theoretical density functional theory (DFT) calculations exists. These databases have proven their utility in high-throughput material screening and discovery. Some common DFT databases and high throughput tools are listed below: Databases: MaterialsProject.org, MaterialsWeb.org (University of Florida) HT software: Pymatgen, MPInterfaces, Matminer == Beyond computational methods? == The concept of materials informatics is addressed by the Materials Research Society. For example, materials informatics was the theme of the December 2006 issue of the MRS Bulletin. The issue was guest-edited by John Rodgers of Innovative Materials, Inc., and David Cebon of Cambridge University, who described the "high payoff for developing methodologies that will accelerate the insertion of materials, thereby saving millions of investment dollars." The editors focused on the limited definition of materials informatics as primarily focused on computational methods to process and interpret data. They stated that "specialized informatics tools for data capture, management, analysis, and dissemination" and "advances in computing power, coupled with computational modeling and simulation and materials properties databases" will enable such accelerated insertion of materials. A broader definition of materials informatics goes beyond the use of computational methods to carry out the same experimentation, viewing materials informatics as a framework in which a measurement or computation is one step in an information-based learning process that uses the power of a collective to achieve greater efficiency in exploration. When properly organized, this framework crosses materials boundaries to uncover fundamental knowledge of the basis of physical, mechanical, and engineering properties. == Challenges == While there are many who believe in the future of informatics in the materials development and scaling process, many challenges remain. Hill, et al., write that "Today, the materials community faces serious challenges to bringing about this data-accelerated research paradigm, including diversity of research areas within materials, lack of data standards, and missing incentives for sharing, among others. Nonetheless, the landscape is rapidly changing in ways that should benefit the entire materials research enterprise." This remaining tension between traditional materials development methodologies and the use of more computationally, machine learning, and analytics approaches will likely exist for some time as the materials industry overcomes some of the cultural barriers necessary to fully embrace such new ways of thinking. == Analogy from Biology == The overarching goals of bioinformatics and systems biology may provide a useful analogy. Andrew Murray of Harvard University expresses the hope that such an approach "will save us from the era of "one graduate student, one gene, one PhD". Similarly, the goal of materials informatics is to save us from one graduate student, one alloy, one PhD. Such goals will require more sophisticated strategies and research paradigms than applying data-science methods to the same tasks set currently undertaken by students.

    Read more →
  • Artificial intelligence in Brazilian industry

    Artificial intelligence in Brazilian industry

    In 2022, 16.9% (1,620) of the 9,586 Brazilian industrial companies with 100 or more employees used artificial intelligence in their operations Among the companies that used AI, the areas of administration (73.8%), product project development (65.9%), processes, services and marketing (65.1%) were those that used it the most, followed by the areas of production (56.4%) and logistics (48.4%). == Current scenario == === Adoption in Brazilian industrial sectors === In senior management, the majority (56%) of executives have a long-term vision for its use. The study also shows that IT, Innovation, and Marketing are the areas where AI use is most widespread, and that 43% of companies are developing or adapting the algorithms they use. The majority of large institutions that reported some type of AI use purchased these solutions from other companies (76%). Some factors for the adoption of artificial intelligence in companies include the establishment of an autonomous strategy by the company (87.0%), and the influence of suppliers and/or customers (63.0%) and the main difficulties in using technologies were high costs (80.8%), lack of qualified personnel in the company (54.6%) and excessive economic risks (49.5%). Three variables are considered the most relevant to explain the option to use AI: the implementation of a digital security policy, the size of companies with 250 or more employees and the characteristics of the company related to information and communication. When analyzing AI use by company size in Brazil, large companies have the highest proportion of AI use, mainly due to their investment capacity and technology experimentation. However, when comparing Brazil and Europe, indicators show an acceleration in AI use among large European companies, while in Brazil the situation remains stable. In 2023, 30% of large companies in the European bloc used some type of AI, a figure that rose to 41% in 2024, while in Brazil these proportions were 41% in 2023 and 38% in 2024. === Workforce === The challenge of upskilling begins with employees who are capable of understanding recent technological changes. Similarly, companies must create the environment and conditions for workforce development conducive to innovation, and universities must be prepared to provide knowledge aligned with the transition process, which in turn must be supported by public policies. The concern with training a specialized workforce in AI can be seen in the low number of graduates and PhDs in computer science and computer engineering in Brazil, compared to the number shown in other countries. As recorded in the document Recommendations for the Advancement of Artificial Intelligence in Brazil, 2019 data from the Coordination for the Improvement of Higher Education Personnel (CAPES) indicate that "the number of PhDs graduated annually in computing remained below 400 in 2016, and is not expected to have increased during the Covid-19 pandemic" (ABC, 2023). In the United States, by contrast, the number of PhDs graduated in these two areas has remained around 1,800 for the past 11 years, and during this period, the number of PhDs specializing in AI jumped from 10% to 19%. Based on data from the CNPq Lattes Platform (October 2019), it is possible to observe that the number of professionals in the AI field in Brazil is 4,429 specialists. This is still a small number compared to the 415,166 IT jobs in the country's business sector alone. === R&D, scientific production and integration with industry === China and the United States lead in the number of publications. These two countries are followed by the G7 members: India, Austria, South Korea, and Spain. Brazil appears in the next group, alongside the Netherlands, Russia, Indonesia, and Ireland. Regarding the promotion of research and technologies related to AI, public entities such as the Coordination for the Improvement of Higher Education Personnel (Capes) and the National Council for Scientific and Technological Development (CNPq) stood out as the main funders. Currently, different countries and territories have been promoting the development of Artificial Intelligence (AI). In the Brazilian case, one of the main initiatives is the creation of Engineering Research Centers/Applied Research Centers (CPE/CPA) in AI by the São Paulo Research Foundation (FAPESP), in collaboration with the Ministry of Science, Technology and Innovation (MCTI), the Ministry of Communications (MC) and the Brazilian Internet Steering Committee (CGI.br). In terms of the number of patents filed and the volume of investments, the leading nations in AI are the United States, China, France, Germany, the United Kingdom, Russia, India, Switzerland, Japan, South Korea, the Netherlands, Sweden, Finland, Ireland, Singapore, Canada, Israel, and Italy. Brazil appears among the top twenty countries in some rankings, mainly due to its good number of publications (approximately 10% of the number of articles published by the United States). The US is home to approximately 60% of the world's top AI researchers, followed by China (11%), Europe (10%), and Canada (6%). To change this scenario, in August 2024, the Brazilian government announced an investment of R$23 billion until 2028 in artificial intelligence, seeking to “transform the country into a global reference in innovation”. == Future challenges == The Organization for Economic Cooperation and Development (2020) report highlighted three factors that hinder the digital transformation journey and application of AI in Brazil: insufficient infrastructure, high costs due to the tax system, and financial limitations, such as limited access to financing. The costs of adopting technology, its incompatibility with the business, and the lack of training also represent obstacles that Brazilian industry must overcome. There are also inherent obstacles for companies. A McKinsey review emphasizes that once a company chooses one or more sectors to focus on, it must select specific applications. Buyers aren't interested in artificial intelligence simply because it's a breakthrough technology; they want AI to generate a good return on investment, whether by solving specific problems, saving money, or increasing sales. If an AI vendor tried to offer a horizontal solution, the value proposition might not be as compelling. Part of the solution to Brazil's technological backwardness involves building an ecosystem fueled by private institutions, universities, and governments.

    Read more →
  • Artificial empathy

    Artificial empathy

    Artificial empathy or computational empathy is the development of AI systems—such as companion robots or virtual agents—that can detect emotions and respond to them in an empathic way. Although such technology can be perceived as scary or threatening, it could also have a significant advantage over humans for roles in which emotional expression can be important, such as in the health care sector. An October 2025 review and meta-analysis in the British Medical Bulletin found that AI chatbots were rated as showing more empathy than human healthcare professionals in 13 of 15 studies that compared them. Care-givers who perform emotional labor above and beyond the requirements of paid labor can experience chronic stress or burnout, and can become desensitized to patients. Artificial empathy could also help the socialization of care-givers, or serve as role model for emotional detachment. A broader definition of artificial empathy is "the ability of nonhuman models to predict a person's internal state (e.g., cognitive, affective, physical) given the signals (s)he emits (e.g., facial expression, voice, gesture) or to predict a person's reaction (including, but not limited to internal states) when he or she is exposed to a given set of stimuli (e.g., facial expression, voice, gesture, graphics, music, etc.)". A 2025 study reported that some multimodal large language models can recognize basic facial expressions with human-level accuracy on a commonly used research dataset of posed facial expressions. == Areas of research == There are a variety of philosophical, theoretical, and applicative questions related to artificial empathy. For example: Which conditions would have to be met for a robot to respond competently to a human emotion? What models of empathy can or should be applied to Social and Assistive Robotics? Must the interaction of humans with robots imitate affective interaction between humans? Can a robot help science learn about affective development of humans? Would robots create unforeseen categories of inauthentic relations? What relations with robots can be considered authentic? How can we assess artificial empathy in AI systems? == Examples of artificial empathy research and practice == People often communicate and make decisions based on inferences about each other's internal states (e.g., emotional, cognitive, and physical states) that are in turn based on signals emitted by the person such as facial expression, body gesture, voice, and words. Broadly speaking, artificial empathy focuses on developing non-human models that achieve similar objectives using similar data. === Streams of artificial empathy research === Artificial empathy has been applied in various research disciplines, including artificial intelligence and business. Two main streams of research in this domain are: the use of nonhuman models to predict a person's internal state (e.g., cognitive, affective, physical) given the signals he or she emits (e.g., facial expression, voice, gesture) the use of nonhuman models to predict a person's reaction when he or she is exposed to a given set of stimuli (e.g., facial expression, voice, gesture, graphics, music, etc.). Research on affective computing, such as emotional speech recognition and facial expression detection, falls within the first stream of artificial empathy. Contexts that have been studied include oral interviews, call centers, human-computer interaction, sales pitches, and financial reporting. The second stream of artificial empathy has been researched more in marketing contexts, such as advertising, branding, customer reviews, in-store recommendation systems, movies, and online dating. === Artificial empathy applications in practice === With the increasing volume of visual, audio, and text data in commerce, many business applications for artificial empathy have followed. For example, Affectiva analyses viewers' facial expressions from video recordings while they are watching video advertisements in order to optimize the content design of video ads. Software like HireVue, BarRaiser, a hiring intelligence firm, helps firms make recruitment decisions by analyzing audio and video information from candidates' video interviews. Lapetus Solutions develops a model to estimate an individual's longevity, health status, and disease susceptibility from a face photo. Their technology has been applied in the insurance industry. == Artificial empathy and human services == Although artificial intelligence cannot yet replace social workers themselves, the technology has been deployed in that field. Florida State University published a study about Artificial Intelligence being used in the human services field. The research used computer algorithms to analyze health records for combinations of risk factors that could predict a future suicide attempt. The article reports, "machine learning—a future frontier for artificial intelligence—can predict with 80% to 90% accuracy whether someone will attempt suicide as far off as two years into the future. The algorithms become even more accurate as a person's suicide attempt gets closer. For example, the accuracy climbs to 92% one week before a suicide attempt when artificial intelligence focuses on general hospital patients". Such algorithmic machines can help social workers. Social work operates on a cycle of engagement, assessment, intervention, and evaluation with clients. Earlier assessment for risk of suicide can lead to earlier interventions and prevention, therefore saving lives. The system would learn, analyze, and detect risk factors, alerting the clinician of a patient's suicide risk score (analogous to a patient's cardiovascular risk score). Then, social workers could step in for further assessment and preventive intervention.

    Read more →
  • Conduit (company)

    Conduit (company)

    Conduit Ltd. is an international software company. From its founding in 2005 to 2013, its most well-known product was the Conduit toolbar, which was widely-described as malware. In 2013, it spun off its toolbar business; today, its main product is a mobile development platform that allows users to create native and web mobile applications for smartphones. == Products == From 2005 to 2013, the company's most well-known product was the Conduit toolbar, which is flagged by most antivirus software as potentially unwanted and adware. Conduit's toolbar software is often downloaded by malware packages from other publishers. The company spun off the toolbar division that manages the Conduit toolbar in 2013. Today, the company's main product is a mobile development platform that allows users to create native and web mobile applications for smartphones. App creation for its App Gallery is free, but it charges a monthly subscription fee to place apps on the App Store or Google Play. == History == Conduit was founded in 2005 by Shilo, Dror Erez, and Gaby Bilcyzk. Between years 2005 and 2013, it ran a successful but controversial toolbar platform business. Conduit was part of the so-called Download Valley companies monetizing free software and downloads by bundling adware. The toolbars were criticized by some as being very difficult to uninstall. The toolbar software was referred to as a "potentially unwanted program" by some in the computer industry because it could be used to change browser settings. The company had more than 400 employees in 2013. In September same year, Conduit spun off its entire website toolbar business division, which combined with Perion Network. After the deal, Conduit shareholders owned 81% of Perion's existing shares and both Perion and Conduit remained independent companies. The substantial size of the Conduit user base allowed Perion to immediately surpass AOL in U.S. searches. In 2015, Conduit announced it would purchase Keeprz, a mobile customer loyalty platform, for $45 million.

    Read more →
  • BuildingSMART Data Dictionary

    BuildingSMART Data Dictionary

    buildingSMART Data Dictionary (bSDD) is a service provided by buildingSMART which offers free data dictionaries for the international standardization of construction planning. The structure of bSDD was defined by the Nonprofit organization Buildingsmart and is used to describe objects and their attributes in a BIM process. == Aim == The aim of bSDD is to enable architects and planners to exchange and share building data across different specialists and language boundaries and thus avoid misunderstandings caused by different interpretations of terms. The bSDD standard extends the more general IFC. Software developers can access and use the dictionaries. In May 2025 over 300 dictionaries are available, including IFC, extensions to it such as Airport Domain IFC extension module or classification systems like Uniclass. == Structure == The main structural parts of bSDD are: Dictionary: A dictionary is a collection of classes: Class: A class describes the various object types, such as Bag drop or Baggage conveyor in airport planning. A class contains properties: Property: A property describes a part of a class, e.g. color or weight. Related properties are organized in a group: GroupOfProperties: A group organizes related properties, e.g. environmental properties or electrical properties. == Creating and managing a directory == Every dictionary in bSDD must be published in the name of a registered organization. As soon as the content is activated, it receives an unchangeable URI. This means that the content remains permanently in bSDD and cannot be deleted - this ensures stable use of the dictionary. It is only possible to change the status to inactive if it is no longer to be used - however, the dictionary remains permanently.

    Read more →
  • Systematic review

    Systematic review

    A systematic review is a scholarly synthesis of the evidence on a clearly presented topic using critical methods to identify, define and assess research on the topic. A systematic review extracts and interprets data from published studies on the topic (in the scientific literature), then analyzes, describes, critically appraises and summarizes interpretations into a refined evidence-based conclusion. For example, a systematic review of randomized controlled trials is a way of summarizing and implementing evidence-based medicine. Systematic reviews, sometimes along with meta-analyses, are generally considered the highest level of evidence in medical research. While a systematic review may be applied in the biomedical or health care context, it may also be used where an assessment of a precisely defined subject can advance understanding in a field of research. A systematic review may examine clinical tests, public health interventions, environmental interventions, social interventions, adverse effects, qualitative evidence syntheses, methodological reviews, policy reviews, and economic evaluations. Systematic reviews are closely related to meta-analyses, and often the same instance will combine both (being published with a subtitle of "a systematic review and meta-analysis"). The distinction between the two is that a meta-analysis uses statistical methods to induce a single number from the pooled data set (such as an effect size), whereas the strict definition of a systematic review excludes that step. However, in practice, when one is mentioned, the other may often be involved, as it takes a systematic review to assemble the information that a meta-analysis analyzes, and people sometimes refer to an instance as a systematic review, even if it includes the meta-analytical component. An understanding of systematic reviews and how to implement them in practice is common for professionals in health care, public health, and public policy. Systematic reviews contrast with a type of review often called a narrative review. Systematic reviews and narrative reviews both review the literature (the scientific literature), but the term literature review without further specification refers to a narrative review. == Characteristics == A systematic review can be designed to provide a thorough summary of current literature relevant to a research question. A systematic review uses a rigorous and transparent approach for research synthesis, with the aim of assessing and, where possible, minimizing bias in the findings. While many systematic reviews are based on an explicit quantitative meta-analysis of available data, there are also qualitative reviews and other types of mixed-methods reviews that adhere to standards for gathering, analyzing, and reporting evidence. Systematic reviews of quantitative data or mixed-method reviews sometimes use statistical techniques (meta-analysis) to combine results of eligible studies. Scoring levels are sometimes used to rate the quality of the evidence depending on the methodology used, although this is discouraged by the Cochrane Library. As evidence rating can be subjective, multiple people may be consulted to resolve any scoring differences between how evidence is rated. The EPPI-Centre, Cochrane, and the Joanna Briggs Institute have been influential in developing methods for combining both qualitative and quantitative research in systematic reviews. Several reporting guidelines exist to standardise reporting about how systematic reviews are conducted. Such reporting guidelines are not quality assessment or appraisal tools. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement suggests a standardized way to ensure a transparent and complete reporting of systematic reviews, and is now required for this kind of research by more than 170 medical journals worldwide. The latest version of this commonly used statement corresponds to PRISMA 2020 (the respective article was published in 2021). Several specialized PRISMA guideline extensions have been developed to support particular types of studies or aspects of the review process, including PRISMA-P for review protocols and PRISMA-ScR for scoping reviews. A list of PRISMA guideline extensions is hosted by the EQUATOR (Enhancing the QUAlity and Transparency Of health Research) Network. However, the PRISMA guidelines have been found to be limited to intervention research and the guidelines have to be changed in order to fit non-intervention research. As a result, Non-Interventional, Reproducible, and Open (NIRO) Systematic Reviews was created to counter this limitation. For qualitative reviews, reporting guidelines include ENTREQ (Enhancing transparency in reporting the synthesis of qualitative research) for qualitative evidence syntheses; RAMESES (Realist And MEta-narrative Evidence Syntheses: Evolving Standards) for meta-narrative and realist reviews; and eMERGe (Improving reporting of Meta-Ethnography) for meta-ethnograph. Developments in systematic reviews during the 21st century included realist reviews and the meta-narrative approach, both of which addressed problems of variation in methods and heterogeneity existing on some subjects. == Types == There are over 30 types of systematic review and Table 1 below non-exhaustingly summarises some of these. There is not always consensus on the boundaries and distinctions between the approaches described below. === Scoping reviews === Scoping reviews are distinct from systematic reviews in several ways. A scoping review is an attempt to search for concepts by mapping the language and data which surrounds those concepts and adjusting the search method iteratively to synthesize evidence and assess the scope of an area of inquiry. This can mean that the concept search and method (including data extraction, organisation and analysis) are refined throughout the process, sometimes requiring deviations from any protocol or original research plan. A scoping review may often be a preliminary stage before a systematic review, which 'scopes' out an area of inquiry and maps the language and key concepts to determine if a systematic review is possible or appropriate, or to lay the groundwork for a full systematic review. The goal can be to assess how much data or evidence is available regarding a certain area of interest. This process is further complicated if it is mapping concepts across multiple languages or cultures. As a scoping review should be systematically conducted and reported (with a transparent and repeatable method), some academic publishers categorize them as a kind of 'systematic review', which may cause confusion. Scoping reviews are helpful when it is not possible to carry out a systematic synthesis of research findings, for example, when there are no published clinical trials in the area of inquiry. Scoping reviews are helpful when determining if it is possible or appropriate to carry out a systematic review, and are a useful method when an area of inquiry is very broad, for example, exploring how the public are involved in all stages systematic reviews. There is still a lack of clarity when defining the exact method of a scoping review as it is both an iterative process and is still relatively new. There have been several attempts to improve the standardisation of the method, for example via a Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guideline extension for scoping reviews (PRISMA-ScR). PROSPERO (the International Prospective Register of Systematic Reviews) does not permit the submission of protocols of scoping reviews, although some journals will publish protocols for scoping reviews. == Stages == While there are multiple kinds of systematic review methods, the main stages of a review can be summarised as follows: === Defining the research question === Some reported that the 'best practices' involve 'defining an answerable question' and publishing the protocol of the review before initiating it to reduce the risk of unplanned research duplication and to enable transparency and consistency between methodology and protocol. Clinical reviews of quantitative data are often structured using the mnemonic PICO, which stands for 'Population or Problem', 'Intervention or Exposure', 'Comparison', and 'Outcome', with other variations existing for other kinds of research. For qualitative reviews, PICo is 'Population or Problem', 'Interest', and 'Context'. === Searching for sources === Relevant criteria can include selecting research that is of good quality and answers the defined question. The search strategy should be designed to retrieve literature that matches the protocol's specified inclusion and exclusion criteria. The methodology section of a systematic review should list all of the databases and citation indices that were searched. The titles and abstracts of identified articles can be checked against predetermined criteria for eligibility and r

    Read more →
  • DIKW pyramid

    DIKW pyramid

    The DIKW pyramid (also known as the knowledge pyramid or information hierarchy) is a model describing relationships between data, information, knowledge and wisdom sometimes also stylized as a chain, refer to models of possible structural and functional relationships between a set of components—often four, data, information, knowledge, and wisdom. The concept has roots predating the 1980s. In the latter years of that decade, interest in the models grew after explicit presentations and discussions, including from Milan Zeleny, Russell Ackoff, and Robert W. Lucky. Subsequent important discussions extended along theoretical and practical lines into the coming decades. While debate continues as to actual meaning of the component terms of DIKW-type models, and the actual nature of their relationships—including occasional doubt being cast over any simple, linear, unidirectional model—even so they have become very popular visual representations in use by business, the military, and others. Among the academic and popular, not all versions of the DIKW-type models include all four components (earlier ones excluding data, later ones excluding or downplaying wisdom, and several including additional components (for instance Ackoff inserting "understanding" before and Zeleny adding "enlightenment" after the wisdom component). In addition, DIKW-type models are no longer always presented as pyramids, instead also as a chart or framework (e.g., by Zeleny), as flow diagrams (e.g., by Liew, and by Chisholm et al.), and sometimes as a continuum (e.g., by Choo et al.). == Short description == As Rowley noted in 2007, the DIKW model "is often quoted, or used implicitly, in definitions of data, information and knowledge in the information management, information systems and knowledge management literatures, but [as of that date] there ha[d] been limited direct discussion of the hierarchy". Reviews of textbooks and a survey of scholars in relevant fields indicate that there was not a consensus as to definitions used in the model as of that date, and as reviewed by Liew in that year, even less "in the description of the processes that transform components lower in the hierarchy into those above them". Zins work, published in 2007—from studies in 2003-2005 that documented "130 definitions of data, information, and knowledge formulated by 45 scholars", published in 2007—to suggest that the data–information–knowledge components of DIKW refer to a class of no less than five models, as a function of whether data, information, and knowledge are each conceived of as subjective, objective (what Zins terms, "universal" or "collective") or both. In Zins' usage, subjective and objective "are not related to arbitrariness and truthfulness, which are usually attached to the concepts of subjective knowledge and objective knowledge". Information science, Zins argues, studies data and information, but not knowledge, as knowledge is an internal (subjective) rather than an external (universal–collective) phenomenon. == Representations == === Graphical representation === DIKW is a hierarchical model often depicted as a pyramid, sometimes as a chain, with data at its base and wisdom at its apex (or chain-beginning and -end). Both Zeleny and Ackoff have been credited with originating the pyramid representation, although neither used a pyramid to present their ideas. According to Wallace, Debons and colleagues may have been the first to "present the hierarchy graphically". Many variations of the DIKW-type pyramid have been produced. One, in use by knowledge managers in the United States Department of Defense, attempts to show the DIKW progression to enable effective decisions and consequent activities supporting shared understanding throughout defense organizations, as well as supporting management of risks associated with decisions. DIKW-type hierarchical information paradigms have also been represented as two-dimensional charts, and as flow diagrams, where relationships between the components may be presented less hierarchically, with defining aspects of the relationships, feedback loops, etc. === Computational representation === Intelligent decision support systems are trying to improve decision making by introducing new technologies and methods from the domain of modeling and simulation in general, and in particular from the domain of intelligent software agents in the contexts of agent-based modeling. The following example describes a military decision support system, but the architecture and underlying conceptual idea are transferable to other application domains: The value chain starts with data quality describing the information within the underlying command and control systems. Information quality tracks the completeness, correctness, currency, consistency and precision of the data items and information statements available. Knowledge quality deals with procedural knowledge and information embedded in the command and control system such as templates for adversary forces, assumptions about entities such as ranges and weapons, and doctrinal assumptions, often coded as rules. Awareness quality measures the degree of using the information and knowledge embedded within the command and control system. Awareness is explicitly placed in the cognitive domain. By the introduction of a common operational picture, data are put into context, which leads to information instead of data. The next step, which is enabled by service-oriented web-based infrastructures (but not yet operationally used), is the use of models and simulations for decision support. Simulation systems are the prototype for procedural knowledge, which is the basis for knowledge quality. Finally, using intelligent software agents to continually observe the battle sphere, apply models and simulations to analyze what is going on, to monitor the execution of a plan, and to do all the tasks necessary to make the decision maker aware of what is going on, command and control systems could even support situational awareness, the level in the value chain traditionally limited to pure cognitive methods. == History == Danny P. Wallace, a professor of library and information science, explained that the origin of the DIKW pyramid is uncertain: The presentation of the relationships among data, information, knowledge, and sometimes wisdom in a hierarchical arrangement has been part of the language of information science for many years. Although it is uncertain when and by whom those relationships were first presented, the ubiquity of the notion of a hierarchy is embedded in the use of the acronym DIKW as a shorthand representation for the data-to-information-to-knowledge-to-wisdom transformation.Many authors think that the idea of the DIKW relationship originated from two lines in the poem "Choruses", by T. S. Eliot, that appeared in the pageant play The Rock, in 1934: === Knowledge, intelligence, and wisdom === In 1927, Clarence W. Barron addressed his employees at Dow Jones & Company on the hierarchy: "Knowledge, Intelligence and Wisdom". === Data, information, knowledge === In 1955, English-American economist and educator Kenneth Boulding presented a variation on the hierarchy consisting of "signals, messages, information, and knowledge". However, "[t]he first author to distinguish among data, information, and knowledge and to also employ the term 'knowledge management' may have been American educator Nicholas L. Henry", in a 1974 journal article. === Data, information, knowledge, wisdom === Other early versions (prior to 1982) of the hierarchy that refer to a data tier include those of Chinese-American geographer Yi-Fu Tuan and sociologist-historian Daniel Bell.. In 1980, Irish-born engineer Mike Cooley invoked the same hierarchy in his critique of automation and computerization, in his book Architect or Bee?: The Human / Technology Relationship. Thereafter, in 1987, Czechoslovakia-born educator Milan Zeleny mapped the components of the hierarchy to knowledge forms: know-nothing, know-what, know-how, and know-why. Zeleny "has frequently been credited with proposing the [representation of DIKW as a pyramid ]... although he actually made no reference to any such graphical model." The hierarchy appears again in a 1988 address to the International Society for General Systems Research, by American organizational theorist Russell Ackoff, published in 1989. Subsequent authors and textbooks cite Ackoff's as the "original articulation" of the hierarchy or otherwise credit Ackoff with its proposal. Ackoff's version of the model includes an understanding tier (as Adler had, before him), interposed between knowledge and wisdom. Although Ackoff did not present the hierarchy graphically, he has also been credited with its representation as a pyramid. In 1989, Bell Labs veteran Robert W. Lucky wrote about the four-tier "information hierarchy" in the form of a pyramid in his book Silicon Dreams. In the same year as Ackoff presented his a

    Read more →
  • Lexical choice

    Lexical choice

    Lexical choice is the subtask of Natural language generation that involves choosing the content words (nouns, non-auxiliary verbs, adjectives, and adverbs) in a generated text. Function words (determiners, for example) are usually chosen during realisation. == Examples == The simplest type of lexical choice involves mapping a domain concept (perhaps represented in an ontology) to a word. For example, the concept Finger might be mapped to the word finger. A more complex situation is when a domain concept is expressed using different words in different situations. For example, the domain concept Value-Change can be expressed in many ways: The temperature rose: the verb rose is used for a Value-Change in temperature which increases the value. The temperature fell: the verb fell is used for a Value-Change in temperature which decreases the value. The rain got heavier: the phrase got heavier is used for a Value-Change in precipitation amount when the precipitation is rain. Sometimes words can communicate additional contextual information, for example: The temperature plummeted: the verb plummeted is used for a Value-Change in temperature which decreases the value, when the change is rapid and large. Contextual information is especially significant for vague terms such as tall. For example, a 2m tall man is tall, but a 2m tall horse is small. == Linguistic perspective == Lexical choice modules must be informed by linguistic knowledge of how the system's input data maps onto words. This is a question of semantics, but it is also influenced by syntactic factors (such as collocation effects) and pragmatic factors (such as context). Hence NLG systems need linguistic models of how meaning is mapped to words in the target domain (genre) of the NLG system. Genre tends to be very important; for example the verb veer has a very specific meaning in weather forecasts (wind direction is changing in a clockwise direction) which it does not have in general English, and a weather-forecast generator must be aware of this genre-specific meaning. In some cases there are major differences in how different people use the same word; for example, some people use by evening to mean 6PM and others use it to mean midnight. Psycholinguists have shown that when people speak to each other, they agree on a common interpretation via lexical alignment; this is not something which NLG systems can yet do. Ultimately, lexical choice must deal with the fundamental issue of how language relates to the non-linguistic world. For example, a system which chose colour terms such as red to describe objects in a digital image would need to know which RGB pixel values could generally be described as red; how this was influenced by visual (lighting, other objects in the scene) and linguistic (other objects being discussed) context; what pragmatic connotations were associated with red (for example, when an apple is called red, it is assumed to be ripe as well as have the colour red); and so forth. == Algorithms and models == A number of algorithms and models have been developed for lexical choice in the research community, for example Edmonds developed a model for choosing between near-synonyms (words with similar core meanings but different connotations). However such algorithms and models have not been widely used in applied NLG systems; such systems have instead often used quite simple computational models, and invested development effort in linguistic analysis instead of algorithm development.

    Read more →
  • Manhattan address algorithm

    Manhattan address algorithm

    The Manhattan address algorithm is a series of formulas used to estimate the closest east–west cross street for building numbers on north–south avenues in the New York City borough of Manhattan. == Algorithm == To find the approximate number of the closest cross street, divide the building number by a divisor (generally 20) and add (or subtract) the "tricky number" from the table below: For the north–south avenues, there are typically 20 address numbers between consecutive east–west streets (10 on either side of the avenue). A standard land lot on each avenue was originally 20 feet (6.1 m) wide, and there is about 200 feet (61 m) between each pair of east–west streets, for ten land lots between each pair of streets. The exceptions are Riverside Drive, as well as Fifth Avenue and Central Park West between 59th and 110th streets, which use a divisor of 10. These avenues all have buildings only on one side of the street, with a park on the other side. The "tricky number" often corresponds to a street near the southern end of the avenue. There are some notable exceptions: York Avenue address numbers are continuations of Avenue A address numbers, since the avenue was originally called Avenue A. East End Avenue address numbers are continuations of Avenue B address numbers, since the avenue was originally called Avenue B. Sixth Avenue and Broadway start south of Houston Street, the southern boundary of the Manhattan street numbering system. Although Park Avenue's southern terminus is at 32nd Street, a homeowner at 34th Street wanted the address "1 Park Avenue" (this was later changed). === Examples === For example, if you are at 62 Avenue B, 62 ÷ 20 ≈ 3 {\displaystyle 62\div 20\approx 3} , then add the "tricky number" 3 {\displaystyle 3} to give 6 {\displaystyle 6} . The nearest cross street to 62 Avenue B is East 6th Street. If you are at 78 Riverside Drive, 78 ÷ 10 ≈ 8 {\displaystyle 78\div 10\approx 8} , then add the "tricky number" 72 {\displaystyle 72} to give 80 {\displaystyle 80} . The nearest cross street to 78 Riverside Drive is West 80th Street. If you are at 501 5th Avenue, 501 ÷ 20 ≈ 25 {\displaystyle 501\div 20\approx 25} , then add the "tricky number" 18 {\displaystyle 18} to give 43 {\displaystyle 43} . The nearest cross street to 501 5th Avenue is actually 42nd Street, not 43rd Street, as the Manhattan address algorithm only gives approximate answers.

    Read more →
  • Time Warp Edit Distance

    Time Warp Edit Distance

    In the data analysis of time series, Time Warp Edit Distance (TWED) is a measure of similarity (or dissimilarity) between pairs of discrete time series, controlling the relative distortion of the time units of the two series using the physical notion of elasticity. In comparison to other distance measures, (e.g. DTW (dynamic time warping) or LCS (longest common subsequence problem)), TWED is a metric. Its computational time complexity is O ( n 2 ) {\displaystyle O(n^{2})} , but can be drastically reduced in some specific situations by using a corridor to reduce the search space. Its memory space complexity can be reduced to O ( n ) {\displaystyle O(n)} . It was first proposed in 2009 by P.-F. Marteau. == Definition == δ λ , ν ( A 1 p , B 1 q ) = M i n { δ λ , ν ( A 1 p − 1 , B 1 q ) + Γ ( a p ′ → Λ ) d e l e t e i n A δ λ , ν ( A 1 p − 1 , B 1 q − 1 ) + Γ ( a p ′ → b q ′ ) m a t c h o r s u b s t i t u t i o n δ λ , ν ( A 1 p , B 1 q − 1 ) + Γ ( Λ → b q ′ ) d e l e t e i n B {\displaystyle \delta _{\lambda ,\nu }(A_{1}^{p},B_{1}^{q})=Min{\begin{cases}\delta _{\lambda ,\nu }(A_{1}^{p-1},B_{1}^{q})+\Gamma (a_{p}^{'}\to \Lambda )&{\rm {delete\ in\ A}}\\\delta _{\lambda ,\nu }(A_{1}^{p-1},B_{1}^{q-1})+\Gamma (a_{p}^{'}\to b_{q}^{'})&{\rm {match\ or\ substitution}}\\\delta _{\lambda ,\nu }(A_{1}^{p},B_{1}^{q-1})+\Gamma (\Lambda \to b_{q}^{'})&{\rm {delete\ in\ B}}\end{cases}}} whereas Γ ( α p ′ → Λ ) = d L P ( a p ′ , a p − 1 ′ ) + ν ⋅ ( t a p − t a p − 1 ) + λ {\displaystyle \Gamma (\alpha _{p}^{'}\to \Lambda )=d_{LP}(a_{p}^{'},a_{p-1}^{'})+\nu \cdot (t_{a_{p}}-t_{a_{p-1}})+\lambda } Γ ( α p ′ → b q ′ ) = d L P ( a p ′ , b q ′ ) + d L P ( a p − 1 ′ , b q − 1 ′ ) + ν ⋅ ( | t a p − t b q | + | t a p − 1 − t b q − 1 | ) {\displaystyle \Gamma (\alpha _{p}^{'}\to b_{q}^{'})=d_{LP}(a_{p}^{'},b_{q}^{'})+d_{LP}(a_{p-1}^{'},b_{q-1}^{'})+\nu \cdot (|t_{a_{p}}-t_{b_{q}}|+|t_{a_{p-1}}-t_{b_{q-1}}|)} Γ ( Λ → b q ′ ) = d L P ( b p ′ , b p − 1 ′ ) + ν ⋅ ( t b q − t b q − 1 ) + λ {\displaystyle \Gamma (\Lambda \to b_{q}^{'})=d_{LP}(b_{p}^{'},b_{p-1}^{'})+\nu \cdot (t_{b_{q}}-t_{b_{q-1}})+\lambda } Whereas the recursion δ λ , ν {\displaystyle \delta _{\lambda ,\nu }} is initialized as: δ λ , ν ( A 1 0 , B 1 0 ) = 0 , {\displaystyle \delta _{\lambda ,\nu }(A_{1}^{0},B_{1}^{0})=0,} δ λ , ν ( A 1 0 , B 1 j ) = ∞ f o r j ≥ 1 {\displaystyle \delta _{\lambda ,\nu }(A_{1}^{0},B_{1}^{j})=\infty \ {\rm {{for\ }j\geq 1}}} δ λ , ν ( A 1 i , B 1 0 ) = ∞ f o r i ≥ 1 {\displaystyle \delta _{\lambda ,\nu }(A_{1}^{i},B_{1}^{0})=\infty \ {\rm {{for\ }i\geq 1}}} with a 0 ′ = b 0 ′ = 0 {\displaystyle a'_{0}=b'_{0}=0} === Implementations === An implementation of the TWED algorithm in C with a Python wrapper is available at TWED is also implemented into the Time Series Subsequence Search Python package (TSSEARCH for short) available at [1]. An R implementation of TWED has been integrated into the TraMineR, a R package for mining, describing and visualizing sequences of states or events, and more generally discrete sequence data. Additionally, cuTWED is a CUDA- accelerated implementation of TWED which uses an improved algorithm due to G. Wright (2020). This method is linear in memory and massively parallelized. cuTWED is written in CUDA C/C++, comes with Python bindings, and also includes Python bindings for Marteau's reference C implementation. ==== Python ==== Backtracking, to find the most cost-efficient path: ==== MATLAB ==== Backtracking, to find the most cost-efficient path:

    Read more →
  • Voiceverse NFT plagiarism scandal

    Voiceverse NFT plagiarism scandal

    In January 2022, 15—the pseudonymous Massachusetts Institute of Technology (MIT) artificial intelligence researcher and creator of the non-commercial generative artificial intelligence voice synthesis research project 15.ai—discovered that the blockchain-based technology company Voiceverse had plagiarized from their platform. Voiceverse marketed itself as a service that offered AI voice cloning technology that could be purchased and traded as non-fungible tokens (NFTs). Amid heightened controversy over NFTs in the gaming industry, voice actor Troy Baker (who has been described as one of the most famous voice actors in video games) announced his partnership with Voiceverse on January 14, 2022, triggering immediate backlash over concerns about the environmental impact of NFTs, potential for fraud, predatory monetization in video games, and the potential of AI displacing jobs for human voice actors. Later that same day, 15 revealed through server logs that Voiceverse had generated voice lines using 15's free text-to-speech platform, pitch-shifted the audio to make them unrecognizable, and falsely marketed the samples as their own technology before selling them as NFTs. Within an hour of being confronted with evidence, Voiceverse confessed and stated that their marketing team had used 15.ai without proper attribution while rushing to create a technology demo to coincide with Baker's partnership announcement, further exacerbating the already negative reception to the original announcement. In response, 15 replied "Go fuck yourself"; the interaction went viral and garnered a large amount of support for the developer. News publications universally characterized this incident as Voiceverse having "stolen" from 15.ai. The next day, Baker appeared on a podcast and stated that his motivation had been to help independent creators who were unable to afford professional voice actors. Following continued backlash and the plagiarism revelation, Baker ended his partnership with Voiceverse on January 31, 2022. Subsequently, the incident was documented in multiple AI ethics databases, criticisms of predatory monetization in video games, and retrospectives as one of the earliest instances of plagiarism and theft stemming from artificial intelligence during the AI boom. == Background == === Troy Baker === Troy Baker is a prominent voice actor in the video game industry best known for his performances as Joel Miller in The Last of Us franchise. Baker has been described as "ubiquitous" by Polygon, "one of the most high-profile and prolific voice actors in video games" by Eurogamer, and "arguably the most famous voice actor in the gaming industry" by GameGuru. His other prominent roles include voicing Agent John "Jonesy" Jones in Fortnite, Booker DeWitt in BioShock Infinite, and both Batman and Joker in multiple Batman video games. As of October 2025, Baker holds the record for the most acting nominations at the BAFTA Games Awards, with five between 2013 and 2021. === Voiceverse === Voiceverse is a blockchain-based startup founded by the Bored Ape Yacht Club that marketed itself as offering AI voice cloning technology in the form of NFTs. Prior to the announcement of their partnership with Baker, Voiceverse had partnered with LOVO, Inc., an AI voice platform that, according to LOVO, could generate human-like voices. Voiceverse stated that any user who purchases a voice NFT would have unlimited and perpetual access to the voice model, which could be used to create content such as audiobooks, YouTube videos, podcasts, e-learning materials, in-game voice chat, and Zoom calls. Voiceverse promised that buyers would "OWN [sic] all of the IP" of content they created using these voices. Voiceverse's roadmap included plans to release 8,888 initial voice NFTs, a feature to add emotions to existing voices, and the ability for users to mint their own voices as NFTs. Prior to Baker's partnership, Voiceverse had also partnered with voice actors Charlet Chung, who voices D.Va in Overwatch, and Andy Milonakis of The Andy Milonakis Show. === 15.ai === 15.ai is a free web application launched in 2020 that uses artificial intelligence to generate text-to-speech voices of fictional characters from popular media. Created by a pseudonymous artificial intelligence researcher known as 15, who began developing the technology as a freshman during their undergraduate research at MIT, it was an early example of an application of generative artificial intelligence during the initial stages of the AI boom. The platform showed that deep neural networks could generate emotionally expressive speech with only 15 seconds of speech; the name "15.ai" references the creator's statement that a voice can be convincingly cloned with just 15 seconds of audio, as opposed to the tens of hours of data previously required. 15.ai became an Internet phenomenon in early 2021 when content utilizing it went viral on social media and quickly gained widespread use among various Internet fandoms. 15 has emphasized that it remain free and non-commercial; it only requires users to give proper credit when using the service for content creation. === NFTs in the video game industry === By early 2022, NFTs had become highly controversial within the gaming industry. Critics raised concerns about their environmental impact due to the significant energy consumption of blockchain technology. In addition, the prevalence of scams, fraud, and potential money laundering associated with NFT sales, as well as fears that NFTs were a new form of predatory monetization following the increasing frequency of loot boxes, caused vocal pushback from the gaming community. Several major gaming companies had begun exploring NFT integration into their products, though fan backlash had already forced some projects to be cancelled. On December 16, 2021, the developers of S.T.A.L.K.E.R. 2: Heart of Chernobyl announced that they would be including NFTs in the game, but cancelled within an hour of the announcement due to immediate universal backlash. Simultaneously, the rise of AI voice technology raised concerns among voice actors about potential job displacement and the devaluation of their work amidst the voice acting industry's ongoing struggles for better compensation and working conditions. == Partnership announcement and backlash == On January 14, 2022, 1:02 a.m. EST, Baker announced on Twitter that he was partnering with Voiceverse "to explore ways where together we might bring new tools to new creators to make new things, and allow everyone a chance to own & invest in the IP's they create." The announcement concluded with the statement "You can hate. Or you can create." Baker's specific role with Voiceverse remained unclear at the time of the announcement. Along with Baker's announcement, Voiceverse promoted their supposed voice AI technology on Twitter by posting animated videos that featured a cat character created by NFT firm Chubbiverse. The videos concluded with text that read "The Voice Powered By Voiceverse"; Voiceverse stated on Twitter that the voices in the animations had been generated using their own AI voice synthesis technology and presented the videos as a technology demonstration of their voice NFT capabilities. The announcement provoked immediate and widespread backlash from the gaming community. Baker's tweet received thousands of replies and quote retweets (the vast majority of which were negative), far more than the number of likes; Michael McWhertor of Polygon described it as a "textbook example of being ratioed" and commented that reactions had been amplified by the final part of Baker's announcement. Michael Beckwith of Metro called Baker's approach "bizarrely aggressive". Later that day, Baker responded to the backlash by apologizing for his choice of words. He said he appreciated people's thoughts and acknowledged that the "hate/create part might have been a bit antagonistic," calling it a "bad attempt to bring levity". Despite the apology, Baker and his fellow voice actors did not distance themselves from Voiceverse at this point. At the same time, Voiceverse attempted to address the criticisms, stating that they were working to move to more environmentally friendly blockchain technology and that voice actors would receive royalties from NFT sales, with actors benefiting from any increase in NFT value. == Plagiarism revelation == On December 13, 2021, amidst the increasingly negative reactions toward NFTs among the general public, the creator of 15.ai (known pseudonymously as 15) announced that they had "no interest in incorporating NFTs into any aspect of [their] work." On January 14, 2022, 11:17 a.m. EST (10 hours after Baker's initial announcement), 15 commented on the Voiceverse venture, stating that it "sounds like a scam". Two hours later, at 1:20 p.m., 15 explicitly accused Voiceverse of "actively attempting to appropriate [15's] work for [Voiceverse's] own benefit." 15 provided evidence through

    Read more →
  • Empowerment (artificial intelligence)

    Empowerment (artificial intelligence)

    Empowerment in the field of artificial intelligence formalises and quantifies (via information theory) the potential an agent perceives that it has to influence its environment. An agent which follows an empowerment maximising policy, acts to maximise future options (typically up to some limited horizon). Empowerment can be used as a (pseudo) utility function that depends only on information gathered from the local environment to guide action, rather than seeking an externally imposed goal, thus is a form of intrinsic motivation. The empowerment formalism depends on a probabilistic model commonly used in artificial intelligence. An autonomous agent operates in the world by taking in sensory information and acting to change its state, or that of the environment, in a cycle of perceiving and acting known as the perception-action loop. Agent state and actions are modelled by random variables ( S : s ∈ S , A : a ∈ A {\displaystyle S:s\in {\mathcal {S}},A:a\in {\mathcal {A}}} ) and time ( t {\displaystyle t} ). The choice of action depends on the current state, and the future state depends on the choice of action, thus the perception-action loop unrolled in time forms a causal bayesian network. == Definition == Empowerment ( E {\displaystyle {\mathfrak {E}}} ) is defined as the channel capacity ( C {\displaystyle C} ) of the actuation channel of the agent, and is formalised as the maximal possible information flow between the actions of the agent and the effect of those actions some time later. Empowerment can be thought of as the future potential of the agent to affect its environment, as measured by its sensors. E := C ( A t ⟶ S t + 1 ) ≡ max p ( a t ) I ( A t ; S t + 1 ) {\displaystyle {\mathfrak {E}}:=C(A_{t}\longrightarrow S_{t+1})\equiv \max _{p(a_{t})}I(A_{t};S_{t+1})} In a discrete time model, Empowerment can be computed for a given number of cycles into the future, which is referred to in the literature as 'n-step' empowerment. E ( A t n ⟶ S t + n ) = max p ( a t , . . . , a t + n − 1 ) I ( A t , . . . , A t + n − 1 ; S t + n ) {\displaystyle {\mathfrak {E}}(A_{t}^{n}\longrightarrow S_{t+n})=\max _{p(a_{t},...,a_{t+n-1})}I(A_{t},...,A_{t+n-1};S_{t+n})} The unit of empowerment depends on the logarithm base. Base 2 is commonly used in which case the unit is bits. === Contextual Empowerment === In general the choice of action (action distribution) that maximises empowerment varies from state to state. Knowing the empowerment of an agent in a specific state is useful, for example to construct an empowerment maximising policy. State-specific empowerment can be found using the more general formalism for 'contextual empowerment'. C {\displaystyle C} is a random variable describing the context (e.g. state). E ( A t n ⟶ S t + n ∣ C ) = ∑ c ∈ C p ( c ) E ( A t n ⟶ S t + n ∣ C = c ) {\displaystyle {\mathfrak {E}}(A_{t}^{n}\longrightarrow S_{t+n}{\mid }C)=\sum _{c{\in }C}p(c){\mathfrak {E}}(A_{t}^{n}\longrightarrow S_{t+n}{\mid }C=c)} == Application == Empowerment maximisation can be used as a pseudo-utility function to enable agents to exhibit intelligent behaviour without requiring the definition of external goals, for example balancing a pole in a cart-pole balancing scenario where no indication of the task is provided to the agent. Empowerment has been applied in studies of collective behaviour and in continuous domains. As is the case with Bayesian methods in general, computation of empowerment becomes computationally expensive as the number of actions and time horizon extends, but approaches to improve efficiency have led to usage in real-time control. Empowerment has been used for intrinsically motivated reinforcement learning agents playing video games, and in the control of underwater vehicles.

    Read more →
  • Operational data store

    Operational data store

    An operational data store (ODS) is used for operational reporting and as a source of data for the enterprise data warehouse (EDW). It is a complementary element to an EDW in a decision support environment, and is used for operational reporting, controls, and decision making, as opposed to the EDW, which is used for tactical and strategic decision support. An ODS is a database designed to integrate data from multiple sources for additional operations on the data, for reporting, controls and operational decision support. Unlike a production master data store, the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting. An ODS should not be confused with an enterprise data hub (EDH). An operational data store will take transactional data from one or more production systems and loosely integrate it, in some respects it is still subject oriented, integrated and time variant, but without the volatility constraints. This integration is mainly achieved through the use of EDW structures and content. An ODS is not an intrinsic part of an EDH solution, although an EDH may be used to subsume some of the processing performed by an ODS and the EDW. An EDH is a broker of data. An ODS is certainly not. Because the data originates from multiple sources, the integration often involves cleaning, resolving redundancy and checking against business rules for integrity. An ODS is usually designed to contain low-level or atomic (indivisible) data (such as transactions and prices) with limited history that is captured "real time" or "near real time" as opposed to the much greater volumes of data stored in the data warehouse generally on a less-frequent basis. == General use == The general purpose of an ODS is to integrate data from disparate source systems in a single structure, using data integration technologies like data virtualization, data federation, or extract, transform, and load (ETL). This will allow operational access to the data for operational reporting, master data or reference data management. An ODS is not a replacement or substitute for a data warehouse or for a data hub but in turn could become a source.

    Read more →
  • Flajolet–Martin algorithm

    Flajolet–Martin algorithm

    The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements in a stream with a single pass and space-consumption logarithmic in the maximal number of possible distinct elements in the stream (the count-distinct problem). The algorithm was introduced by Philippe Flajolet and G. Nigel Martin in their 1984 article "Probabilistic Counting Algorithms for Data Base Applications". Later it has been refined in "LogLog counting of large cardinalities" by Marianne Durand and Philippe Flajolet, and "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm" by Philippe Flajolet et al. In their 2010 article "An optimal algorithm for the distinct elements problem", Daniel M. Kane, Jelani Nelson and David P. Woodruff give an improved algorithm, which uses nearly optimal space and has optimal O(1) update and reporting times. == The algorithm == Assume that we are given a hash function h a s h ( x ) {\displaystyle \mathrm {hash} (x)} that maps input x {\displaystyle x} to integers in the range [ 0 ; 2 L − 1 ] {\displaystyle [0;2^{L}-1]} , and where the outputs are sufficiently uniformly distributed. Note that the set of integers from 0 to 2 L − 1 {\displaystyle 2^{L}-1} corresponds to the set of binary strings of length L {\displaystyle L} . For any non-negative integer y {\displaystyle y} , define b i t ( y , k ) {\displaystyle \mathrm {bit} (y,k)} to be the k {\displaystyle k} -th bit in the binary representation of y {\displaystyle y} , such that: y = ∑ k ≥ 0 b i t ( y , k ) 2 k . {\displaystyle y=\sum _{k\geq 0}\mathrm {bit} (y,k)2^{k}.} We then define a function ρ ( y ) {\displaystyle \rho (y)} that outputs the position of the least-significant set bit in the binary representation of y {\displaystyle y} , and L {\displaystyle L} if no such set bit can be found as all bits are zero: ρ ( y ) = { min { k ≥ 0 ∣ b i t ( y , k ) ≠ 0 } y > 0 L y = 0 {\displaystyle \rho (y)={\begin{cases}\min\{k\geq 0\mid \mathrm {bit} (y,k)\neq 0\}&y>0\\L&y=0\end{cases}}} Note that with the above definition we are using 0-indexing for the positions, starting from the least significant bit. For example, ρ ( 13 ) = ρ ( 1101 2 ) = 0 {\displaystyle \rho (13)=\rho (1101_{2})=0} , since the least significant bit is a 1 (0th position), and ρ ( 8 ) = ρ ( 1000 2 ) = 3 {\displaystyle \rho (8)=\rho (1000_{2})=3} , since the least significant set bit is at the 3rd position. At this point, note that under the assumption that the output of our hash function is uniformly distributed, then the probability of observing a hash output ending with 2 k {\displaystyle 2^{k}} (a one, followed by k {\displaystyle k} zeroes) is 2 − ( k + 1 ) {\displaystyle 2^{-(k+1)}} , since this corresponds to flipping k {\displaystyle k} heads and then a tail with a fair coin. Now the Flajolet–Martin algorithm for estimating the cardinality of a multiset M {\displaystyle M} is as follows: Initialize a bit-vector BITMAP to be of length L {\displaystyle L} and contain all 0s. For each element x {\displaystyle x} in M {\displaystyle M} : Calculate the index i = ρ ( h a s h ( x ) ) {\displaystyle i=\rho (\mathrm {hash} (x))} . Set B I T M A P [ i ] = 1 {\displaystyle \mathrm {BITMAP} [i]=1} . Let R {\displaystyle R} denote the smallest index i {\displaystyle i} such that B I T M A P [ i ] = 0 {\displaystyle \mathrm {BITMAP} [i]=0} . Estimate the cardinality of M {\displaystyle M} as 2 R / ϕ {\displaystyle 2^{R}/\phi } , where ϕ ≈ 0.77351 {\displaystyle \phi \approx 0.77351} . The idea is that if n {\displaystyle n} is the number of distinct elements in the multiset M {\displaystyle M} , then B I T M A P [ 0 ] {\displaystyle \mathrm {BITMAP} [0]} is accessed approximately n / 2 {\displaystyle n/2} times, B I T M A P [ 1 ] {\displaystyle \mathrm {BITMAP} [1]} is accessed approximately n / 4 {\displaystyle n/4} times and so on. Consequently, if i ≫ log 2 ⁡ n {\displaystyle i\gg \log _{2}n} , then B I T M A P [ i ] {\displaystyle \mathrm {BITMAP} [i]} is almost certainly 0, and if i ≪ log 2 ⁡ n {\displaystyle i\ll \log _{2}n} , then B I T M A P [ i ] {\displaystyle \mathrm {BITMAP} [i]} is almost certainly 1. If i ≈ log 2 ⁡ n {\displaystyle i\approx \log _{2}n} , then B I T M A P [ i ] {\displaystyle \mathrm {BITMAP} [i]} can be expected to be either 1 or 0. The correction factor ϕ ≈ 0.77351 {\displaystyle \phi \approx 0.77351} (OEIS: A244256) is found by calculations, which can be found in the original article. == Improving accuracy == A problem with the Flajolet–Martin algorithm in the above form is that the results vary significantly. A common solution has been to run the algorithm multiple times with k {\displaystyle k} different hash functions and combine the results from the different runs. One idea is to take the mean of the k {\displaystyle k} results together from each hash function, obtaining a single estimate of the cardinality. The problem with this is that averaging is very susceptible to outliers (which are likely here). A different idea is to use the median, which is less prone to be influences by outliers. The problem with this is that the results can only take form 2 R / ϕ {\displaystyle 2^{R}/\phi } , where R {\displaystyle R} is integer. A common solution is to combine both the mean and the median: Create k ⋅ l {\displaystyle k\cdot l} hash functions and split them into k {\displaystyle k} distinct groups (each of size l {\displaystyle l} ). Within each group use the mean for aggregating together the l {\displaystyle l} results, and finally take the median of the k {\displaystyle k} group estimates as the final estimate. The 2007 HyperLogLog algorithm splits the multiset into subsets and estimates their cardinalities, then it uses the harmonic mean to combine them into an estimate for the original cardinality.

    Read more →