AI Content Provenance

AI Content Provenance — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • AZFinText

    AZFinText

    Arizona Financial Text System (AZFinText) is a textual-based quantitative financial prediction system written by Robert P. Schumaker of University of Texas at Tyler and Hsinchun Chen of the University of Arizona. == System == This system differs from other systems in that it uses financial text as one of its key means of predicting stock price movement. This reduces the information lag-time problem evident in many similar systems where new information must be transcribed (e.g., such as losing a costly court battle or having a product recall), before the quant can react appropriately. AZFinText overcomes these limitations by utilizing the terms used in financial news articles to predict future stock prices twenty minutes after the news article has been released. It is believed that certain article terms can move stocks more than others. Terms such as factory exploded or workers strike will have a depressing effect on stock prices whereas terms such as earnings rose will tend to increase stock prices. The AZFinText system analyzes financial news to identify the patterns in how investors react to such specific information. It uses methods like sentiment analysis and term weighting to examine the text of news articles. This system is designed to find price differences that occur when the market responds to news stories. This approach provides an alternative and easier method for predicting stock market movements. == Overview of research == The foundation of AZFinText can be found in the ACM TOIS article. Within this paper, the authors tested several different prediction models and linguistic textual representations. From this work, it was found that using the article terms and the price of the stock at the time the article was released was the most effective model and using proper nouns was the most effective textual representation technique. Combining the two, AZFinText netted a 2.84% trading return over the five-week study period. AZFinText was then extended to study what combination of peer organizations help to best train the system. Using the premise that IBM has more in common with Microsoft than GM, AZFinText studied the effect of varying peer-based training sets. To do this, AZFinText trained on the various levels of GICS and evaluated the results. It was found that sector-based training was most effective, netting an 8.50% trading return, outperforming Jim Cramer, Jim Jubak and DayTraders.com during the study period. AZFinText was also compared against the top 10 quantitative systems and outperformed 6 of them. A third study investigated the role of portfolio building in a textual financial prediction system. From this study, Momentum and Contrarian stock portfolios were created and tested. Using the premise that past winning stocks will continue to win and past losing stocks will continue to lose, AZFinText netted a 20.79% return during the study period. It was also noted that traders were generally overreacting to news events, creating the opportunity of abnormal returns. A fourth study looked into using author sentiment as an added predictive variable. Using the premise that an author can unwittingly influence market trades simply by the terms they use, AZFinText was tested using tone and polarity features. It was found that Contrarian activity was occurring within the market, where articles of a positive tone would decrease in price and articles of a negative tone would increase in price. A further study investigated what article verbs have the most influence on stock price movement. From this work, it was found that planted, announcing, front, smaller and crude had the highest positive impact on stock price. == Notable publicity == AZFinText has been the topic of discussion by numerous media outlets. Some of the more notable ones include The Wall Street Journal, MIT's Technology Review, Dow Jones Newswire, WBIR in Knoxville, TN, Slashdot and other media outlets.

    Read more →
  • Shapiro–Senapathy algorithm

    Shapiro–Senapathy algorithm

    The Shapiro—Senapathy algorithm (S&S) is a computational method for identifying splice sites in eukaryotic genes. The algorithm employs a Position Weight Matrix (PWM) scoring formula to predict donor and acceptor splice sites in any given gene. This methodology has been used to discover splice sites and disease-causing splice site mutations in the human genome, and has become a standard tool in clinical genomics. The S&S algorithm has been cited in thousands of clinical studies, according to Google Scholar. It has also formed the basis of widely used software, including Human Splicing Finder, SROOGLE, and Alamut, which identify splice sites and splice site mutations that cause disease. The algorithm has uncovered splicing mutations in diseases ranging from cancers to inherited disorders, and predicted the deleterious effects of these mutations including exon skipping, intron retention, and cryptic splice site activation. == The algorithm == A splice site defines the boundary between a coding exon and a non-coding intron in eukaryotic genes. The S&S algorithm employs a sliding window, corresponding to the length of the splice site motif, to scan a gene sequence and detect potential splice sites. For each sliding window, the algorithm calculates a score by comparing the nucleotide sequence to a Position Weight Matrix (PWM) derived from known splice sites. This formula generates a percentile score, indicating the likelihood that a given sequence functions as a donor or acceptor splice site. The majority of disease-causing mutations in the human genome are located in splice sites. Clinical genomics studies analyze the splice site scores generated by the S&S algorithm to predict the consequences of splice site mutations including exon skipping and intron retention. The algorithm's sensitivity to single-nucleotide changes allows it to determine mutations that may impact RNA splicing and contribute to disease. In addition to identifying real splice sites, the S&S algorithm has been used to discover cryptic splice sites — alternative splice sites activated by mutations — which may disrupt normal splicing. The algorithm detects mutations that lead to the activation of cryptic splice sites, which may be located proximal to real splice sites or deep within non-coding introns. It has thus been used to determine the causes of numerous diseases that are due to cryptic splicing. == Cancer gene discovery using S&S == The S&S algorithm has been used to identify splice-site mutations in genes associated with several cancers. For example, genes causing commonly occurring cancers including breast cancer, ovarian cancer, colorectal cancer, leukemia, head and neck cancers, prostate cancer, retinoblastoma, squamous cell carcinoma, gastrointestinal cancer, melanoma, liver cancer, Lynch syndrome, skin cancer, and neurofibromatosis have been found. In addition, splicing mutations in genes causing less commonly known cancers including gastric cancer, gangliogliomas, Li-Fraumeni syndrome, Loeys–Dietz syndrome, Osteochondromas (bone tumor), Nevoid basal cell carcinoma syndrome, and Pheochromocytomas have been identified. Specific mutations in different splice sites in various genes causing breast cancer (e.g., BRCA1, PALB2), ovarian cancer (e.g., SLC9A3R1, COL7A1, HSD17B7), colon cancer (e.g., APC, MLH1, DPYD), colorectal cancer (e.g., COL3A1, APC, HLA-A), skin cancer (e.g., COL17A1, XPA, POLH), and Fanconi anemia (e.g., FANC, FANA) have been uncovered. The mutations in the donor and acceptor splice sites in different genes causing a variety of cancers that have been identified by S&S are shown in Table 1. == Discovery of genes causing inherited disorders using S&S == Specific mutations in different splice sites in various genes that cause inherited disorders, including, for example, Type 1 diabetes (e.g., PTPN22, TCF1 (HCF-1A)), hypertension (e.g., LDL, LDLR, LPL), Marfan syndrome (e.g., FBN1, TGFBR2, FBN2), cardiac diseases (e.g., COL1A2, MYBPC3, ACTC1), eye disorders (e.g., EVC, VSX1) have been uncovered. A few example mutations in the donor and acceptor splice sites in different genes causing a variety of inherited disorders identified using S&S are shown in Table 2. == Genes causing immune system disorders == More than 100 immune system disorders affect humans, including inflammatory bowel diseases, multiple sclerosis, systemic lupus erythematosus, bloom syndrome, familial cold autoinflammatory syndrome, and dyskeratosis congenita. The Shapiro–Senapathy algorithm has been used to discover genes and mutations involved in many immune disorder diseases, including Ataxia telangiectasia, B-cell defects, epidermolysis bullosa, and X-linked agammaglobulinemia. Xeroderma pigmentosum, an autosomal recessive disorder is caused by faulty proteins formed due to new preferred splice donor site identified using S&S algorithm and resulted in defective nucleotide excision repair. Type I Bartter syndrome (BS) is caused by mutations in the gene SLC12A1. S&S algorithm helped in disclosing the presence of two novel heterozygous mutations c.724 + 4A > G in intron 5 and c.2095delG in intron 16 leading to complete exon 5 skipping. Mutations in the MYH gene, which is responsible for removing the oxidatively damaged DNA lesion are cancer-susceptible in the individuals. The IVS1+5C plays a causative role in the activation of a cryptic splice donor site and the alternative splicing in intron 1, S&S algorithm shows, guanine (G) at the position of IVS+5 is well conserved (at the frequency of 84%) among primates. This also supported the fact that the G/C SNP in the conserved splice junction of the MYH gene causes the alternative splicing of intron 1 of the β type transcript. Splice site scores were calculated according to S&S to find EBV infection in X-linked lymphoproliferative disease. Identification of Familial tumoral calcinosis (FTC) is an autosomal recessive disorder characterized by ectopic calcifications and elevated serum phosphate levels and it is because of aberrant splicing. == Application of S&S in hospitals for clinical practice and research == The Shapiro–Senapathy (S&S) algorithm has played a significant role in advancing the diagnosis and treatment of human diseases through its application in modern clinical genomics. With the widespread adoption of next-generation sequencing (NGS) technologies, the S&S algorithm is now routinely integrated into clinical practice by geneticists and diagnostic laboratories. It is implemented in various computational tools such as Human Splicing Finder (HSF), Splice Site Finder (SSF), and Alamut Visual, which assist in interpreting the functional impact of genetic variants on RNA splicing. The algorithm is particularly useful in identifying pathogenic splice site mutations in cases where the clinical presentation is unclear or where conventional diagnostic methods have failed to identify a causative gene. Its utility has been demonstrated across diverse patient cohorts, including individuals from different ethnic backgrounds with various cancers and inherited genetic disorders. The following are selected examples illustrating its application in clinical research. === Cancers === === Inherited disorders === == S&S - Algorithm for identifying splice sites, exons and split genes == The Shapiro–Senapathy algorithm (SSA) was developed to identify splice sites in uncharacterized genomic sequences, with early applications in the Human Genome Project. The method introduced a Position Weight Matrix (PWM)-based approach to analyze splicing sequences across eukaryotic organisms, marking the first computational framework to systematically define splice sites using probabilistic scoring. Key innovations of the algorithm included: Exon Detection – Exons were defined as sequences bounded by acceptor and donor splice sites with S&S scores above a threshold, requiring an open reading frame (ORF) for validation. Gene Prediction – The method enabled the identification of complete genes by assembling predicted exons, forming a basis for later gene-finding tools. Mutation Analysis – The algorithm distinguishes deleterious splice-site mutations (which disrupt protein function by lowering S&S scores) from neutral variations. This capability allowed researchers to study disease-linked cryptic splice sites in humans, animals, and plants. SSA's PWM-based framework influenced subsequent computational methods, including machine learning and neural network approaches, for splice-site prediction and alternative splicing research. It remains a foundational tool in genomics and disease studies. == Discovering the mechanisms of aberrant splicing in diseases == The Shapiro–Senapathy algorithm has been used to determine the various aberrant splicing mechanisms in genes due to deleterious mutations in the splice sites, which cause numerous diseases. Deleterious splice site mutations impair the normal splicing of the gene transcripts, and thereby make the encoded protei

    Read more →
  • Information architecture

    Information architecture

    Information architecture is the structural design of shared information environments, in particular the organisation of websites and software to support usability and findability. The term information architecture was coined by Richard Saul Wurman. Since its inception, information architecture has become an emerging community of practice focused on applying principles of design, architecture and information science in digital spaces. Typically, a model or concept of information is used and applied to activities which require explicit details of complex information systems. These activities include library systems and database development. == Definition == The term information architecture has different meanings in different branches of information systems or information technology. === User experience === In user experience design, information architecture has been described as the structural design of shared information environments, comprising the study and practice of organising and labelling web sites, intranets, online communities, and software to support user experience, in particular, the findability and usability of information. It has also been described as an emerging community of practice focused on bringing principles of design and architecture to the digital landscape. === Information systems === Technically speaking, information architecture comprises the combination of organization, labeling, search and navigation systems within websites and intranets, serving as a navigational aid to the content of information-rich systems. === Data architecture === Information architecture can be described as a subset of data architecture where usable data is constructed, designed, and arranged in a fashion most useful to the users of data. === Systems design === In the field of systems design, for example, information architecture is a component of enterprise architecture that deals with the information component when describing the structure of an enterprise. Some system design practitioners regard information architecture as strictly the application of information science to web design, which considers such issues as classification and information retrieval, and not factors like user experience and information design. == Principles == Principles of information architecture include the following: The principle of objects The principle of choices The principle of disclosure The principle of exemplars The principle of front doors The principle of multiple classification The principle of focused navigation The principle of growth == History == Richard Saul Wurman is credited with coining the term information architecture in relation to the design of information. From 1998 to 2015, Peter Morville and Louis Rosenfeld were co-authors of Information Architecture for the World Wide Web. Other authors include Jesse James Garrett and Christina Wodtke.

    Read more →
  • Species distribution modelling

    Species distribution modelling

    Species distribution modelling (SDM), also known as environmental (or ecological) niche modelling (ENM), habitat suitability modelling, predictive habitat distribution modelling, and range mapping uses ecological models to predict the distribution of a species across geographic space and time using environmental data. The environmental data are most often climate data (e.g. temperature, precipitation), but can include other variables such as soil type, water depth, and land cover. SDMs are used in several research areas in conservation biology, ecology and evolution. These models can be used to understand how environmental conditions influence the occurrence or abundance of a species, and for predictive purposes (ecological forecasting). Predictions from an SDM may be of a species' future distribution under climate change, a species' past distribution in order to assess evolutionary relationships, or the potential future distribution of an invasive species. Predictions of current and/or future habitat suitability can be useful for management applications (e.g. reintroduction or translocation of vulnerable species, reserve placement in anticipation of climate change). There are two main types of SDMs. Correlative SDMs, also known as climate envelope models, bioclimatic models, or resource selection function models, model the observed distribution of a species as a function of environmental conditions. Mechanistic SDMs, also known as process-based models or biophysical models, use independently derived information about a species' physiology to develop a model of the environmental conditions under which the species can exist. The extent to which such modelled data reflect real-world species distributions will depend on a number of factors, including the nature, complexity, and accuracy of the models used and the quality of the available environmental data layers; the availability of sufficient and reliable species distribution data as model input; and the influence of various factors such as barriers to dispersal, geologic history, or biotic interactions, that increase the difference between the realized niche and the fundamental niche. Environmental niche modelling may be considered a part of the discipline of biodiversity informatics. == History == A. F. W. Schimper used geographical and environmental factors to explain plant distributions in his 1898 Pflanzengeographie auf physiologischer Grundlage (Plant Geography Upon a Physiological Basis) and his 1908 work of the same name. Andrew Murray used the environment to explain the distribution of mammals in his 1866 The Geographical Distribution of Mammals. Robert Whittaker's work with plants and Robert MacArthur's work with birds strongly established the role the environment plays in species distributions. Elgene O. Box constructed environmental envelope models to predict the range of tree species. His computer simulations were among the earliest uses of species distribution modelling. The adoption of more sophisticated generalised linear models (GLMs) made it possible to create more sophisticated and realistic species distribution models. The expansion of remote sensing and the development of GIS-based environmental modelling increase the amount of environmental information available for model-building and made it easier to use. == Correlative vs mechanistic models == === Correlative SDMs === SDMs originated as correlative models. Correlative SDMs model the observed distribution of a species as a function of geographically referenced climatic predictor variables using multiple regression approaches. Given a set of geographically referred observed presences of a species and a set of climate maps, a model defines the most likely environmental ranges within which a species lives. Correlative SDMs assume that species are at equilibrium with their environment and that the relevant environmental variables have been adequately sampled. The models allow for interpolation between a limited number of species occurrences. For these models to be effective, it is required to gather observations not only of species presences, but also of absences, that is, where the species does not live. Records of species absences are typically not as common as records of presences, thus often "random background" or "pseudo-absence" data are used to fit these models. If there are incomplete records of species occurrences, pseudo-absences can introduce bias. Since correlative SDMs are models of a species' observed distribution, they are models of the realized niche (the environments where a species is found), as opposed to the fundamental niche (the environments where a species can be found, or where the abiotic environment is appropriate for the survival). For a given species, the realized and fundamental niches might be the same, but if a species is geographically confined due to dispersal limitation or species interactions, the realized niche will be smaller than the fundamental niche. Correlative SDMs are easier and faster to implement than mechanistic SDMs, and can make ready use of available data. Since they are correlative however, they do not provide much information about causal mechanisms and are not good for extrapolation. They will also be inaccurate if the observed species range is not at equilibrium (e.g. if a species has been recently introduced and is actively expanding its range). In standard SDMs, the distribution of a single species is often modeled, with unique parameters describing how environmental (abiotic) factors influence its occurrence probability. This allows for differentiated responses to environmental drivers among species, but can be problematic for data-deficient species. In contrast, similarities in environmental responses can be accounted for in multi-species SDMs, which model several species jointly using shared or hierarchically related parameters. However, neither approach explicitly accounts for community-level biotic interactions, which can be important in explaining species diversity patterns. Joint species distribution models (joint SDMs or J-SDMs) address this by modeling species co-occurrence patterns directly. The occurrence probability of a given species is thus influenced not only by abiotic drivers but also by inferred biotic associations with other species. This can improve accuracy for rarer taxa and provide insights into community ecology. Both standard SDMs and J-SDMs can be used to generate community-level metrics, such as species richness, by aggregating outputs across multiple species. These can be important for decision-making such as conservation planning. === Mechanistic SDMs === Mechanistic SDMs are more recently developed. In contrast to correlative models, mechanistic SDMs use physiological information about a species (taken from controlled field or laboratory studies) to determine the range of environmental conditions within which the species can persist. These models aim to directly characterize the fundamental niche, and to project it onto the landscape. A simple model may simply identify threshold values outside of which a species can't survive. A more complex model may consist of several sub-models, e.g. micro-climate conditions given macro-climate conditions, body temperature given micro-climate conditions, fitness or other biological rates (e.g. survival, fecundity) given body temperature (thermal performance curves), resource or energy requirements, and population dynamics. Geographically referenced environmental data are used as model inputs. Because the species distribution predictions are independent of the species' known range, these models are especially useful for species whose range is actively shifting and not at equilibrium, such as invasive species. Mechanistic SDMs incorporate causal mechanisms and are better for extrapolation and non-equilibrium situations. However, they are more labor-intensive to create than correlational models and require the collection and validation of a lot of physiological data, which may not be readily available. The models require many assumptions and parameter estimates, and they can become very complicated. Dispersal, biotic interactions, and evolutionary processes present challenges, as they aren't usually incorporated into either correlative or mechanistic models. Correlational and mechanistic models can be used in combination to gain additional insights. For example, a mechanistic model could be used to identify areas that are clearly outside the species' fundamental niche, and these areas can be marked as absences or excluded from analysis. See for a comparison between mechanistic and correlative models. == Niche models (correlative) == There are a variety of mathematical methods that can be used for fitting, selecting, and evaluating correlative SDMs. Models include "profile" methods, which are simple statistical techniques that use e.g. environmental distance to known sites of occurrence such as

    Read more →
  • Information retrieval

    Information retrieval

    Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an information need. The information need can be specified in the form of a search query. In the case of document retrieval, queries can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images, or sounds. Cross-modal retrieval implies retrieval across modalities. Automated information retrieval systems are used to reduce what has been called information overload. An IR system is a software system that provides access to books, journals, and other documents, as well as storing and managing those documents. Web search engines are the most visible IR applications. == Overview == An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval, a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevance. An object is an entity that is represented by information in a content collection or database. User queries are matched against the database information. However, as opposed to classical SQL queries of a database, in information retrieval the results returned may or may not match the query, so results are typically ranked. This ranking of results is a key difference of information retrieval searching compared to database searching. Depending on the application the data objects may be, for example, text documents, images, audio, mind maps or videos. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates or metadata. Most IR systems compute a numeric score on how well each object in the database matches the query, and rank the objects according to this value. The top ranking objects are then shown to the user. The process may then be iterated if the user wishes to refine the query. == History == there is ... a machine called the Univac ... whereby letters and figures are coded as a pattern of magnetic spots on a long steel tape. By this means the text of a document, preceded by its subject code symbol, can be recorded ... the machine ... automatically selects and types out those references which have been coded in any desired way at a rate of 120 words a minute The idea of using computers to search for relevant pieces of information was popularized in the article As We May Think by Vannevar Bush in 1945. It would appear that Bush was inspired by patents for a 'statistical machine' – filed by Emanuel Goldberg in the 1920s and 1930s – that searched for documents stored on film. The first description of a computer searching for information was described by Holmstrom in 1948, detailing an early mention of the Univac computer. Automated information retrieval systems were introduced in the 1950s: one even featured in the 1957 romantic comedy Desk Set. In the 1960s, the first large information retrieval research group was formed by Gerard Salton at Cornell. By the 1970s several different retrieval techniques had been shown to perform well on small text corpora such as the Cranfield collection (several thousand documents). Large-scale retrieval systems, such as the Lockheed Dialog system, came into use early in the 1970s. In 1992, the US Department of Defense along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. The aim of this was to look into the information retrieval community by supplying the infrastructure that was needed for evaluation of text retrieval methodologies on a very large text collection. This catalyzed research on methods that scale to huge corpora. The introduction of web search engines has boosted the need for very large scale retrieval systems even further. By the late 1990s, the rise of the World Wide Web fundamentally transformed information retrieval. While early search engines such as AltaVista (1995) and Yahoo! (1994) offered keyword-based retrieval, they were limited in scale and ranking refinement. The breakthrough came in 1998 with the founding of Google, which introduced the PageRank algorithm, using the web's hyperlink structure to assess page importance and improve relevance ranking. During the 2000s, web search systems evolved rapidly with the integration of machine learning techniques. These systems began to incorporate user behavior data (e.g., click-through logs), query reformulation, and content-based signals to improve search accuracy and personalization. In 2009, Microsoft launched Bing, introducing features that would later incorporate semantic web technologies through the development of its Satori knowledge base. Academic analysis have highlighted Bing's semantic capabilities, including structured data use and entity recognition, as part of a broader industry shift toward improving search relevance and understanding user intent through natural language processing. A major leap occurred in 2018, when Google deployed BERT (Bidirectional Encoder Representations from Transformers) to better understand the contextual meaning of queries and documents. This marked one of the first times deep neural language models were used at scale in real-world retrieval systems. BERT's bidirectional training enabled a more refined comprehension of word relationships in context, improving the handling of natural language queries. Because of its success, transformer-based models gained traction in academic research and commercial search applications. Simultaneously, the research community began exploring neural ranking models that outperformed traditional lexical-based methods. Long-standing benchmarks such as the Text REtrieval Conference (TREC), initiated in 1992, and more recent evaluation frameworks Microsoft MARCO(MAchine Reading COmprehension) (2019) became central to training and evaluating retrieval systems across multiple tasks and domains. MS MARCO has also been adopted in the TREC Deep Learning Tracks, where it serves as a core dataset for evaluating advances in neural ranking models within a standardized benchmarking environment. As deep learning became integral to information retrieval systems, researchers began to categorize neural approaches into three broad classes: sparse, dense, and hybrid models. Sparse models, including traditional term-based methods and learned variants like SPLADE, rely on interpretable representations and inverted indexes to enable efficient exact term matching with added semantic signals. Dense models, such as dual-encoder architectures like ColBERT, use continuous vector embeddings to support semantic similarity beyond keyword overlap. Hybrid models aim to combine the advantages of both, balancing the lexical (token) precision of sparse methods with the semantic depth of dense models. This way of categorizing models balances scalability, relevance, and efficiency in retrieval systems. As IR systems increasingly rely on deep learning, concerns around bias, fairness, and explainability have also come to the picture. Research is now focused not just on relevance and efficiency, but on transparency, accountability, and user trust in retrieval algorithms. == Applications == Areas where information retrieval techniques are employed include (the entries are in alphabetical order within each category): === General applications === Digital libraries Information filtering Recommender systems Media search Blog search Image retrieval 3D retrieval Music retrieval News search Speech retrieval Video retrieval Search engines Site search Desktop search Enterprise search Federated search Mobile search Social search Web search === Domain-specific applications === Expert search finding Genomic information retrieval Geographic information retrieval Information retrieval for chemical structures Information retrieval in software engineering Legal information retrieval Vertical search === Other retrieval methods === Methods/Techniques in which information retrieval techniques are employed include: Cross-modal retrieval Adversarial information retrieval Automatic summarization Multi-document summarization Compound term processing Cross-lingual retrieval Document classification Spam filtering Question answering == Model types == In order to effectively retrieve relevant documents by IR strategies, the documents are typically transformed into a suitable representation. Each retrieval strategy incorporates a specific model for its document representation purposes. The picture on the right illustrates the relationship of som

    Read more →
  • Archetype (information science)

    Archetype (information science)

    In the field of informatics, an archetype is a formal re-usable model of a domain concept. Traditionally, the term archetype is used in psychology to mean an idealized model of a person, personality or behaviour (see Archetype). The usage of the term in informatics is derived from this traditional meaning, but applied to domain modelling instead. An archetype is defined by the OpenEHR Foundation (for health informatics) as follows: An archetype is a computable expression of a domain content model in the form of structured constraint statements, based on some reference model. openEHR archetypes are based on the openEHR reference model. Archetypes are all expressed in the same formalism. In general, they are defined for wide re-use, however, they can be specialized to include local particularities. They can accommodate any number of natural languages and terminologies. == Formal specifications == The modern archetype formalism is specified and maintained by the openEHR Foundation, and although originally developed for the health IT domain, is completely domain-independent, and has been used in geospatial modelling, telecommunications, and defence. The archetype formalism consists of a number of specifications including: 'ADL 1.4': original release of Archetype Definition Language (ADL) and Archetype Object Model (AOM); widely implemented in health IT domain; 'ADL 2': modern release of Archetype Definition Language (ADL), Archetype Object Model (AOM), Archetype Identification specification and Archetype Technology Overview. The Archetype Technology Overview provides a short technical overview of the archetype formalism useful for new users. The ADL/AOM 1.4 specifications were provided to ISO TC 215 in 2008 by the openEHR Foundation and became the ISO 13606-2 standard, extant until 2019. ISO TC 215 accepted the AOM 2 specification as the basis for a revision of this standard, which was issued in 2019. In late 2015, the Object Management Group (OMG) accepted an RfP entitled 'Archetype Modeling Language (AML)' as a new candidate standard. This specification is a form of ADL re-engineered as a UML profile so as to enable archetype modelling to be supported within UML tools. == Tools == A number of tools area available for working with archetypes. Most are listed on the openEHR modelling tools page. They include: ADL Designer, a modern AOM2-based web editing application Archetype Editor, an original desktop clinical modelling tool Template Designer, an original desktop clinical templating tool LinkEHR, an archetype and data integration tool ADL Workbench, reference compiler and visualiser tool == Example ==

    Read more →
  • Applied Information Science in Economics

    Applied Information Science in Economics

    The Applied Information Science in Economics (Russian: Прикладная информатика в Экономике) or Applied Computer Science in Economics is a professional qualification generally awarded in Russian Federation. The degree inherited from the U.S.S.R. education system also known as Specialist degree. The degree is awarded after five years of full-time study and includes several internships, course-works, thesis writing and defense. The degree has similarities with German Magister Artium or Diplom degree. However, due to the Bologna Process number of such degrees are declining. Degree focuses on applying mathematical methods in economics involving maximum information technology. It is very close to applied mathematics, but includes also major part of computer science. == List of specialty codes in the education system == 080801 - Applied computer science in economics 351400 - Applied computer science == Fields of activity == Organization and management; Project design; Experimental research; Marketing; Consulting; Operational and Maintenance. == Major == Information Science and Programming. High Level Methods of Information Science and Programming. Information Technologies in Economics. Computer Systems, Networks and Telecommunications Services. Operational Environments, Systems and Shells. Architecture and Design of Information Systems for Companies. Data Bases. Information security. Information Management. Imitative Simulation.

    Read more →
  • ArchiMate

    ArchiMate

    ArchiMate ( AR-ki-mayt) is an open and independent enterprise architecture modeling language to support the description, analysis and visualization of architecture within and across business domains in an unambiguous way. ArchiMate is a technical standard from The Open Group and is based on concepts from the now superseded IEEE 1471 standard. It is supported by various tool vendors and consulting firms. ArchiMate is also a registered trademark of The Open Group. The Open Group has a certification program for ArchiMate users, software tools and courses. ArchiMate distinguishes itself from other languages such as Unified Modeling Language (UML) and Business Process Modeling and Notation (BPMN) by its enterprise modelling scope. Also, UML and BPMN are meant for a specific use and they are quite heavy – containing about 150 (UML) and 250 (BPMN) modeling concepts whereas ArchiMate works with just about 50 (in version 2.0). The goal of ArchiMate is to be ”as small as possible”, not to cover every edge scenario imaginable. To be easy to learn and apply, ArchiMate was intentionally restricted “to the concepts that suffice for modeling the proverbial 80% of practical cases". == Overview == ArchiMate offers a common language for describing the construction and operation of business processes, organizational structures, information flows, IT systems, and technical infrastructure. This insight helps the different stakeholders to design, assess, and communicate the consequences of decisions and changes within and between these business domains. The main concepts and relationships of the ArchiMate language can be seen as a framework, the so-called Archimate Framework: It divides the enterprise architecture into a business, application and technology layer. In each layer, three aspects are considered: active elements, an internal structure and elements that define use or communicate information. One of the objectives of the ArchiMate language is to define the relationships between concepts in different architecture domains. The concepts of this language therefore hold the middle between the detailed concepts, which are used for modeling individual domains (for example, the Unified Modeling Language (UML) for modeling software products), and Business Process Model and Notation (BPMN), which is used for business process modeling. == History == ArchiMate is partly based on the now superseded IEEE 1471 standard. It was developed in the Netherlands by a project team from the Telematica Instituut in cooperation with several Dutch partners from government, industry and academia. Among the partners were Ordina NV, Radboud Universiteit Nijmegen, the Leiden Institute for Advanced Computer Science (LIACS) and the Centrum Wiskunde & Informatica (CWI). Later, tests were performed in organizations such as ABN AMRO, the Dutch Tax and Customs Administration and the ABP. The development process lasted from July 2002 to December 2004, and took about 35 person years and approximately 4 million euros. The development was funded by the Dutch government (Dutch Tax and Customs Administration), and business partners, including ABN AMRO and the ABP Pension Fund. In 2008 the ownership and stewardship of ArchiMate was transferred to The Open Group. It is now managed by the ArchiMate Forum within The Open Group. In February 2009 The Open Group published the ArchiMate 1.0 standard as a formal technical standard. In January 2012 the ArchiMate 2.0 standard, and in 2013 the ArchiMate 2.1 standard was released. In June 2016, the Open Group released version 3.0 of the ArchiMate Specification. An update to Archimate 3.0.1 came out in August 2017. Archimate 3.1 was published 5 November 2019. The latest version of the ArchiMate Specification is version 3.2 released October 2022. Version 3.0 adds enhanced support for capability-oriented strategic modelling, new entities representing physical resources (for modelling the ingredients, equipment and transport resources used in the physical world) and a generic metamodel showing the entity types and the relationships between them. == ArchiMate framework == === Core framework === The main concepts and elements of the ArchiMate language are being presented as ArchiMate core framework. It consists of three layers and three aspects. This creates a matrix of combinations. Every layer has its passive structure, behavior and active structure aspects. ==== Layers ==== ArchiMate has a layered and service-oriented look on architectural models. The higher layers make use of services that are provided by the lower layers. Although, at an abstract level, the concepts that are used within each layer are similar, we define more concrete concepts that are specific for a certain layer. In this context, we distinguish three main layers: The business layer is about business processes, services, functions and events of business units. This layer "offers products and services to external customers, which are realized in the organization by business processes performed by business actors and roles". The application layer is about software applications that "support the components in the business with application services". The technology layer deals "with the hardware and communication infrastructure to support the application layer. This layer offers infrastructural services needed to run applications, realized by computer and communication hardware and system software". Each of these main layers can be further divided in sub-layers. For example, in the business layer, the primary business processes realising the products of a company may make use of a layer of secondary (supporting) business processes; in the application layer, the end-user applications may make use of generic services offered by supporting applications. On top of the business layer, a separate environment layer may be added, modelling the external customers that make use of the services of the organisation (although these may also be considered part of the business layer). In line with service orientation, the most important relation between layers is formed by use relations, which show how the higher layers make use of the services of lower layers. However, a second type of link is formed by realisation relations: elements in lower layers may realise comparable elements in higher layers; e.g., a ‘data object’ (application layer) may realise a ‘business object’ (business layer); or an ‘artifact’ (technology layer) may realise either a ‘data object’ or an ‘application component’ (application layer). ==== Aspects ==== Passive structure is the set of entities on which actions are conducted. In the business layer the example would be information objects, in the application layer data objects and in the technology layer, they could include physical objects. Behavior refers to the processes and functions performed by the actors. "Structural elements are assigned to behavioral elements, to show who or what displays the behavior". Active structure is the set of entities that display some behavior, e.g. business actors, devices, or application components. === Full framework === The Full ArchiMate framework is enriched by the physical layer, which was added to allow modeling of “physical equipment, materials, and distribution networks” and was not present in the previous version. The implementation and migration layer adds elements that allow architects to model a state of transition, to mark parts of the architecture that are temporary for the purpose, as the name says, of implementation and migration. Strategy layer adds three elements: resource, capability and course of action. These elements help to incorporate strategic dimension to the ArchiMate language by allowing it to depict the usage of resources and capabilities in order to achieve some strategic goals. Finally, there is a motivation aspect that allows different stakeholders to describe the motivation of specific actors or domains, which can be quite important when looking at one thing from several different angles. It adds several elements like stakeholder, value, driver, goal, meaning etc. == ArchiMate language == The ArchiMate language is formed as a top-level and is hierarchical. On the top, there is a model. A model is a collection of concepts. A concept can be either an element or a relationship. An element can be either of behavior type, structure, motivation or a so-called composite element (which means that it does not fit just one aspect of the framework, but two or more). The functionality of all concepts without a dependency on a specific layer is described by the generic metamodel. This layer-independent description of concepts is useful when trying to understand the mechanics of the Archimate language. === Concepts === ==== Elements ==== The generic elements are distributed into the same categories as the layers: Active structure elements Behavior elements Passive structure elements Motivation elements Active structure e

    Read more →
  • Reasoning model

    Reasoning model

    A reasoning model, also known as a reasoning language model (RLM) or large reasoning model (LRM), is a type of large language model (LLM) that has been specifically trained to solve complex tasks requiring multiple steps of logical reasoning. These models demonstrate superior performance on logic, mathematics, and programming tasks compared to standard LLMs. They possess the ability to revisit and revise earlier reasoning steps and utilize additional computation during inference as a method to scale performance, complementing traditional scaling approaches based on training data size, model parameters, and training compute. == Overview == Unlike traditional language models that generate responses immediately, reasoning models allocate additional compute, or thinking, time before producing an answer to solve multi-step problems. OpenAI introduced this terminology in September 2024 when it released the o1 series, describing the models as designed to "spend more time thinking" before responding. The company framed o1 as a reset in model naming that targets complex tasks in science, coding, and mathematics, and it contrasted o1's performance with GPT-4o on benchmarks such as AIME and Codeforces. Independent reporting the same week summarized the launch and highlighted OpenAI's claim that o1 automates chain-of-thought style reasoning to achieve large gains on difficult exams. In operation, reasoning models generate internal chains of intermediate steps, then select and refine a final answer. OpenAI reported that o1's accuracy improves as the model is given more reinforcement learning during training and more test-time compute at inference. The company initially chose to hide raw chains and instead return a model-written summary, stating that it "decided not to show" the underlying thoughts so researchers could monitor them without exposing unaligned content to end users. Commercial deployments document separate "reasoning tokens" that meter hidden thinking and a control for "reasoning effort" that tunes how much compute the model uses. These features make the models slower than ordinary chat systems while enabling stronger performance on difficult problems. == History == The research trajectory toward reasoning models combined advances in supervision, prompting, and search-style inference. Early alignment work on reinforcement learning from human feedback showed that models can be fine-tuned to follow instructions with "human feedback" and preference-based rewards. In 2022, Google Research scientists Jason Wei and Denny Zhou showed that chain-of-thought prompting "significantly improves the ability" of large models on complex reasoning tasks. Input → Step 1 → Step 2 → ⋯ → Step n ⏟ Reasoning chain → Answer {\displaystyle {\text{Input}}\rightarrow \underbrace {{\text{Step}}_{1}\rightarrow {\text{Step}}_{2}\rightarrow \cdots \rightarrow {\text{Step}}_{n}} _{\text{Reasoning chain}}\rightarrow {\text{Answer}}} A companion result demonstrated that the simple instruction "Let's think step by step" can elicit zero-shot reasoning. Follow-up work introduced self-consistency decoding, which "boosts the performance" of chain-of-thought by sampling diverse solution paths and choosing the consensus, and tool-augmented methods such as ReAct, a portmanteau of Reason and Act, that prompt models to "generate both reasoning traces" and actions. Research then generalized chain-of-thought into search over multiple candidate plans. The Tree-of-Thoughts framework from Princeton computer scientist Shunyu Yao proposes that models "perform deliberate decision making" by exploring and backtracking over a tree of intermediate thoughts. OpenAI's reported breakthrough focused on supervising reasoning processes rather than only outcomes, with Lightman et al.'s "Let's Verify Step by Step" reporting that rewarding each correct step "significantly outperforms outcome supervision" on challenging math problems and improves interpretability by aligning the chain-of-thought with human judgment. OpenAI's o1 announcement ties these strands together with a large-scale reinforcement learning algorithm that trains the model to refine its own chain of thought, and it reports that accuracy rises with more training compute and more time spent thinking at inference. Together, these developments define the core of reasoning models. They use supervision signals that evaluate the quality of intermediate steps, they exploit inference-time exploration such as consensus or tree search, and they expose controls for how much internal thinking compute to allocate. OpenAI's o1 family made this approach available at scale in September 2024 and popularized the label "reasoning model" for LLMs that deliberately think before they answer. The development of reasoning models illustrates Richard S. Sutton's "bitter lesson" that scaling compute typically outperforms methods based on human-designed insights. This principle was demonstrated by researchers at the Generative AI Research Lab (GAIR), who initially attempted to replicate o1's capabilities using sophisticated methods including tree search and reinforcement learning in late 2024. Their findings, published in the "o1 Replication Journey" series, revealed that knowledge distillation, a comparatively straightforward technique that trains a smaller model to mimic o1's outputs, produced unexpectedly strong performance. This outcome illustrated how direct scaling approaches can, at times, outperform more complex engineering solutions. === Drawbacks === Reasoning models require significantly more computational resources during inference compared to non-reasoning models. Research on the American Invitational Mathematics Examination (AIME) benchmark found that reasoning models were 10 to 74 times more expensive to operate than their non-reasoning counterparts. The extended inference time is attributed to the detailed, step-by-step reasoning outputs that these models generate, which are typically much longer than responses from standard large language models that provide direct answers without showing their reasoning process. One researcher in early 2025 argued that these models may face potential additional denial-of-service concerns with "overthinking attacks." === Releases === ==== 2024 ==== In September 2024, OpenAI released o1-preview, a large language model with enhanced reasoning capabilities. The full version, o1, was released in December 2024. OpenAI initially shared preliminary results on its successor model, o3, in December 2024, with the full o3 model becoming available in 2025. Alibaba released reasoning versions of its Qwen large language models in November 2024. In December 2024, the company introduced QvQ-72B-Preview, an experimental visual reasoning model. In December 2024, Google introduced Deep Research in Gemini, a feature designed to conduct multi-step research tasks. On December 16, 2024, researchers demonstrated that by scaling test-time compute, a relatively small Llama 3B model could outperform a much larger Llama 70B model on challenging reasoning tasks. This experiment suggested that improved inference strategies can unlock reasoning capabilities even in smaller models. ==== 2025 ==== In January 2025, DeepSeek released R1, a reasoning model that achieved performance comparable to OpenAI's o1 at significantly lower computational cost. The release demonstrated the effectiveness of Group Relative Policy Optimization (GRPO), a reinforcement learning technique used to train the model. On January 25, 2025, DeepSeek enhanced R1 with web search capabilities, allowing the model to retrieve information from the internet while performing reasoning tasks. Research during this period further validated the effectiveness of knowledge distillation for creating reasoning models. The s1-32B model achieved strong performance through budget forcing and scaling methods, reinforcing findings that simpler training approaches can be highly effective for reasoning capabilities. On February 2, 2025, OpenAI released Deep Research, a feature powered by their o3 model that enables users to conduct comprehensive research tasks. The system generates detailed reports by automatically gathering and synthesizing information from multiple web sources. OpenAI called GPT-4.5 its "last non-chain-of-thought model", and implemented with GPT-5 a router model that selects a model based on the difficulty of the task. ==== 2026 ==== In January 2026, Moonshot AI released Kimi K2.5, an open-source 1 trillion parameter MoE model with 32 billion active parameters. It uses an “Agent Swarm” system that dynamically decomposes tasks into sub-agents for reasoning and execution, enabling more scalable multi-step problem solving than a single sequential reasoning chain. == Training == Reasoning models follow the familiar large-scale pretraining used for frontier language models, then diverge in the post-training and optimization. OpenAI reports that o1 is trained with a large-

    Read more →
  • Artificial Intelligence Applications Institute

    Artificial Intelligence Applications Institute

    The Artificial Intelligence Applications Institute (AIAI) at the School of Informatics at the University of Edinburgh is a non-profit technology transfer organisation that promoted research in the field of artificial intelligence. == History == The Artificial Intelligence Applications Institute (AIAI) was founded in 1983 at the University of Edinburgh as a specialist research and technology-transfer unit focusing on the practical uses of artificial intelligence (AI). The institute was established by Professor Jim Howe and colleagues from the Science and Engineering Research Council (SERC) Special Interest Group in AI in the Department of Artificial Intelligence, with a mission to apply AI techniques to solve real-world industrial and governmental problems. Under the directorship of Austin Tate, who served from 1985 to 2019, AIAI became one of the leading UK research centres devoted to AI programming systems, intelligent planning systems, decision support, and knowledge-based engineering. It collaborated with both academic partners and international organisations such as the European Space Agency and the UK Ministry of Defence. In 2001, AIAI joined the newly created Centre for Intelligent Systems and their Applications (CISA) within the University's School of Informatics. In December 2019, the institute was renamed the Artificial Intelligence and its Applications Institute to reflect a broader integration of fundamental and applied AI research. == Research programmes == AIAI’s research spans multiple areas of artificial intelligence, including: AI programming Systems - Edinburgh Prolog, Edinburgh Common Lisp, Logo; Knowledge representation and reasoning – development of ontologies, rule-based inference, and semantic modelling; Automated planning and scheduling – intelligent task management systems used in aerospace, manufacturing, and emergency response; Natural language processing and intelligent agents – interaction frameworks for human–computer collaboration; AI ethics and decision-making – research into responsible deployment and evaluation of autonomous systems. The institute also contributes to interdisciplinary fields such as computational creativity, explainable AI, and human–AI interaction. AIAI maintains close collaboration with the Bayes Centre and the Alan Turing Institute through joint research programmes and doctoral training initiatives. == Technology transfer and impact == From its inception, AIAI has combined academic research with technology-transfer activity, offering professional training, industrial consultancy, and bespoke software systems. It pioneered one of the earliest knowledge-based project-management systems, O-Plan, later evolved into the I-Plan framework used for autonomous planning and workflow management.

    Read more →
  • Ontology for Biomedical Investigations

    Ontology for Biomedical Investigations

    The Ontology for Biomedical Investigations (OBI) is an open-access, integrated ontology for the description of biological and clinical investigations. OBI provides a model for the design of an investigation, the protocols and instrumentation used, the materials used, the data generated and the type of analysis performed on it. The project is being developed as part of the OBO Foundry and as such adheres to all the principles therein such as orthogonal coverage (i.e. clear delineation from other foundry member ontologies) and the use of a common formal language. In OBI the common formal language used is the Web Ontology Language (OWL). As of March 2008, a pre-release version of the ontology was made available at the project's SVN repository. == Scope == The Ontology for Biomedical Investigations (OBI) addresses the need for controlled vocabularies to support integration and joint ("cross-omics") analysis of experimental data, a need originally identified in the transcriptomics domain by the FGED Society, which developed the MGED Ontology as an annotation resource for microarray data.Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, et al. (November 2007). "The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration". Nature Biotechnology. 25 (11): 1251–5. doi:10.1038/nbt1346. PMC 2814061. PMID 17989687. OBI uses the basic formal ontology upper-level ontology as a means of describing general entities that do not belong to a specific problem domain. As such, all OBI classes are a subclass of some BFO class. The ontology has the scope of modeling all biomedical investigations and as such contains ontology terms for aspects such as: biological material – for example blood plasma instrument (and parts of an instrument therein) – for example DNA microarray, centrifuge information content – such as an image or a digital information entity such as an electronic medical record design and execution of an investigation (and individual experiments therein) – for example study design, electrophoresis material separation data transformation (incorporating aspects such as data normalization and data analysis) – for example principal components analysis dimensionality reduction, mean calculation Less 'concrete' aspects such as the role a given entity may play in a particular scenario (for example the role of a chemical compound in an experiment) and the function of an entity (for example the digestive function of the stomach to nutriate the body) are also covered in the ontology. == OBI consortium == The MGED Ontology was originally identified in the transcriptomics domain by the FGED Society and was developed to address the needs of data integration. Following a mutual decision to collaborate, this effort later became a wider collaboration between groups such as FGED, PSI and MSI in response to the needs of areas such as transcriptomics, proteomics and metabolomics and the FuGO (Functional Genomics Investigation Ontology) was created. This later became the OBI covering the wider scope of all biomedical investigations. As an international, cross-domain initiative, the OBI consortium draws upon a pool of experts from a variety of fields, not limited to biology. The current list of OBI consortium members is available at the OBI consortium website. The consortium is made up of a coordinating committee which is a combination of two subgroups, the Community Representative (those representing a particular biomedical community) and the Core Developers (ontology developers who may or may not be members of any single community). Separate to the coordinating committee is the Developers Working Group which consists of developers within the communities collaborating in the development of OBI at the discretion of current OBI Consortium members. == Papers on OBI ==

    Read more →
  • Brian Deer Classification System

    Brian Deer Classification System

    The Brian Deer Classification System (BDC) is a library classification system used to organize materials in libraries with specialized Indigenous collections. The system was created in the mid-1970s by Canadian librarian A. Brian Deer, a Kahnawake Mohawk. It has been adapted for use in a British Columbia version, and also by a small number of First Nations libraries in Canada. == History and usage == Deer designed his classification system while working in the library of the National Indian Brotherhood (now the Assembly of First Nations) from 1974 to 1976. Instead of using a standard library classification scheme, such as that of the Library of Congress, he created a new system to organize the library's historic indigenous research materials and papers. He later worked at the library of the Union of British Columbia Indian Chiefs, where he developed a system for its holdings. He returned to Kahnawake, working at its Cultural Centre at Kahnawake and the Kahnawake Branch branch of the Mohawk Nation Office. His system was flexible, and he created new forms for their collections. The new systems Deer created were designed specifically for the materials in each collection according to the concerns of local Indigenous people at the time (for example, categories included land claims, treaty rights, resource management, and Elders' stories). Between 1978 and 1980, the system was adapted for use in British Columbia by Gene Joseph and Keltie McCall while they were working at the Union of British Columbia Indian Chiefs, becoming known as BDC-BC. Joseph later adapted it further for use in the Xwi7xwa Library at University of British Columbia, Vancouver. Though the Brian Deer Classification was not created as a universal classification solution for Indigenous resources, the system has provided a foundation for specialized libraries to create their own localized classification schemes. Variations of the Brian Deer Classification System are used in a small number of Canadian libraries. One prominent library using BDC is the X̱wi7x̱wa Library at the University of British Columbia, which uses a British Columbia-focused version of BDC along with First Nations House of Learning subject headings. The Union of British Columbia Indian Chiefs Resource Centre issued a revised BDC-BC in 2014, with the goal of providing users with a more flexible and culturally appropriate approach to organizing their resources. The Aanischaaukamikw Cree Cultural Institute in Oujé-Bougoumou, Quebec, implemented a local adaptation of BDC when they opened in 2012. In 2020 the Carrier Sekani Tribal Council in Prince George, British Columbia, shifted from organizing its library with the Dewey Decimal Classification to using a version of the BDC. They added new subject heading categories for topics of local interest such as the crisis of Missing and murdered Indigenous women. Simon Fraser University Library began developing the Indigenous Curriculum Resource Centre (ICRC) in 2020, with the physical space opening in 2023. The ICRC is Call to Action 21 of SFU's Aboriginal Reconciliation Council's final report, Walk This Path With Us. Through its collection, the ICRC supports those interested in learning about how and why decolonizing pedagogy and teaching practices are important. The physical items in the collection are catalogued using a modified Brian Deer Classification system. In 2022 Kwantlen Polytechnic University’s χʷəχʷéy̓əm Indigenous Collection released a revised BDC-BC System. This BDC contains works exclusively with Indigenous authored materials and expands the cuttering systems of previous BDC, with the result that much of the collection reflects a spatial relationality. The implementation of this BDC was possible due to the tireless work at Xwi7xwa Library, Union of British Columbia Indian Chiefs Resource Centre, and Simon Fraser University Library's Indigenous Curriculum Resource Centre. == Structure == The high-level organizational structure of BDC reflects a First Nations worldview, with an emphasis on relationships between and among people, animals, and the land. Subcategories demonstrate the relationships among First Nations by grouping them geographically as opposed to alphabetically; the latter is a practice frequently used for specific topics in the Library of Congress Classification. The top-level hierarchy of the X̱wi7x̱wa Library adaptation of BDC-BC demonstrates the emphasis on access to subjects prioritized by a First Nation collection: Reference Materials Local History History International Education Economic Development Housing and Community Development Criminal Justice System Constitution (Canada) and First Nations Self Government Rights and Title Natural Resources Community Resources Health World View Fine Arts Languages Literature The system is not designed to provide a comprehensive description of all topics of interest to North American Indigenous peoples; in addition, its use is limited in scope, being intended for small and specialized libraries. While English is used in the classification scheme as a common language among First Nations peoples and non-Indigenous library users, Indigenous spellings and terminology that local library users would expect to find are used to provide access. Short and easily remembered call numbers are used to facilitate use by both library workers and patrons, with the recognition that Indigenous libraries often have a small staff and limited resources to devote to cataloging. Beyond its simplicity, one potential drawback of the system is its shortage of clear guidelines for application, which provides flexibility but can also result in inconsistencies within and between library catalogs. Because few libraries use the BDC and there are limited examples for use as case studies, implementing the system and keeping it up-to-date can prove a challenge for libraries with limited resources. However, X̱wi7x̱wa Library head librarian Ann Doyle describes the system as "an important part of the body of Indigenous scholarship" that should be retained as a reflection of Indigenous worldviews, as well as for ease of access for Indigenous library users.

    Read more →
  • Adversarial machine learning

    Adversarial machine learning

    Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. Machine learning techniques are mostly designed to work on specific problem sets, under the assumption that the training and test data are generated from the same statistical distribution (IID). However, this assumption is often violated in practical high-stake applications, where users may intentionally supply fabricated data that violates the statistical assumption. Most common attacks in adversarial machine learning include evasion attacks, data poisoning attacks, Byzantine attacks and model extraction. == History == At the MIT Spam Conference in January 2004, John Graham-Cumming showed that a machine-learning spam filter could be used to defeat another machine-learning spam filter by automatically learning which words to add to a spam email to get the email classified as not spam. In 2004, Nilesh Dalvi and others noted that linear classifiers used in spam filters could be defeated by simple "evasion attacks" as spammers inserted "good words" into their spam emails. (Around 2007, some spammers added random noise to fuzz words within "image spam" in order to defeat OCR-based filters.) In 2006, Marco Barreno and others published "Can Machine Learning Be Secure?", outlining a broad taxonomy of attacks. As late as 2013 many researchers continued to hope that non-linear classifiers (such as support vector machines and neural networks) might be robust to adversaries, until Battista Biggio and others demonstrated the first gradient-based attacks on such machine-learning models (2012–2013). In 2012, deep neural networks began to dominate computer vision problems; starting in 2014, Christian Szegedy and others demonstrated that deep neural networks could be fooled by adversaries, again using a gradient-based attack to craft adversarial perturbations. Further work would show that adversarial attacks are harder to produce in uncontrolled environments, due to the different environmental constraints that cancel out the effect of noise. For example, any small rotation or slight illumination on an adversarial image can destroy the adversariality. In addition, researchers such as Google Brain's Nick Frosst point out that it is much easier to make self-driving cars miss stop signs by physically removing the sign itself, rather than creating adversarial examples. Frosst also believes that the adversarial machine learning community incorrectly assumes models trained on a certain data distribution will also perform well on a completely different data distribution. He suggests that a new approach to machine learning should be explored, and is currently working on a unique neural network that has characteristics more similar to human perception than state-of-the-art approaches. While adversarial machine learning continues to be heavily rooted in academia, large tech companies such as Google, Microsoft, and IBM have begun curating documentation and open source code bases to allow others to concretely assess the robustness of machine learning models and minimize the risk of adversarial attacks. === Examples === Examples include attacks in spam filtering, where spam messages are obfuscated through the misspelling of "bad" words or the insertion of "good" words; attacks in computer security, such as obfuscating malware code within network packets or modifying the characteristics of a network flow to mislead intrusion detection; attacks in biometric recognition where fake biometric traits may be exploited to impersonate a legitimate user; or to compromise users' template galleries that adapt to updated traits over time. Researchers showed that by changing only one-pixel it was possible to fool deep learning algorithms. Others 3-D printed a toy turtle with a texture engineered to make Google's object detection AI classify it as a rifle regardless of the angle from which the turtle was viewed. Creating the turtle required only low-cost commercially available 3-D printing technology. A machine-tweaked image of a dog was shown to look like a cat to both computers and humans. A 2019 study reported that humans can guess how machines will classify adversarial images. Researchers discovered methods for perturbing the appearance of a stop sign such that an autonomous vehicle classified it as a merge or speed limit sign. A data poisoning filter called Nightshade was released in 2023 by researchers at the University of Chicago. It was created for use by visual artists to put on their artwork to corrupt the data set of text-to-image models, which usually scrape their data from the internet without the consent of the image creator. McAfee attacked Tesla's former Mobileye system, fooling it into driving 50 mph over the speed limit, simply by adding a two-inch strip of black tape to a speed limit sign. Adversarial patterns on glasses or clothing designed to deceive facial-recognition systems or license-plate readers, have led to a niche industry of "stealth streetwear". An adversarial attack on a neural network can allow an attacker to inject algorithms into the target system. Researchers can also create adversarial audio inputs to disguise commands to intelligent assistants in benign-seeming audio; a parallel literature explores human perception of such stimuli. Clustering algorithms are used in security applications. Malware and computer virus analysis aims to identify malware families, and to generate specific detection signatures. In the context of malware detection, researchers have proposed methods for adversarial malware generation that automatically craft binaries to evade learning-based detectors while preserving malicious functionality. Optimization-based attacks such as GAMMA use genetic algorithms to inject benign content (for example, padding or new PE sections) into Windows executables, framing evasion as a constrained optimization problem that balances misclassification success with the size of the injected payload and showing transferability to commercial antivirus products. Complementary work uses generative adversarial networks (GANs) to learn feature-space perturbations that cause malware to be classified as benign; Mal-LSGAN, for instance, replaces the standard GAN loss with a least-squares objective and modified activation functions to improve training stability and produce adversarial malware examples that substantially reduce true positive rates across multiple detectors. == Challenges in applying machine learning to security == Researchers have observed that the constraints under which machine-learning techniques function in the security domain are different from those of common benchmark domains. Security data may change over time, include mislabeled samples, or reflect adversarial behavior, which complicates evaluation and reproducibility. === Data collection issues === Security datasets vary across formats, including binaries, network traces, and log files. Studies have reported that the process of converting these sources into features can introduce bias or inconsistencies. In addition, time-based leakage can occur when related malware samples are not properly separated across training and testing splits, which may lead to overly optimistic results. === Labeling and ground truth challenges === Malware labels are often unstable because different antivirus engines may classify the same sample in conflicting ways. Ceschin et al. note that families may be renamed or reorganized over time, causing further discrepancies in ground truth and reducing the reliability of benchmarks. === Concept drift === Because malware creators continuously adapt their techniques, the statistical properties of malicious samples also change. This form of concept drift has been widely documented and may reduce model performance unless systems are updated regularly or incorporate mechanisms for incremental learning. === Feature robustness === Researchers differentiate between features that can be easily manipulated and those that are more resistant to modification. For example, simple static attributes, such as header fields, may be altered by attackers, while structural features, such as control-flow graphs, are generally more stable but computationally expensive to extract. === Class imbalance === In realistic deployment environments, the proportion of malicious samples can be extremely low, ranging from 0.01% to 2% of total data. This unbalanced distribution causes models to develop a bias towards the majority class, achieving high accuracy but failing to identify malicious samples. Prior approaches to this problem have included both data-level solutions and sequence-specific models. Methods like n-gram and Long Short-Term Memory (LSTM) networks can model sequential data, but their performance has been shown to decline significantly when malware samples are realistically proportioned in the training set, demonstrating the limitations in

    Read more →
  • Ontology-based data integration

    Ontology-based data integration

    Ontology-based data integration involves the use of one or more ontologies to effectively combine data or information from multiple heterogeneous sources. It is one of the multiple data integration approaches and may be classified as Global-As-View (GAV). The effectiveness of ontology‑based data integration is closely tied to the consistency and expressivity of the ontology used in the integration process. == Background == Data from multiple sources are characterized by multiple types of heterogeneity. The following hierarchy is often used: Syntactic heterogeneity: is a result of differences in representation format of data Schematic or structural heterogeneity: the native model or structure to store data differ in data sources leading to structural heterogeneity. Schematic heterogeneity that particularly appears in structured databases is also an aspect of structural heterogeneity. Semantic heterogeneity: differences in interpretation of the 'meaning' of data are source of semantic heterogeneity System heterogeneity: use of different operating system, hardware platforms lead to system heterogeneity Ontologies, as formal models of representation with explicitly defined concepts and named relationships linking them, are used to address the issue of semantic heterogeneity in data sources. In domains like bioinformatics and biomedicine, the rapid development, adoption and public availability of ontologies [1] Archived 2007-06-16 at the Wayback Machine has made it possible for the data integration community to leverage them for semantic integration of data and information. == The role of ontologies == Ontologies enable the unambiguous identification of entities in heterogeneous information systems and assertion of applicable named relationships that connect these entities together. Specifically, ontologies play the following roles: Content Explication The ontology enables accurate interpretation of data from multiple sources through the explicit definition of terms and relationships in the ontology. Query Model In some systems like SIMS, the query is formulated using the ontology as a global query schema. Verification The ontology verifies the mappings used to integrate data from multiple sources. These mappings may either be user specified or generated by a system. == Approaches using ontologies for data integration == There are three main architectures that are implemented in ontology‑based data integration applications, namely, Single ontology approach A single ontology is used as a global reference model in the system. This is the simplest approach as it can be simulated by other approaches. SIMS is a prominent example of this approach. The Structured Knowledge Source Integration component of Research Cyc is another prominent example of this approach. (Title = Harnessing Cyc to Answer Clinical Researchers' Ad Hoc Queries). The Gellish Taxonomic Dictionary-Ontology follows this approach as well. Multiple ontologies Multiple ontologies, each modeling an individual data source, are used in combination for integration. Though, this approach is more flexible than the single ontology approach, it requires creation of mappings between the multiple ontologies. Ontology mapping is a challenging issue and is focus of large number of research efforts in computer science [2]. The OBSERVER system is an example of this approach. Hybrid approaches The hybrid approach involves the use of multiple ontologies that subscribe to a common, top-level vocabulary. The top-level vocabulary defines the basic terms of the domain. Thus, the hybrid approach makes it easier to use multiple ontologies for integration in presence of the common vocabulary.

    Read more →
  • Virtual directory

    Virtual directory

    In computing, the term virtual directory has a couple of meanings. It may simply designate (for example in IIS) a folder which appears in a path but which is not actually a subfolder of the preceding folder in the path. However, this article will discuss the term in the context of directory services and identity management. A virtual directory or virtual directory server (VDS) in this context is a software layer that delivers a single access point for identity management applications and service platforms. A virtual directory operates as a high-performance, lightweight abstraction layer that resides between client applications and disparate types of identity-data repositories, such as proprietary and standard directories, databases, web services, and applications. A virtual directory receives queries and directs them to the appropriate data sources by abstracting and virtualizing data. The virtual directory integrates identity data from multiple heterogeneous data stores and presents it as though it were coming from one source. This ability to reach into disparate repositories makes virtual directory technology ideal for consolidating data stored in a distributed environment. As of 2011, virtual directory servers most commonly use the LDAP protocol, but more sophisticated virtual directories can also support SQL as well as DSML and SPML. Industry experts have heralded the importance of the virtual directory in modernizing the identity infrastructure. According to Dave Kearns of Network World, "Virtualization is hot and a virtual directory is the building block, or foundation, you should be looking at for your next identity management project." In addition, Gartner analyst, Bob Blakley said that virtual directories are playing an increasingly vital role. In his report, “The Emerging Architecture of Identity Management,” Blakley wrote: “In the first phase, production of identities will be separated from consumption of identities through the introduction of a virtual directory interface.” == Capabilities == Virtual directories can have some or all of the following capabilities: Aggregate identity data across sources to create a single point of access. Create high-availability for authoritative data stores. Act as identity firewall by preventing denial-of-service attacks on the primary data stores through an additional virtual layer. Support a common searchable namespace for centralized authentication. Present a unified virtual view of user information stored across multiple systems. Delegate authentication to backend sources through source-specific security means. Virtualize data sources to support migration from legacy data stores without modifying the applications that rely on them. Enrich identities with attributes pulled from multiple data stores, based on a link between user entries. Some advanced identity virtualization platforms can also: Enable application-specific, customized views of identity data without violating internal or external regulations governing identity data. Reveal contextual relationships between objects through hierarchical directory structures. Develop advanced correlation across diverse sources using correlation rules. Build a global user identity by correlating unique user accounts across various data stores, and enrich identities with attributes pulled from multiple data stores, based on a link between user entries. Enable constant data refresh for real-time updates through a persistent cache. == Advantages == Virtual directories: Enable faster deployment because users do not need to add and sync additional application-specific data sources Leverage existing identity infrastructure and security investments to deploy new services Deliver high availability of data sources Provide application-specific views of identity data which can help avoid the need to develop a master enterprise schema Allow a single view of identity data without violating internal or external regulations governing identity data Act as identity firewalls by preventing denial-of-service attacks on the primary data-stores and providing further security on access to sensitive data Can reflect changes made to authoritative sources in real-time Leverages existing update processes of authoritative sources, so no separate (sometimes manual) process to update a central directory is needed Present a unified virtual view of user information from multiple systems so that it appears to reside in a single system Can secure all backend storage locations with a single security policy == Disadvantages == An original disadvantage is public perception of "push & pull technologies" which is the general classification of "virtual directories" depending on the nature of their deployment. Virtual directories were initially designed and later deployed with "push technologies" in mind, which also contravened with privacy laws of the United States. This is no longer the case. There are, however, other disadvantages in the current technologies. The classical virtual directory based on proxy cannot modify underlying data structures or create new views based on the relationships of data from across multiple systems. So if an application requires a different structure, such as a flattened list of identities, or a deeper hierarchy for delegated administration, a virtual directory is limited. Many virtual directories cannot correlate same-users across multiple diverse sources in the case of duplicate users Virtual directories without advanced caching technologies cannot scale to heterogeneous, high-volume environments. == Sample terminology == Unify metadata: Extract schemas from the local data source, map them to a common format, and link the same identities from different data silos based on a unique identifier. Namespace joining: Create a single large directory by bringing multiple directories together at the namespace level. For instance, if one directory has the namespace "ou=internal,dc=domain,dc=com" and a second directory has the namespace "ou=external,dc=domain,dc=com," then creating a virtual directory with both namespaces is an example of namespace joining. Identity joining: Enrich identities with attributes pulled from multiple data stores, based on a link between user entries. For instance if the user joeuser exists in a directory as "cn=joeuser,ou=users" and in a database with a username of "joeuser" then the "joeuser" identity can be constructed from both the directory and the database. Data remapping: The translation of data inside of the virtual directory. For instance, mapping “uid” to “samaccountname,” so a client application that only supports a standard LDAP-compliant data source is able to search an Active Directory namespace, as well. Query routing: Route requests based on certain criteria, such as “write operations going to a master, while read operations are forwarded to replicas.” Identity routing: Virtual directories may support the routing of requests based on certain criteria (such as write operations going to a master while read operations being forwarded to replicas). Authoritative source: A "virtualized" data repository, such as a directory or database, that the virtual directory can trust for user data. Server groups: Group one or more servers containing the same data and functionality. A typical implementation is the multi-master, multi-replica environment in which replicas process "read" requests and are in one server group, while masters process "write" requests and are in another, so that servers are grouped by their response to external stimuli, even though all share the same data. == Use cases == The following are sample use cases of virtual directories: Integrating multiple directory namespaces to create a central enterprise directory. Supporting infrastructure integrations after mergers and acquisitions. Centralizing identity storage across the infrastructure, making identity information available to applications through various protocols (including LDAP, JDBC, and web services). Creating a single access point for web access management (WAM) tools. Enabling web single sign-on (SSO) across varied sources or domains. Supporting role-based, fine-grained authorization policies Enabling authentication across different security domains using each domain’s specific credential checking method. Improving secure access to information both inside and outside of the firewall.

    Read more →