AI App How To Use

AI App How To Use — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Truth discovery

    Truth discovery

    Truth discovery (also known as truth finding) is the process of choosing the actual true value for a data item when different data sources provide conflicting information about it. Several algorithms have been proposed to tackle this problem, ranging from simple methods like majority voting to more complex ones able to estimate the trustworthiness of data sources. Truth discovery problems can be divided into two sub-classes: single-truth and multi-truth. In the first case, only one true value is allowed for a data item (e.g. the birthday of a person or the capital city of a country), while in the second case multiple true values are allowed (e.g. the cast of a movie or the authors of a book). Typically, truth discovery is the last step of a data integration pipeline, performed once the schemas of the different data sources have been unified and the records referring to the same data item have been detected. == General principles == The abundance of data available on the web makes it increasingly likely that different sources provide (partially or completely) different values for the same data item. This, together with our growing reliance on data to make important decisions, motivates the need for good truth discovery algorithms. Many currently available methods rely on a voting strategy to define the true value of a data item. Nevertheless, recent studies have shown that relying only on majority voting can give wrong results for as many as 30% of the data items. The solution to this problem is to assess the trustworthiness of the sources and give more importance to votes coming from trusted sources. Ideally, supervised learning techniques could be used to assign a reliability score to sources after hand-crafted labeling of the provided values; unfortunately, this is not feasible, since the number of labeled examples needed is proportional to the number of sources, and in many applications the number of sources is prohibitively large.
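The difference between plain majority voting and trust-weighted voting can be illustrated with a minimal sketch. The source names, the claimed values and the trust scores below are hypothetical and chosen only for illustration; in practice the trust scores are not known in advance and have to be estimated.

```python
from collections import Counter, defaultdict


def majority_vote(claims):
    """Return the most frequently claimed value.

    claims maps a source name to the value it provides for one data item.
    """
    counts = Counter(claims.values())
    return counts.most_common(1)[0][0]


def weighted_vote(claims, trust):
    """Return the value whose supporting sources have the highest total trust.

    trust maps a source name to a hypothetical reliability score in [0, 1].
    """
    scores = defaultdict(float)
    for source, value in claims.items():
        scores[value] += trust[source]
    return max(scores, key=scores.get)


# Hypothetical claims about the capital of Australia.
claims = {"s1": "Canberra", "s2": "Sydney", "s3": "Sydney"}
trust = {"s1": 0.9, "s2": 0.3, "s3": 0.3}

print(majority_vote(claims))         # Sydney
print(weighted_vote(claims, trust))  # Canberra
```

Here two unreliable sources outvote one reliable source under majority voting, while the weighted vote recovers the value given by the trusted source.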
== Single-truth vs multi-truth discovery == Single-truth and multi-truth discovery are two very different problems. Single-truth discovery is characterized by the following properties: only one true value is allowed for each data item; different values provided for a given data item oppose each other; values and sources can be either correct or erroneous. In the multi-truth case, the following properties hold instead: the truth is composed of a set of values; different values could each provide a partial truth; claiming one value for a given data item does not imply opposing all the other values; the number of true values for each data item is not known a priori. Multi-truth discovery has unique features that make the problem more complex and should be taken into consideration when developing truth-discovery solutions. The examples below point out the main differences between the two cases. Knowing that in both examples the truth is provided by source 1, in the single-truth case (first table) we can say that sources 2 and 3 oppose the truth and as a result provide wrong values. In the second case (second table), on the other hand, sources 2 and 3 are neither correct nor erroneous: they provide a subset of the true values and do not oppose the truth. == Source trustworthiness == The vast majority of truth discovery methods are based on a voting approach: each source votes for a value of a certain data item and, at the end, the value with the highest vote is selected as the true one. In the more sophisticated methods, votes do not have the same weight for all the data sources: more importance is given to votes coming from trusted sources. Source trustworthiness is usually not known a priori but is estimated with an iterative approach.
At each step of the truth discovery algorithm, the trustworthiness score of each data source is refined, improving the assessment of the true values, which in turn leads to a better estimation of the trustworthiness of the sources. This process usually ends when all the values reach a convergence state. Source trustworthiness can be based on different metrics, such as the accuracy of the provided values, copying of values from other sources, and domain coverage. Detecting copying behaviors is very important: copying allows false values to spread easily, making truth discovery very hard, since many sources would vote for the wrong values. Systems usually decrease the weight of votes associated with copied values, or do not count them at all. == Single-truth methods == Most of the currently available truth discovery methods have been designed to work well only in the single-truth case. The characteristics of the most relevant types of single-truth methods, and the ways different systems model source trustworthiness, are reported below. === Majority voting === Majority voting is the simplest method: the most popular value is selected as the true one. It is commonly used as a baseline when assessing the performance of more complex methods. === Web-link based === These methods estimate source trustworthiness using a technique similar to the one used to measure the authority of web pages based on web links. The vote assigned to a value is computed as the sum of the trustworthiness of the sources that provide that particular value, while the trustworthiness of a source is computed as the sum of the votes assigned to the values that the source provides. === Information-retrieval based === These methods estimate source trustworthiness using similarity measures typically used in information retrieval.
Source trustworthiness is computed as the cosine similarity (or another similarity measure) between the set of values provided by the source and the set of values considered true (either selected in a probabilistic way or obtained from a ground truth). === Bayesian based === These methods use Bayesian inference to define the probability of a value being true conditioned on the values provided by all the sources: $P(v \mid \psi(o)) = \frac{P(\psi(o) \mid v) \cdot P(v)}{P(\psi(o))}$, where $v$ is a value provided for a data item $o$ and $\psi(o)$ is the set of the observed values provided by all the sources for that specific data item. The trustworthiness of a source is then computed based on the accuracy of the values that it provides. Other, more complex methods exploit Bayesian inference to detect copying behaviors and use these insights to better assess source trustworthiness. == Multi-truth methods == Due to its complexity, less attention has been devoted to the study of multi-truth discovery. Two types of multi-truth methods and their characteristics are reported below. === Bayesian based === These methods use Bayesian inference to define the probability of a group of values being true conditioned on the values provided by all the data sources. In this case, since there could be multiple true values for each data item, and sources can provide multiple values for a single data item, it is not possible to consider values individually. An alternative is to consider mappings and relations between sets of provided values and the sources providing them. The trustworthiness of a source is then computed based on the accuracy of the values that it provides. More sophisticated methods also consider domain coverage and copying behaviors to better estimate source trustworthiness.
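As an illustration of the single-truth Bayesian formula, the sketch below computes the posterior probability of each candidate value from the claims of several sources. It relies on simplifying assumptions that are not stated in the text: exactly one candidate is true, sources are independent, the prior over candidates is uniform, and a source that is wrong picks uniformly among the false candidates. The function name `posterior`, the source names and the accuracy values are hypothetical.

```python
def posterior(claims, accuracy, candidates):
    """Return P(v | observed claims) for every candidate value v.

    Simplifying assumptions (not from the source text): exactly one candidate
    is true, sources are independent, the prior over candidates is uniform,
    and a wrong source picks uniformly among the false candidates.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate values")
    n_false = len(candidates) - 1
    prior = 1.0 / len(candidates)
    joint = {}
    for v in candidates:
        likelihood = 1.0  # P(psi(o) | v), built source by source
        for source, claimed in claims.items():
            a = accuracy[source]
            likelihood *= a if claimed == v else (1.0 - a) / n_false
        joint[v] = likelihood * prior  # numerator of Bayes' rule
    evidence = sum(joint.values())  # P(psi(o)), the denominator
    return {v: p / evidence for v, p in joint.items()}


# Hypothetical claims about a birth year, with assumed source accuracies.
claims = {"s1": "1879", "s2": "1878", "s3": "1878"}
accuracy = {"s1": 0.9, "s2": 0.4, "s3": 0.4}
post = posterior(claims, accuracy, ["1878", "1879"])
print(max(post, key=post.get))  # 1879
```

Although two of the three sources claim 1878, their low assumed accuracy makes the value given by the accurate source more probable.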
=== Probabilistic Graphical Models based === These methods use probabilistic graphical models to automatically define the set of true values of a given data item and also to assess source quality without the need for any supervision. == Applications == Many real-world applications can benefit from the use of truth discovery algorithms. Typical domains of application include healthcare, crowd/social sensing, crowdsourcing aggregation, information extraction and knowledge base construction. Truth discovery algorithms could also be used to revolutionize the way in which web pages are ranked in search engines, moving from current methods based on link analysis, like PageRank, to procedures that rank web pages based on the accuracy of the information they provide.
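The iterative estimation of source trustworthiness, in the web-link based form described under single-truth methods (a value's vote is the sum of the trust of its sources, and a source's trust is the sum of the votes of its values), can be sketched as follows. This is a minimal illustration and not a specific published algorithm: the fixed iteration count, the normalization by the maximum trust and the toy claims are all assumptions.

```python
def iterative_truth_discovery(claims, iterations=20):
    """Estimate true values and source trust from (source, item, value) triples."""
    sources = {s for s, _, _ in claims}
    trust = {s: 1.0 for s in sources}  # start with equal trust
    votes = {}
    for _ in range(iterations):
        # Vote of a value = sum of the trust of the sources that provide it.
        votes = {}
        for s, item, value in claims:
            votes[(item, value)] = votes.get((item, value), 0.0) + trust[s]
        # Trust of a source = sum of the votes of the values it provides.
        trust = {s: 0.0 for s in sources}
        for s, item, value in claims:
            trust[s] += votes[(item, value)]
        # Normalize so that the scores do not grow without bound.
        top = max(trust.values())
        trust = {s: t / top for s, t in trust.items()}
    # For each item, keep the value with the highest vote.
    best = {}
    for (item, value), score in votes.items():
        if item not in best or score > best[item][1]:
            best[item] = (value, score)
    return {item: value for item, (value, _) in best.items()}, trust


# Hypothetical claims: item3 is a tie under plain majority voting.
claims = [
    ("A", "item1", "x"), ("B", "item1", "x"), ("C", "item1", "y"),
    ("A", "item2", "p"), ("B", "item2", "p"), ("C", "item2", "q"),
    ("A", "item3", "m"), ("C", "item3", "n"),
]
truths, trust = iterative_truth_discovery(claims)
print(truths)  # {'item1': 'x', 'item2': 'p', 'item3': 'm'}
```

Source A agrees with the majority on the first two items, so it accumulates more trust than source C, and that trust breaks the tie on the third item. A fixed number of iterations stands in for the convergence check a real system would use.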

  • Browsing

    Browsing

    Browsing is a kind of orienting strategy. It is supposed to identify something of relevance for the browsing organism. In the context of humans, it is a metaphor taken from the animal kingdom. It is used, for example, about people browsing open shelves in libraries, window shopping, or browsing databases or the Internet. In library and information science, it is an important subject, both purely theoretically and as an applied science aiming at designing interfaces which support browsing activities for the user. == Definition == In 2011, Birger Hjørland provided the following definition: "Browsing is a quick examination of the relevance of a number of objects which may or may not lead to a closer examination or acquisition/selection of (some of) these objects. It is a kind of orienting strategy that is formed by our 'theories', 'expectations' and 'subjectivity'." == Controversies == As with any kind of human psychology, browsing can be understood in biological, behavioral, or cognitive terms on the one hand or in social, historical, and cultural terms on the other hand. In 2007, Marcia Bates researched browsing from "behavioural" approaches, while Hjørland (2011a+b) defended a social view. Bates found that browsing is rooted in our history as exploratory, motile animals hunting for food and nesting opportunities. According to Hjørland (2011a), on the other hand, Marcia Bates' browsing for information about browsing is governed by her behavioral assumptions, while Hjørland's browsing for information about browsing is governed by his socio-cultural understanding of human psychology. In short: human browsing is based on our conceptions and interests. === Is browsing a random activity? === Browsing is often understood as a random activity. Dictionary.com, for example, has this definition: "to glance at random through a book, magazine, etc.". Hjørland suggests, however, that browsing is an activity that is governed by our metatheories.
We may dynamically change our theories and conceptions, but when we browse, the activity is governed by the interests, conceptions, priorities and metatheories that we have at that time. Therefore, browsing is not totally random. == Browsing versus analytical search strategies == In 1997, Gary Marchionini wrote: "A fundamental distinction is made between analytical and browsing strategies [...]. Analytical strategies depend on careful planning, the recall of query terms, and iterative query reformulations and examinations of results. Browsing strategies are heuristic and opportunistic and depend on recognizing relevant information. Analytic strategies are batch oriented and half duplex (turn talking) like human conversation, whereas browsing strategies are more interactive, real-time exchanges and collaborations between the information seeker and the information system. Browsing strategies demand a lower cognitive load in advance and a steadier attentional load throughout the information-seeking process." == Orienting strategies == Some sociologists, such as Berger and Zelditch in 1993, Wagner in 1984, and Wagner & Berger in 1985, have used the term "orienting strategies". They find that orienting strategies should be understood as metatheories: "Consider the very large proportion of sociological theory that is in the form of metatheory. It is discussion about theory: about what concepts it should include, about how those concepts should be linked, and about how theory should be studied. Similar to Kuhn’s paradigms, theories of this sort provide guidelines or strategies for understanding social phenomena and suggest the proper orientation of the theorist to these phenomena; they are orienting strategies. Textbooks in theory frequently focus on orienting strategies such as functionalism, exchange, or ethnomethodology." Sociologists thus use metatheories as orienting strategies.
We may generalize and say that all people use metatheories as orienting strategies and that this is what directs our attention and our browsing, even when we are not conscious of it.

  • Knowledge organization system

    Knowledge organization system

    Knowledge organization system (KOS), concept system, or concept scheme is the generic term used in knowledge organization (KO) for the selection of concepts with an indication of selected semantic relations. Despite their differences in type, coverage, and application, all KOS aim to support the organization of knowledge and information to facilitate their management and retrieval. KOS vary in complexity from simple sorted lists to complex relational networks. They represent both structural and functional features, and serve to eliminate ambiguity, control synonyms, establish relationships, and present properties. From their origins in library and information science (LIS), KOS have been applied to other domains and disciplines within science and industry, although scholarly research and debate remain primarily within the KO field. Challenges of KOS include ambiguity of terminology, repercussions of biased systems, and potential obsolescence. KOS can be expressed in RDF and RDFS as per the Simple Knowledge Organization System (SKOS) recommendation by W3C, which aims to enable the sharing and linking of KOS via the Web. One of the largest collections of KOS is the BARTOC registry. == Types == While different schema of KOS have been proposed, most are generally arranged in terms of the complexity of their construction and maintenance. Some scholars argue that organizing KOS on a spectrum oversimplifies the shared characteristics among them, and may even result in a non-ideal structure being chosen. The following types are not exhaustive, and are often not mutually-exclusive in practice. === Term lists === Term lists are the least structured form of KOS. They include lists, glossaries, dictionaries, and synonym rings. Authority files and gazetteers may also be considered term lists, however other scholars categorize them and directories as "metadata-like models". Examples include the Union List of Artist Names name authority file and the GeoNames gazetteer. 
=== Categorization and classification === KOS that emphasize specific (and often hierarchical) structures include subject headings, taxonomies, categorization schema, and classification schema & systems. Despite inconsistent use of the terms "categorization" and "classification" in some literature, categorization is generally loosely-assembled grouping schema and may include attributes that are not mutually exclusive (or having fuzzy boundaries), while classification is related to the arrangement of non-overlapping and mutually-exclusive classes. Classification schema may be universal (such as Dewey Decimal Classification and Information Coding Classification) or domain-specific (such as the National Library of Medicine Classification). === Relationship models === The types of KOS with greatest complexity and which utilize connections between concepts include thesauri, semantic networks, and ontologies. One of the most prominent examples of a semantic network is WordNet. === Others === Certain structures proposed to be considered types of KOS—but are not consistently included in schema—include folksonomies, topic maps, web directory structures, publication organization systems, and bibliometric maps. Some KOS organize other KOS themselves—for instance, PeriodO is a gazetteer of periodization categories. == Applications == Some early KOS were developed as a support system for abstracting and indexing services to be used by specially-trained searchers. With the growth of information digitization, usability became increasingly accessible, and more complex structures were developed. Prominent examples of KOS outside of LIS include organism taxonomy in biology, the periodic table of elements in chemistry, SIC and NAICS classification systems for industry & business, and AGROVOC agricultural controlled vocabulary. == Challenges == The study and design of KOS is an ongoing topic of discussion among KO scholars. 
=== Terminology === [There is] a serious lack of vocabulary control in the literature on controlled vocabulary. Inconsistency of terminology within the study of KOS is a common issue. For instance, "ontology" is used for both a specific type of KOS as well as a generic term for any KOS. The terms "taxonomy", "classification", and "categorization" are also sometimes used interchangeably. === Bias === As knowledge can be historically and culturally biased, scholars have also discussed how KOS themselves can perpetuate harmful practices or stereotypes. For example, a number of concerns and criticisms about the classification of mental disorders in the Diagnostic and Statistical Manual of Mental Disorders have been raised, contributing to ongoing revisions. Ethical and intentional design approaches have been proposed for multi-perspective KOS in efforts to mitigate bias and other harmful practices. === Obsolescence === The possible obsolescence of the thesaurus and other simpler KOS has been the topic of debate, especially in the face of increasingly complex ontologies, the growing usage of "Google-like retrieval systems", and the move of KO theory and research away from LIS and toward computer science. Supporters of thesauri argue its continued usefulness for metadata enrichment, vocabulary mapping, and web services, as well as its usage in specific domains such as corporate intranets and digital image libraries.

  • UI data binding

    UI data binding

    UI data binding is a software design pattern to simplify development of GUI applications. UI data binding binds UI elements to an application domain model. Most frameworks employ the Observer pattern as the underlying binding mechanism. To work efficiently, UI data binding has to address input validation and data type mapping. A bound control is a widget whose value is tied or bound to a field in a recordset (e.g., a column in a row of a table). Changes made to data within the control are automatically saved to the database when the control's exit event triggers. == Data binding frameworks and tools == === Delphi === DSharp (third-party data binding tool); OpenWire Visual Live Binding (third-party visual data binding tool). === Java === JFace Data Binding; JavaFX Property. === .NET === Windows Forms data binding overview; WPF data binding overview; Avalonia; Unity 3D data binding framework (available in modifications for NGUI, iGUI and EZGUI libraries). === JavaScript === Angular; AngularJS; Backbone.js; Ember.js; Datum.js; knockout.js; Meteor, via its Blaze live update engine; OpenUI5; React; Vue.js.
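A minimal sketch of two-way binding built on the Observer pattern is given below. It is not modelled on any of the frameworks listed above: the `Observable`, `TextBox` and `bind` names are hypothetical, and the `TextBox` class only stands in for a real toolkit widget. The sketch also shows where data type mapping and input validation fit in.

```python
class Observable:
    """A model field that notifies subscribers when its value changes."""

    def __init__(self, value):
        self._value = value
        self._observers = []

    def subscribe(self, callback):
        self._observers.append(callback)

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        if new_value == self._value:
            return  # no change: skip notifications and avoid update loops
        self._value = new_value
        for callback in self._observers:
            callback(new_value)


class TextBox:
    """Stand-in for a UI widget; a real toolkit would render self.text."""

    def __init__(self):
        self.text = ""
        self.on_edit = None  # handler invoked when the user edits the text

    def user_types(self, text):
        self.text = text
        if self.on_edit:
            self.on_edit(text)


def bind(model_field, widget, to_model=int, to_widget=str):
    """Bind a widget to a model field in both directions."""

    def model_changed(value):
        widget.text = to_widget(value)  # data type mapping: model -> UI

    def widget_edited(text):
        try:
            model_field.value = to_model(text)  # mapping: UI -> model
        except ValueError:
            # Validation failed: restore the widget from the model.
            widget.text = to_widget(model_field.value)

    model_field.subscribe(model_changed)
    widget.on_edit = widget_edited
    model_changed(model_field.value)  # initial synchronization


age = Observable(30)
box = TextBox()
bind(age, box)
age.value = 31         # the widget now shows "31"
box.user_types("42")   # the model value becomes 42
box.user_types("abc")  # rejected; the widget is restored to "42"
```

The equality check in the setter is one simple way to stop a model update and a widget update from triggering each other indefinitely.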

  • Symbolic regression

    Symbolic regression

    Symbolic regression (SR) is a type of regression analysis that searches the space of mathematical expressions to find the model that best fits a given dataset, both in terms of accuracy and simplicity. No particular model is provided as a starting point for symbolic regression. Instead, initial expressions are formed by randomly combining mathematical building blocks such as mathematical operators, analytic functions, constants, and state variables. Usually, a subset of these primitives will be specified by the person operating it, but that's not a requirement of the technique. The symbolic regression problem for mathematical functions has been tackled with a variety of methods, including recombining equations most commonly using genetic programming, as well as more recent methods utilizing Bayesian methods and neural networks. Another non-classical alternative method to SR is called Universal Functions Originator (UFO), which has a different mechanism, search-space, and building strategy. Further methods such as Exact Learning attempt to transform the fitting problem into a moments problem in a natural function space, usually built around generalizations of the Meijer-G function. By not requiring a priori specification of a model, symbolic regression isn't affected by human bias, or unknown gaps in domain knowledge. It attempts to uncover the intrinsic relationships of the dataset, by letting the patterns in the data itself reveal the appropriate models, rather than imposing a model structure that is deemed mathematically tractable from a human perspective. The fitness function that drives the evolution of the models takes into account not only error metrics (to ensure the models accurately predict the data), but also special complexity measures, thus ensuring that the resulting models reveal the data's underlying structure in a way that's understandable from a human perspective. 
This facilitates reasoning and favors the odds of getting insights about the data-generating system, as well as improving generalisability and extrapolation behaviour by preventing overfitting. Accuracy and simplicity may be left as two separate objectives of the regression (in which case the optimum solutions form a Pareto front), or they may be combined into a single objective by means of a model selection principle such as minimum description length. It has been proven that symbolic regression is an NP-hard problem. Nevertheless, if the sought-for equation is not too complex, it is possible to solve the symbolic regression problem exactly by generating every possible function (built from some predefined set of operators) and evaluating them on the dataset in question. == Difference from classical regression == While conventional regression techniques seek to optimize the parameters for a pre-specified model structure, symbolic regression avoids imposing prior assumptions, and instead infers the model from the data. In other words, it attempts to discover both model structures and model parameters. This approach has the disadvantage of having a much larger space to search: not only is the search space in symbolic regression infinite, but there are also an infinite number of models which will perfectly fit a finite data set (provided that the model complexity isn't artificially limited). This means that a symbolic regression algorithm may take longer to find an appropriate model and parametrization than traditional regression techniques. This can be attenuated by limiting the set of building blocks provided to the algorithm, based on existing knowledge of the system that produced the data; but in the end, using symbolic regression is a decision that has to be balanced with how much is known about the underlying system.
Nevertheless, this characteristic of symbolic regression also has advantages: because the evolutionary algorithm requires diversity in order to effectively explore the search space, the result is likely to be a selection of high-scoring models (and their corresponding sets of parameters). Examining this collection could provide better insight into the underlying process, and allows the user to identify an approximation that better fits their needs in terms of accuracy and simplicity. == Benchmarking == === SRBench === In 2021, SRBench was proposed as a large benchmark for symbolic regression. At its inception, SRBench featured 14 symbolic regression methods, 7 other ML methods, and 252 datasets from PMLB. The benchmark is intended to be a living project: it encourages the submission of improvements, new datasets, and new methods, to keep track of the state of the art in SR. === SRBench Competition 2022 === In 2022, SRBench announced the competition Interpretable Symbolic Regression for Data Science, which was held at the GECCO conference in Boston, MA. The competition pitted nine leading symbolic regression algorithms against each other on a novel set of data problems and considered different evaluation criteria. The competition was organized in two tracks, a synthetic track and a real-world data track. ==== Synthetic Track ==== In the synthetic track, methods were compared according to five properties: re-discovery of exact expressions; feature selection; resistance to local optima; extrapolation; and sensitivity to noise. The ranking of the methods was: 1. QLattice; 2. PySR (Python Symbolic Regression); 3. uDSR (Deep Symbolic Optimization). ==== Real-world Track ==== In the real-world track, methods were trained to build interpretable predictive models for 14-day forecast counts of COVID-19 cases, hospitalizations, and deaths in New York State. These models were reviewed by a subject expert, assigned trust ratings, and evaluated for accuracy and simplicity.
The ranking of the methods was: 1. uDSR (Deep Symbolic Optimization); 2. QLattice; 3. geneticengine (Genetic Engine). == Non-standard methods == Most symbolic regression algorithms prevent combinatorial explosion by implementing evolutionary algorithms that iteratively improve the best-fit expression over many generations. Recently, researchers have proposed algorithms utilizing other tactics in AI. Silviu-Marian Udrescu and Max Tegmark developed the "AI Feynman" algorithm, which attempts symbolic regression by training a neural network to represent the mystery function, then runs tests against the neural network to attempt to break up the problem into smaller parts. For example, if $f(x_1, \ldots, x_i, x_{i+1}, \ldots, x_n) = g(x_1, \ldots, x_i) + h(x_{i+1}, \ldots, x_n)$, tests against the neural network can recognize the separation and proceed to solve for $g$ and $h$ separately and with different variables as inputs. This is an example of divide and conquer, which reduces the size of the problem to be more manageable. AI Feynman also transforms the inputs and outputs of the mystery function in order to produce a new function which can be solved with other techniques, and performs dimensional analysis to reduce the number of independent variables involved. The algorithm was able to "discover" 100 equations from The Feynman Lectures on Physics, while Eureqa, a leading software package using evolutionary algorithms, solved only 71. AI Feynman, in contrast to classic symbolic regression methods, requires a very large dataset in order to first train the neural network and is naturally biased towards equations that are common in elementary physics.
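One simple way to test for the additive separation described above is sketched below. It is not necessarily the procedure used by the AI Feynman authors: it relies on the identity that, if $f = g + h$ over disjoint groups of variables, then swapping the second group between two points leaves the sum of the two function values unchanged. The function name `looks_additively_separable`, the sampling range and the tolerance are assumptions, and the plain Python functions stand in for the trained neural network.

```python
import math
import random


def looks_additively_separable(f, n_inputs, split, trials=200, tol=1e-9, seed=0):
    """Heuristically test whether f(x) = g(x[:split]) + h(x[split:]).

    If f is additively separable, then for any two points a and b:
        f(a_left, a_right) + f(b_left, b_right)
            == f(a_left, b_right) + f(b_left, a_right)
    A single violation proves that f is not separable at this split.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.uniform(1.0, 5.0) for _ in range(n_inputs)]
        b = [rng.uniform(1.0, 5.0) for _ in range(n_inputs)]
        lhs = f(*a) + f(*b)
        rhs = f(*(a[:split] + b[split:])) + f(*(b[:split] + a[split:]))
        if not math.isclose(lhs, rhs, rel_tol=tol, abs_tol=tol):
            return False
    return True


def separable(x1, x2, x3):
    return x1 * x2 + math.sin(x3)  # g(x1, x2) + h(x3)


def entangled(x1, x2, x3):
    return x1 * x2 * x3  # cannot be written as g(x1, x2) + h(x3)


print(looks_additively_separable(separable, 3, split=2))  # True
print(looks_additively_separable(entangled, 3, split=2))  # False
```

When the test passes, the two groups of variables can be handed to separate, smaller regression problems, which is the divide-and-conquer step described in the text.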

  • Very large database

    Very large database

    A very large database (originally written very large data base), or VLDB, is a database that contains a very large amount of data, so much that it can require specialized architectural, management, processing and maintenance methodologies. == Definition == The vague adjectives very and large allow for a broad and subjective interpretation, but attempts at defining a metric and threshold have been made. Early metrics were the size of the database in a canonical form via database normalization, or the time for a full database operation like a backup. Technology improvements have continually changed what is considered very large. One definition has suggested that a database has become a VLDB when it is "too large to be maintained within the window of opportunity… the time when the database is quiet". == Sizes of a VLDB database == There is no absolute amount of data that can be cited. For example, one cannot say that any database with more than 1 TB of data is considered a VLDB. The threshold has varied over time as computer processing, storage and backup methods have become better able to handle larger amounts of data. That said, VLDB issues may start to appear when 1 TB is approached, and are more than likely to have appeared once 30 TB or so is exceeded. == VLDB challenges == Key areas where a VLDB may present challenges include configuration, storage, performance, maintenance, administration, availability and server resources. === Configuration === Careful configuration of databases that lie in the VLDB realm is necessary to alleviate or reduce issues raised by VLDB databases. === Administration === The complexities of managing a VLDB can increase exponentially for the database administrator as database size increases.
=== Availability and maintenance === With a VLDB, maintenance and recovery operations such as database reorganizations and file copies, which were quite practical on a non-VLDB, take very significant amounts of time and resources. In particular, it is typically infeasible to meet a typical recovery time objective (RTO), the maximum time a database is expected to be unavailable due to interruption, by methods which involve copying files from disk or other storage archives. To overcome these issues, techniques such as clustering, cloned/replicated/standby databases, file snapshots, storage snapshots or a backup manager may help achieve the RTO and availability, although individual methods may have limitations, caveats, license, and infrastructure requirements, while some may risk data loss and not meet the recovery point objective (RPO). For many systems only geographically remote solutions may be acceptable. ==== Backup and recovery ==== Best practice is for backup and recovery to be architected in terms of the overall availability and business continuity solution. === Performance === Given the same infrastructure, there may typically be a decrease in performance, that is, an increase in response time, as database size increases. Some accesses will simply have more data to process (scan), which will take proportionally longer (linear time), while the indexes used to access data may grow slightly in height, requiring perhaps an extra storage access to reach the data (sub-linear time). Other effects can include caching becoming less efficient because proportionally less data can be cached; and while some indexes, such as the B+ tree, sustain growth well automatically, others, such as a hash table, may need to be rebuilt. Should an increase in database size cause the number of accessors of the database to increase, then more server and network resources may be consumed, and the risk of contention will increase.
Some solutions to regaining performance include partitioning, clustering, possibly with sharding, or use of a database machine. ==== Partitioning ==== Partitioning may be able to assist the performance of bulk operations on a VLDB, including backup and recovery and bulk movements due to information lifecycle management (ILM), while reducing contention and allowing optimization of some query processing. === Storage === In order to satisfy the needs of a VLDB, the database storage needs to have low access latency and contention, high throughput, and high availability. === Server resources === The increasing size of a VLDB may put pressure on server and network resources, and a bottleneck may appear that may require infrastructure investment to resolve. == Relationship to big data == VLDB is not the same as big data, but the storage aspect of big data may involve a VLDB database. That said, some of the storage solutions supporting big data were designed from the start to support large volumes of data, so database administrators may not encounter the VLDB issues that older versions of traditional RDBMSs might encounter.
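The contrast drawn in the performance section between linear scan cost and sub-linear index growth can be made concrete with a small calculation. The fanout of 200 entries per index node and the scan rate of one million rows per second are hypothetical round numbers, not measurements of any real system.

```python
def index_height(rows, fanout=200):
    """Approximate number of levels in a B+ tree-like index.

    Each level multiplies the number of addressable rows by the fanout.
    Integer arithmetic avoids floating-point rounding in the logarithm.
    """
    height, capacity = 1, fanout
    while capacity < rows:
        capacity *= fanout
        height += 1
    return height


def scan_seconds(rows, rows_per_second=1_000_000):
    """Time for a full scan at an assumed constant processing rate."""
    return rows / rows_per_second


for rows in (10**6, 10**9, 10**12):
    print(rows, index_height(rows), scan_seconds(rows))
# 1000000 3 1.0
# 1000000000 4 1000.0
# 1000000000000 6 1000000.0
```

Under these assumptions, growing the table a million-fold adds only three index levels, while the full scan time grows a million-fold.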

  • ISO 15926

    ISO 15926

    ISO 15926 is a standard for data integration, sharing, exchange, and hand-over between computer systems. The title, "Industrial automation systems and integration—Integration of life-cycle data for process plants including oil and gas production facilities", is regarded as too narrow by the present ISO 15926 developers. Once a generic data model and reference data library for process plants had been developed, it turned out that this subject is already so wide that actually any state information may be modelled with it. == History == In 1991 a European Union ESPRIT project, named ProcessBase, started. The focus of this research project was to develop a data model for lifecycle information of a facility that would suit the requirements of the process industries. By the time the project had ended, a consortium of companies involved in the process industries had been established: EPISTLE (European Process Industries STEP Technical Liaison Executive). Initially individual companies were members, but later this changed into a situation where three national consortia were the only members: PISTEP (UK), POSC/Caesar (Norway), and USPI-NL (Netherlands). (Later PISTEP merged into POSC/Caesar, and USPI-NL was renamed USPI.) EPISTLE took over the work of the ProcessBase project. Initially this work involved a standard called ISO 10303-221 (referred to as "STEP AP221"). AP221 included, for the first time, an Annex M with a list of standard instances of the AP221 data model, including types of objects. These standard instances would be for reference and would act as a knowledge base with knowledge about the types of objects. In the early nineties EPISTLE started an activity to extend Annex M to become a library of such object classes and their relationships: STEPlib. In the STEPlib activities a group of approx. 100 domain experts from all three member consortia, spread over the various areas of expertise (e.g.
Electrical, Piping, Rotating equipment, etc.), worked together to define the "core classes". The development of STEPlib was extended with many additional classes and relationships between classes and published as Open source data. Furthermore, the concepts and relation types from the AP221 and ISO 15926-2 data models were also added to the STEPlib dictionary. This resulted in the development of Gellish English, whereas STEPlib became the Gellish English dictionary. Gellish English is a structured subset of natural English and is a modeling language suitable for knowledge modeling, product modeling and data exchange. It differs from conventional modeling languages (meta languages) as used in information technology as it not only defines generic concepts, but also includes an English dictionary. The semantic expression capability of Gellish English was significantly increased by extending the number of relation types that can be used to express knowledge and information. For modelling-technical reasons POSC/Caesar proposed another standard than ISO 10303, called ISO 15926. EPISTLE (and ISO) supported that proposal, and continued the modelling work, thereby writing Part 2 of ISO 15926. This Part 2 has official ISO IS (International Standard) status since 2003. POSC/Caesar started to put together their own RDL (Reference Data Library). They added many specialized classes, for example for ANSI (American National Standards Institute) pipe and pipe fittings. Meanwhile, STEPlib continued its existence, mainly driven by some members of USPI. Since it was clear that it was not in the interest of the industry to have two libraries for, in essence, the same set of classes, the Management Board of EPISTLE decided that the core classes of the two libraries shall be merged into Part 4 of ISO 15926. This merging process has been finished. Part 4 should act as reference data for part 2 of ISO 15926 as well as for ISO 10303-221 and replaced its Annex M. 
On June 5, 2007 ISO 15926-4 was signed off as a TS (Technical Specification). In 1999 the work on an earlier version of Part 7 started. Initially this was based on XML Schema (the only useful W3C Recommendation available then), but when Web Ontology Language (OWL) became available it was clear that it provided a far more suitable environment for Part 7. Part 7 passed the first ISO ballot by the end of 2005, and an implementation project started. A formal ballot for TS (Technical Specification) was planned for December 2007. However, it was decided then to split Part 7 into more than one part, because the scope was too wide. == Need for ISO 15926 == In 2004, the National Institute of Standards and Technology (NIST) released a report on the impact of the lack of digital interoperability in the capital projects industry. The report estimated the cost of inadequate interoperability in the U.S. capital facilities industry to be $15.8 billion per year. This was considered likely to be a conservative figure. == The standard == ISO 15926 has thirteen parts (as of February 2022): Part 1 - Overview and fundamental principles; Part 2 - Data model; Part 3 - Reference data for geometry and topology; Part 4 - Reference Data, the terms used within facilities for the process industry; Part 6 - Methodology for the development and validation of reference data (under development); Part 7 - Template methodology; Part 8 - OWL/RDF implementation; Part 9 - Implementation standards, with the focus on standard web servers, web services, and security (under development); Part 10 - Conformance testing; Part 11 - Methodology for simplified industrial usage of reference data (under development); Part 12 - Life cycle integration ontology in Web Ontology Language (OWL2); Part 13 - Integrated lifecycle asset planning. === Description === The model and the library are suitable for representing lifecycle information about technical installations and their components. 
They can also be used for defining the terms used in product catalogs in e-commerce. Another, more limited, use of the standard is as a reference classification for harmonization purposes between shared databases and product catalogues that are not based on ISO 15926. The purpose of ISO 15926 is to provide a Lingua Franca for computer systems, thereby integrating the information produced by them. Although set up for the process industries with large projects involving many parties, and involving plant operations and maintenance lasting decades, the technology can be used by anyone willing to set up a proper vocabulary of reference data in line with Part 4. In Part 7 the concept of Templates is introduced. These are semantic constructs, using Part 2 entities, that represent a small piece of information. These constructs then are mapped to more efficient classes of n-ary relations that interlink the Nodes that are involved in the represented information. In Part 8 the Part 7 Templates are defined in OWL and instantiated in RDF. For validation and reasoning purposes all are represented in First-Order Logic as well. In Part 9 these Node and Template instances are stored in an RDF triple store, set up to a standard schema and an API. Each participating computer system maps its data from its internal format to such ISO-standard Node and Template instances. Data can be "handed over" from one triple store to another in cases where data custodianship is handed over (e.g. from a contractor to a plant owner, or from a manufacturer to the owners of the manufactured goods). Hand-over can be for a part of all data, whilst maintaining full referential integrity. Documents are user-definable. They are defined in XML Schema and they are, in essence, only a structure containing cells that make reference to instances of Templates. 
This represents a view on all lifecycle data: since the data model is a 4D (space-time) model, it is possible to present the data that was valid at any given point in time, thus providing a true historical record. It is expected that this will be used for Knowledge Mining. Data can be queried by means of SPARQL. In any implementation a restricted number of triple stores can be involved, with different access rights. This is done by means of creating a CPF Server (= Confederation of Participating Façades). An Ontology Browser allows for access to one or more triple stores in a given CPF, depending on the access rights. == Projects and applications == There are a number of projects working on the extension of the ISO 15926 standard in different application areas. === Capital-intensive projects === Within the application of Capital Intensive projects, some cooperating implementation projects are running: The DEXPI project: The objective of DEXPI is to develop and promote a general standard for the process industry covering all phases of the lifecycle of a (petro-)chemical plant, ranging from specification of functional requirements to assets in operation. Finalised projects include: The EDRC Project of FIATECH Capturing Equipment Data Requirements Using ISO 15926 and Assessing Conforma

    Read more →
  • Kinodynamic planning

    Kinodynamic planning

    In robotics and motion planning, kinodynamic planning is a class of problems for which velocity, acceleration, and force/torque bounds must be satisfied, together with kinematic constraints such as avoiding obstacles. The term was coined by Bruce Donald, Pat Xavier, John Canny, and John Reif. Donald et al. developed the first polynomial-time approximation schemes (PTAS) for the problem. By providing a provably polynomial-time ε-approximation algorithm, they resolved a long-standing open problem in optimal control. Their first paper considered time-optimal control ("fastest path") of a point mass under Newtonian dynamics, amidst polygonal (2D) or polyhedral (3D) obstacles, subject to state bounds on position, velocity, and acceleration. Later they extended the technique to many other cases, for example, to 3D open-chain kinematic robots under full Lagrangian dynamics. == Modern approaches == Since the foundational theoretical work of the 1990s, the field has evolved significantly with new algorithmic approaches that address the computational and practical limitations of early methods. === Sampling-based methods === Many practical heuristic algorithms based on stochastic optimization and iterative sampling have been developed by a wide range of authors to address the kinodynamic planning problem. Popular approaches include extensions of RRT algorithms to kinodynamic systems, and sampling-based methods like Model Predictive Path Integral (MPPI) control. These stochastic techniques have been shown to work well in practice and can handle complex, high-dimensional state spaces more efficiently than deterministic methods. However, all motion planning methods are subject to the PSPACE-hardness of classical motion planning even without dynamics, which means (assuming the usual structural complexity conjectures) they all can be worst-case exponential-time in the state-space dimension (the number of degrees of freedom). 
On the other hand, the deterministic methods have provable guarantees of completeness, accuracy, and complexity (for fixed dimension, they are polynomial-time not only in the geometric complexity, but also in $1/\varepsilon$, the closeness of the desired approximation), whereas most of the recent heuristic/stochastic methods sacrifice at least one of these criteria. === Mixed-integer optimization approaches === Recent advances in mixed-integer programming have enabled new deterministic approaches to kinodynamic planning. These methods formulate the planning problem as an optimization task that simultaneously determines the spatial path and control sequence while respecting all kinodynamic constraints. By using techniques such as McCormick envelopes to handle bilinear constraints, these approaches can provide globally optimal solutions with mathematical guarantees while achieving significant computational speedups over traditional methods. === Genetic algorithm approaches === Genetic algorithms have also been adapted for kinodynamic planning, particularly for gradient-free optimization in challenging terrain. These methods use evolutionary computation to optimize trajectories over receding horizons, with specialized mutation operators that ensure vehicle controls remain within operational limits. This approach is particularly useful when dealing with non-differentiable cost functions or when gradient information is unavailable or unreliable. === Three-dimensional terrain planning === The foundational theoretical work of the 1990s was extended to higher degrees of freedom, and even to $n$-link, 3D open-chain kinematic robots under full Lagrangian dynamics. However, many of the subsequent heuristic techniques (typically employing stochastic optimization) were confined to planar environments. 
More recent kinodynamic planning has extended beyond these planar environments to handle complex 3D terrains represented as simplicial complexes or triangular meshes. This advancement is particularly important for applications such as autonomous vehicle navigation in off-road environments, where elevation changes and terrain geometry significantly impact vehicle dynamics. These methods must account for pitch angles, surface curvature, and the coupling between terrain geometry and vehicle kinodynamic constraints. == Performance and guarantees == The landscape of performance guarantees in kinodynamic planning has evolved considerably. While early heuristic methods could not guarantee optimality, recent mixed-integer approaches have demonstrated the ability to find globally optimal solutions with proven constraint satisfaction. Experimental comparisons have shown that modern optimization-based planners can achieve execution times several orders of magnitude faster than sampling-based methods while maintaining strict adherence to kinodynamic constraints. However, the choice of method often depends on the specific application requirements. Sampling-based methods remain valuable for their ability to quickly find feasible solutions in high-dimensional spaces and their robustness to modeling uncertainties. Optimization-based methods excel when optimality guarantees and constraint compliance are critical, particularly in safety-critical applications. 
== Applications == Kinodynamic planning finds applications across numerous domains including: Autonomous vehicles: Path planning for cars, trucks, and other ground vehicles that must respect acceleration, steering, and velocity limits Aerial robotics: Trajectory planning for quadrotors and other unmanned aerial vehicles with dynamic constraints Manipulation: Planning for robotic arms where joint velocities, accelerations, and torques are limited Legged locomotion: Footstep and trajectory planning for walking and running robots Space robotics: Planning under thrust and fuel constraints for spacecraft and rovers
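The bounds that define a kinodynamic problem can be made concrete with a minimal sketch for the simplest setting mentioned in this entry, a point mass under Newtonian dynamics. The example below is only an illustration under stated assumptions: motion is restricted to one dimension, the control is a piecewise-constant acceleration, obstacles are ignored, and the function names `simulate` and `satisfies_bounds` are hypothetical rather than taken from any planner.

```python
def simulate(x0, v0, accels, dt):
    """Integrate a 1D point mass exactly under piecewise-constant acceleration.

    Returns the list of (position, velocity) states at the interval endpoints.
    """
    states = [(x0, v0)]
    x, v = x0, v0
    for a in accels:
        x += v * dt + 0.5 * a * dt * dt  # exact for constant acceleration
        v += a * dt
        states.append((x, v))
    return states


def satisfies_bounds(states, accels, v_max, a_max):
    """Check the velocity and acceleration bounds of the trajectory.

    Velocity is linear within each interval, so checking the endpoints
    is sufficient for the velocity bound.
    """
    accel_ok = all(abs(a) <= a_max for a in accels)
    vel_ok = all(abs(v) <= v_max for _, v in states)
    return accel_ok and vel_ok


# Hypothetical control sequence: accelerate, coast, then brake to a stop.
controls = [1.0, 1.0, 0.0, -1.0, -1.0]
trajectory = simulate(0.0, 0.0, controls, dt=1.0)
feasible = satisfies_bounds(trajectory, controls, v_max=2.0, a_max=1.0)
too_fast = satisfies_bounds(trajectory, controls, v_max=1.5, a_max=1.0)
```

A planner would search over such control sequences and reject any candidate for which the bound check fails; a full kinodynamic planner would additionally test each state against the obstacles.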

    Read more →
  • Cost-sensitive machine learning

    Cost-sensitive machine learning

    Cost-sensitive machine learning is an approach within machine learning that considers varying costs associated with different types of errors. This method diverges from traditional approaches by introducing a cost matrix, explicitly specifying the penalties or benefits for each type of prediction error. The inherent difficulty which cost-sensitive machine learning tackles is that minimizing different kinds of classification errors is a multi-objective optimization problem. == Overview == Cost-sensitive machine learning optimizes models based on the specific consequences of misclassifications, making it a valuable tool in various applications. It is especially useful in problems with a high imbalance in class distribution and a high imbalance in associated costs. Cost-sensitive machine learning introduces a scalar cost function in order to find one (of multiple) Pareto optimal points in this multi-objective optimization problem (similar to the weighted sum model). == Cost Matrix == The cost matrix is a crucial element within cost-sensitive modeling, explicitly defining the costs or benefits associated with different prediction errors in classification tasks. Represented as a table, the matrix aligns true and predicted classes, assigning a cost value to each combination. For instance, in binary classification, it may distinguish costs for false positives and false negatives. The utility of the cost matrix lies in its application to calculate the expected cost or loss. 
The formula, expressed as a double summation, utilizes joint probabilities: $$\text{Expected Loss}=\sum_{i}\sum_{j}P(\text{Actual}_{i},\text{Predicted}_{j})\cdot \text{Cost}_{\text{Actual}_{i},\text{Predicted}_{j}}$$ Here, $P(\text{Actual}_{i},\text{Predicted}_{j})$ denotes the joint probability of actual class $i$ and predicted class $j$, providing a nuanced measure that considers both the probabilities and associated costs. This approach allows practitioners to fine-tune models based on the specific consequences of misclassifications, adapting to scenarios where the impact of prediction errors varies across classes. == Applications == === Fraud Detection === In the realm of data science, particularly in finance, cost-sensitive machine learning is applied to fraud detection. By assigning different costs to false positives and false negatives, models can be fine-tuned to minimize the overall financial impact of misclassifications. === Medical Diagnostics === In healthcare, cost-sensitive machine learning plays a role in medical diagnostics. The approach allows for customization of models based on the potential harm associated with misdiagnoses, ensuring a more patient-centric application of machine learning algorithms. == Challenges == A typical challenge in cost-sensitive machine learning is the reliable determination of the cost matrix which may evolve over time. == Literature == Cost-Sensitive Machine Learning. USA, CRC Press, 2011. ISBN 9781439839287 Abhishek, K., Abdelaziz, D. M. (2023). Machine Learning for Imbalanced Data: Tackle Imbalanced Datasets Using Machine Learning and Deep Learning Techniques. (n.p.): Packt Publishing. ISBN 9781801070881
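The expected-loss double summation from the Cost Matrix section can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `expected_loss`, the binary fraud-style setting, and all numbers in the two matrices are hypothetical. Rows index the actual class and columns the predicted class.

```python
def expected_loss(joint, cost):
    """Return sum over i, j of P(actual=i, predicted=j) * cost[i][j]."""
    if len(joint) != len(cost) or any(
        len(p_row) != len(c_row) for p_row, c_row in zip(joint, cost)
    ):
        raise ValueError("joint and cost must have the same shape")
    if abs(sum(sum(row) for row in joint) - 1.0) > 1e-9:
        raise ValueError("joint probabilities must sum to 1")
    return sum(
        p * c
        for p_row, c_row in zip(joint, cost)
        for p, c in zip(p_row, c_row)
    )


# Hypothetical joint distribution P(actual, predicted); class 1 is the rare class.
joint = [
    [0.90, 0.05],  # actual 0: predicted 0, predicted 1 (false positive)
    [0.02, 0.03],  # actual 1: predicted 0 (false negative), predicted 1
]

# Hypothetical cost matrix: a false negative costs 50 times a false positive.
cost = [
    [0.0, 1.0],
    [50.0, 0.0],
]

loss = expected_loss(joint, cost)  # 0.05 * 1 + 0.02 * 50
```

With these assumed numbers the rare false negatives dominate the expected loss even though false positives are more frequent, which is the situation cost-sensitive learning is meant to capture.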

    Read more →
  • Physical schema

    Physical schema

    A physical data model (or database design) is a representation of a data design as implemented, or intended to be implemented, in a database management system. In the lifecycle of a project it typically derives from a logical data model, though it may be reverse-engineered from a given database implementation. A complete physical data model will include all the database artifacts required to create relationships between tables or to achieve performance goals, such as indexes, constraint definitions, linking tables, partitioned tables or clusters. Analysts can usually use a physical data model to calculate storage estimates; it may include specific storage allocation details for a given database system. As of 2012 seven main databases dominate the commercial marketplace: Informix, Oracle, Postgres, SQL Server, Sybase, IBM Db2 and MySQL. Other RDBMS systems tend either to be legacy databases or to be used within academia, such as universities or further education colleges. Physical data models for each implementation would differ significantly, not least due to the requirements of the underlying operating systems. For example, SQL Server originally ran only on Microsoft Windows operating systems (starting with SQL Server 2017 it also runs on Linux, with the same database engine and many similar features and services regardless of operating system), while Oracle and MySQL can run on Solaris, Linux and other UNIX-based operating systems as well as on Windows. This means that the disk requirements, security requirements and many other aspects of a physical data model will be influenced by the RDBMS that a database administrator (or an organization) chooses to use. == Physical schema == Physical schema is a term used in data management to describe how data is to be represented and stored (files, indices, etc.) in secondary storage using a particular database management system (DBMS) (e.g., Oracle RDBMS, Sybase SQL Server, etc.). 
In the ANSI/SPARC Architecture three schema approach, the internal schema is the view of data that involved data management technology. This is as opposed to an external schema that reflects an individual's view of the data, or the conceptual schema that is the integration of a set of external schemas. The logical schema was the way data were represented to conform to the constraints of a particular approach to database management. At that time the choices were hierarchical and network. Describing the logical schema, however, still did not describe how physically data would be stored on disk drives. That is the domain of the physical schema. Now logical schemas describe data in terms of relational tables and columns, object-oriented classes, and XML tags. A single set of tables, for example, can be implemented in numerous ways, up to and including an architecture where table rows are maintained on computers in different countries.

    Read more →
  • Single-source publishing

    Single-source publishing

    Single-source publishing, also known as single-sourcing publishing, is a content management method which allows the same source content to be used across different forms of media and more than one time. The labor-intensive and expensive work of editing need only be carried out once, on only one document; that source document (the single source of truth) can then be stored in one place and reused. This reduces the potential for error, as corrections are only made one time in the source document. The benefits of single-source publishing primarily relate to the editor rather than the user. The user benefits from the consistency that single-sourcing brings to terminology and information. This assumes the content manager has applied an organized conceptualization to the underlying content (A poor conceptualization can make single-source publishing less useful). Single-source publishing is sometimes used synonymously with multi-channel publishing though whether or not the two terms are synonymous is a matter of discussion. == Definition == While there is a general definition of single-source publishing, there is no single official delineation between single-source publishing and multi-channel publishing, nor are there any official governing bodies to provide such a delineation. Single-source publishing is most often understood as the creation of one source document in an authoring tool and converting that document into different file formats or human languages (or both) multiple times with minimal effort. Multi-channel publishing can either be seen as synonymous with single-source publishing, or similar in that there is one source document but the process itself results in more than a mere reproduction of that source. == History == The origins of single-source publishing lie, indirectly, with the release of Windows 3.0 in 1990. 
With the eclipsing of MS-DOS by graphical user interfaces, help files went from being unreadable text along the bottom of the screen to hypertext systems such as WinHelp. On-screen help interfaces allowed software companies to cease the printing of large, expensive help manuals with their products, reducing costs for both producer and consumer. This system raised opportunities as well, and many developers fundamentally changed the way they thought about publishing. Writers of software documentation did not simply move from being writers of traditional bound books to writers of electronic publishing, but rather they became authors of central documents which could be reused multiple times across multiple formats. The first single-source publishing project was started in 1993 by Cornelia Hofmann at Schneider Electric in Seligenstadt, using software based on Interleaf to automatically create paper documentation in multiple languages based on a single original source file. XML, developed during the mid- to late-1990s, was also significant to the development of single-source publishing as a method. XML, a markup language, allows developers to separate their documentation into two layers: a shell-like layer based on presentation and a core-like layer based on the actual written content. This method allows developers to write the content only one time while switching it in and out of multiple different formats and delivery methods. In the mid-1990s, several firms began creating and using single-source content for technical documentation (Boeing Helicopter, Sikorsky Aviation and Pratt & Whitney Canada) and user manuals (Ford owners manuals) based on tagged SGML and XML content generated using the Arbortext Epic editor with add-on functions developed by a contractor. 
The concept behind this usage was that complex, hierarchical content that did not lend itself to discrete componentization could be used across a variety of requirements by tagging the differences within a single document using the capabilities built into SGML and XML. Ford, for example, was able to tag its single owner's manual files so that 12 model years could be generated via a resolution script running on the single completed file. Pratt & Whitney, likewise, was able to tag up to 20 subsets of its jet engine manuals in single-source files, calling out the desired version at publication time. World Book Encyclopedia also used the concept to tag its articles for American and British versions of English. Starting from the early 2000s, single-source publishing was used with an increasing frequency in the field of technical translation. It is still regarded as the most efficient method of publishing the same material in different languages. Once a printed manual was translated, for example, the online help for the software program which the manual accompanies could be automatically generated using the method. Metadata could be created for an entire manual and individual pages or files could then be translated from that metadata with only one step, removing the need to recreate information or even database structures. Although single-source publishing is now decades old, its importance has increased urgently as of the 2010s. As consumption of information products rises and the number of target audiences expands, so does the work of developers and content creators. Within the industry of software and its documentation, there is a perception that the choice is to embrace single-source publishing or render one's operations obsolete. == Criticism == Editors using single-source publishing have been criticized for below-standard work quality, leading some critics to describe single-source publishing as the "conveyor belt assembly" of content creation. 
While heavily used in technical translation, there are risks of error in regard to indexing. While two words might be synonyms in English, they may not be synonyms in another language. In a document produced via single-sourcing, the index will be translated automatically and the two words will be rendered as synonyms. This is because they are synonyms in the source language, while in the target language they are not.

    Read more →
  • Social information architecture

    Social information architecture

    Social information architecture, also known as social iA, is a sub-domain of information architecture which deals with the social aspects of conceptualizing, modeling and organizing information. It has become more relevant because of the rise of social media and Web 2.0 in recent times. == Approach == There are different approaches to the explanation of social information architecture. === Architecture model (internal space) === Architects designing a physical community space have to consider how the architecture will shape social interactions. A long hallway of offices creates an utterly different dynamic than desks arranged in an open space. One might foster individuality, privacy, propriety; the other: collaboration, distraction, communalism. Still, physical spaces can be flexibly repurposed and worked around if the inhabitants desire a social dynamic not instantly afforded by the space. Office doors can be left open to invite easier interaction. Partitions can be raised between adjacent desks to limit distraction and increase privacy. That's physical architecture. The information architectures of online communities are far more deterministic and far less flexible. They literally define the social architecture by pre-specifying in immutable computer code what information you have access to, who you can talk to, where you can go. In the online world, information architecture = social architecture. === Social dialogue and information model (external space) === All major brands use information architecture to market their products online; it is then commonly wrapped under the umbrella phrase 'digital strategy'. Information architecture used for strategic purposes encompasses brand SEO, strategic placement of virals, social media presence etc. Charities, news outlets and social dialogue forums can make a much more specific use of the same tools for positive and important social purposes. 
Social information architecture is perceived as the socially conscious wing of commercial information architecture and functions to exchange information and ideas between people and groups. Social iA can pick up on conflicting issues that are treated with misunderstanding between cultures and leave individuals and societies vulnerable to exploitation and manipulation. Since the net has such a far reach, it is an obvious medium for meaningful and coordinated social dialogue. Examples of such issues are faith, environment, politics, climate change, war, injustice and other social challenges. Information architecture can help create frameworks in which sharing information brings people together, inspires and encourages them to participate in a forward-thinking and unfragmented way. One of its core activities is to spread messages that bring people from opposite sides of social and cultural spectrums together and to confront uncomfortable subjects head-on. == How does social information architecture work? == Social iA utilizes a variety of Web 2.0 applications to filter relevant or valuable information and weave it into appropriate information repositories or provide feedback to interesting channels. Social iA makes strategic use of search engines, social media, Google algorithms, as well as websites, video & news channels. It 'reads' or 'listens' to social conversations and search engine queries and engages with the net actively to gather clues about the world's pulse on the internet. It assesses data, social & political trends, and responds with targeted campaigns to give people ideas, as well as help people with making sense of information. == Principles == Dan Brown in his paper 8 Principles of Social Information Architecture lists the following principles: 1. The principle of objects: Treat content as a living, breathing thing, with a lifecycle, behaviors and attributes. 2. 
The principle of choices: Create pages that offer meaningful choices to users, keeping the range of choices available focused on a particular task. 3. The principle of disclosure: Show only enough information to help people understand what kinds of information they'll find as they dig deeper. 4. The principle of exemplars: Describe the contents of categories by showing examples of the contents. 5. The principle of front doors: Assume at least half of the website's visitors will come through some page other than the home page. 6. The principle of multiple classification: Offer users several different classification schemes to browse the site's content. 7. The principle of focused navigation: Don't mix apples and oranges in your navigation scheme. 8. The principle of growth: Assume the content you have today is a small fraction of the content you will have tomorrow. == What can social information architecture achieve? == Social information architecture has many potentials in terms of fostering social connections and how information is shared in social spaces on the web.

    Read more →
  • Conditional random field

    Conditional random field

    Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. The kind of graph used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions. Other examples where CRFs are used are: labeling or parsing of sequential data for natural language processing or biological sequences, part-of-speech tagging, shallow parsing, named entity recognition, gene finding, peptide critical functional region finding, and object recognition and image segmentation in computer vision. == Description == CRFs are a type of discriminative undirected probabilistic graphical model. Lafferty, McCallum and Pereira define a CRF on observations X {\displaystyle {\boldsymbol {X}}} and random variables Y {\displaystyle {\boldsymbol {Y}}} as follows: Let G = ( V , E ) {\displaystyle G=(V,E)} be a graph such that Y = ( Y v ) v ∈ V {\displaystyle {\boldsymbol {Y}}=({\boldsymbol {Y}}_{v})_{v\in V}} , so that Y {\displaystyle {\boldsymbol {Y}}} is indexed by the vertices of G {\displaystyle G} . 
Then $(\boldsymbol{X}, \boldsymbol{Y})$ is a conditional random field when each random variable $\boldsymbol{Y}_v$, conditioned on $\boldsymbol{X}$, obeys the Markov property with respect to the graph; that is, its probability is dependent only on its neighbours in $G$ and not its past states: $P(\boldsymbol{Y}_v \mid \boldsymbol{X}, \{\boldsymbol{Y}_w : w \neq v\}) = P(\boldsymbol{Y}_v \mid \boldsymbol{X}, \{\boldsymbol{Y}_w : w \sim v\})$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$. What this means is that a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets $\boldsymbol{X}$ and $\boldsymbol{Y}$, the observed and output variables, respectively; the conditional distribution $p(\boldsymbol{Y} \mid \boldsymbol{X})$ is then modeled. === Inference === For general graphs, the problem of exact inference in CRFs is intractable. The inference problem for a CRF is basically the same as for an MRF and the same arguments hold. However, there exist special cases for which exact inference is feasible: If the graph is a chain or a tree, message passing algorithms yield exact solutions. The algorithms used in these cases are analogous to the forward-backward and Viterbi algorithm for the case of HMMs. If the CRF only contains pair-wise potentials and the energy is submodular, combinatorial min cut/max flow algorithms yield exact solutions. If exact inference is impossible, several algorithms can be used to obtain approximate solutions.
These include: loopy belief propagation, alpha expansion, mean field inference, and linear programming relaxations. === Parameter learning === Learning the parameters $\theta$ is usually done by maximum likelihood learning for $p(Y_i \mid X_i; \theta)$. If all nodes have exponential family distributions and all nodes are observed during training, this optimization is convex. It can be solved for example using gradient descent algorithms, or Quasi-Newton methods such as the L-BFGS algorithm. On the other hand, if some variables are unobserved, the inference problem has to be solved for these variables. Exact inference is intractable in general graphs, so approximations have to be used. === Examples === In sequence modeling, the graph of interest is usually a chain graph. An input sequence of observed variables $X$ represents a sequence of observations and $Y$ represents a hidden (or unknown) state variable that needs to be inferred given the observations. The $Y_i$ are structured to form a chain, with an edge between each $Y_{i-1}$ and $Y_i$. As well as having a simple interpretation of the $Y_i$ as "labels" for each element in the input sequence, this layout admits efficient algorithms for: model training (learning the conditional distributions between the $Y_i$ and feature functions from some corpus of training data); decoding (determining the probability of a given label sequence $Y$ given $X$); and inference (determining the most likely label sequence $Y$ given $X$).
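Finding the most likely label sequence in a chain-structured model can be sketched with a Viterbi-style dynamic program. The following is a minimal illustration, not a reference implementation: it assumes the model's scores have already been reduced to two hypothetical tables of log-scores, `emission[i][y]` (how well label `y` fits position `i`) and `transition[a][b]` (how well label `b` follows label `a`). The function name and both tables are assumptions made for this example only.

```python
from typing import List, Sequence


def viterbi(emission: Sequence[Sequence[float]],
            transition: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring label sequence for a chain model.

    emission[i][y]   -- hypothetical log-score of label y at position i
    transition[a][b] -- hypothetical log-score of label b following label a
    """
    n = len(emission)
    num_labels = len(transition)
    if n == 0:
        return []

    # best[y] is the score of the best partial sequence ending in label y.
    best = list(emission[0])
    # back[i][y] is the predecessor label chosen for label y at position i + 1.
    back: List[List[int]] = []

    for i in range(1, n):
        new_best, pointers = [], []
        for y in range(num_labels):
            # Pick the previous label that maximises the running score.
            prev = max(range(num_labels),
                       key=lambda a: best[a] + transition[a][y])
            new_best.append(best[prev] + transition[prev][y] + emission[i][y])
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # Trace back from the best final label to recover the full sequence.
    label = max(range(num_labels), key=lambda a: best[a])
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    path.reverse()
    return path
```

The cost is proportional to the sequence length times the square of the number of labels, which is why the chain layout makes exact inference practical. With neutral transition scores the result simply follows the per-position scores, while strongly penalising label changes can keep the sequence on a single label.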
The conditional dependency of each $Y_i$ on $X$ is defined through a fixed set of feature functions of the form $f(i, Y_{i-1}, Y_i, X)$, which can be thought of as measurements on the input sequence that partially determine the likelihood of each possible value for $Y_i$. The model assigns each feature a numerical weight and combines them to determine the probability of a certain value for $Y_i$. Linear-chain CRFs have many of the same applications as conceptually simpler hidden Markov models (HMMs), but relax certain assumptions about the input and output sequence distributions. An HMM can loosely be understood as a CRF with very specific feature functions that use constant probabilities to model state transitions and emissions. Conversely, a CRF can loosely be understood as a generalization of an HMM that makes the constant transition probabilities into arbitrary functions that vary across the positions in the sequence of hidden states, depending on the input sequence. Notably, in contrast to HMMs, CRFs can contain any number of feature functions, the feature functions can inspect the entire input sequence $X$ at any point during inference, and the range of the feature functions need not have a probabilistic interpretation. == Variants == === Higher-order CRFs and semi-Markov CRFs === CRFs can be extended into higher order models by making each $Y_i$ dependent on a fixed number $k$ of previous variables $Y_{i-k}, \dots, Y_{i-1}$. In conventional formulations of higher order CRFs, training and inference are only practical for small values of $k$ (such as $k \leq 5$), since their computational cost increases exponentially with $k$.
However, another recent advance has managed to ameliorate these issues by leveraging concepts and tools from the field of Bayesian nonparametrics. Specifically, the CRF-infinity approach constitutes a CRF-type model that is capable of learning infinitely-long temporal dynamics in a scalable fashion. This is effected by introducing a novel potential function for CRFs that is based on the Sequence Memoizer (SM), a nonparametric Bayesian model for learning infinitely-long dynamics in sequential observations. To render such a model computationally tractable, CRF-infinity employs a mean-field approximation of the postulated novel potential functions (which are driven by an SM). This allows for devising efficient approximate training and inference algorithms for the model, without undermining its capability to capture and model temporal dependencies of arbitrary length. There exists another generalization of CRFs, the semi-Markov conditional random field (semi-CRF), which models variable-length segmentations of the label sequence $Y$. This provides much of the power of higher-order CRFs to model long-range dependencies of the $Y_i$, at a reasonable computational cost. Finally, large-margin models for structured prediction, such as the structured Support Vector Machine, can be seen as an alternative training procedure to CRFs. === Latent-dynamic conditional random field === Latent-dynamic conditional random fields (LDCRF) or discriminative probabilistic latent variable models (DPLVM) are a type of CRF for sequence tagging tasks. They are latent variable models that are trained discriminatively. In an LDCRF, as in any sequence tagging task, given a sequence of observations $\mathbf{x} = x_1, \dots, x_n$, the main problem the model must solve is how to assign a sequence of labels $\mathbf{y} = y_1, \dots, y_n$ from one finite set

  • Enterprise bus matrix

    The enterprise bus matrix is a data warehouse planning tool and model created by Ralph Kimball, and is part of the data warehouse bus architecture. The matrix is the logical definition of one of the core concepts of Kimball's approach to dimensional modeling: the conformed dimension. The bus matrix defines part of the data warehouse bus architecture and is an output of the business requirements phase in the Kimball lifecycle. It is applied in the subsequent phases of dimensional modeling and development of the data warehouse. The matrix can be categorized as a hybrid model, being part technical design tool, part project management tool and part communication tool. == Background == The need for an enterprise bus matrix stems from the way one goes about creating the overall data warehouse environment. Historically there have been two approaches: a structured, centralized and planned approach, and a more loosely defined, department-specific approach in which solutions are developed in a more independent manner. Autonomous projects can result in a range of isolated stovepipe data marts. Naturally each approach has its issues; the visionary approach often struggles with long delivery cycles and slow reaction times as needs emerge and scope issues arise. On the other hand, the development of isolated data marts leads to stovepipe systems that lack synergy in development. Over time this approach will lead to a so-called data-mart-in-a-box architecture where the lack of interoperability and cohesion is apparent, which can hinder the realization of an overall enterprise data warehouse. As an attempt to handle this issue, Ralph Kimball introduced the enterprise bus. == Description == The purpose of the bus matrix is one of high abstraction and visionary planning on the data warehouse architectural level.
By dictating coherency in the development and implementation of an overall data warehouse, the bus architecture approach enables an overall vision of the broader enterprise integration and consistency while at the same time dividing the problem into more manageable parts, all in a technology- and software-independent manner. The bus matrix and architecture build upon the concept of conformed dimensions, creating a structure of common dimensions that ideally can be used across the enterprise by all business processes related to the data warehouse and the corresponding fact tables from which they derive their context. According to Kimball and Margy Ross's article "Differences of Opinion", the enterprise data warehouse built on the bus architecture "identifies and enforces the relationship between business process metrics (facts) and descriptive attributes (dimensions)". The concept of a bus is well known in the language of information technology, and is what reflects the conformed dimension concept in the data warehouse, creating the skeletal structure where all parts of a system connect, ensuring interoperability and consistency of data while also allowing for future expansion. This makes the conformed dimensions act as the integration "glue", creating a robust backbone of the enterprise data warehouse.

  • Broadcast (parallel pattern)

    Broadcast is a collective communication primitive in parallel programming to distribute programming instructions or data to nodes in a cluster. It is the reverse operation of reduction. The broadcast operation is widely used in parallel algorithms, such as matrix-vector multiplication, Gaussian elimination and shortest paths. The Message Passing Interface implements broadcast in MPI_Bcast. == Definition == A message $M[1..m]$ of length $m$ should be distributed from one node to all other $p-1$ nodes. $T_{\text{byte}}$ is the time it takes to send one byte. $T_{\text{start}}$ is the time it takes for a message to travel to another node, independent of its length. Therefore, the time to send a package from one node to another is $t = \mathrm{size} \times T_{\text{byte}} + T_{\text{start}}$. $p$ is the number of nodes and the number of processors. == Binomial Tree Broadcast == With Binomial Tree Broadcast the whole message is sent at once. Each node that has already received the message sends it on further. The number of nodes holding the message grows exponentially, since the number of sending nodes doubles in each time step. The algorithm is ideal for short messages but falls short with longer ones, because during the first transfer only one node is busy. Sending a message to all nodes takes $\log_2(p)\,t$ time, which results in a runtime of $\log_2(p)\,(m T_{\text{byte}} + T_{\text{start}})$. == Linear Pipeline Broadcast == The message is split up into $k$ packages and sent piecewise from node $n$ to node $n+1$.
The time needed to distribute the first message piece is $pt$, where $t = \frac{m}{k} T_{\text{byte}} + T_{\text{start}}$ is the time needed to send a package from one processor to another. Sending a whole message takes $(p+k)\left(\frac{m T_{\text{byte}}}{k} + T_{\text{start}}\right) = (p+k)t = pt + kt$. The optimal choice is $k = \sqrt{\frac{m(p-2) T_{\text{byte}}}{T_{\text{start}}}}$, resulting in a runtime of approximately $m T_{\text{byte}} + p T_{\text{start}} + \sqrt{m p T_{\text{start}} T_{\text{byte}}}$. The run time depends not only on the message length but also on the number of processors involved. This approach shines when the length of the message is much larger than the number of processors. == Pipelined Binary Tree Broadcast == This algorithm combines Binomial Tree Broadcast and Linear Pipeline Broadcast, which makes the algorithm work well for both short and long messages. The aim is to have as many nodes work as possible while maintaining the ability to send short messages quickly. A good approach is to use Fibonacci trees for splitting up the tree, which are a good choice as a message cannot be sent to both children at the same time. This results in a binary tree structure. We will assume in the following that communication is full-duplex. The Fibonacci tree structure has a depth of about $d \approx \log_{\Phi}(p)$, where $\Phi = \frac{1+\sqrt{5}}{2}$ is the golden ratio. The resulting runtime is $\left(\frac{m}{k} T_{\text{byte}} + T_{\text{start}}\right)(d + 2k - 2)$. The optimal choice is $k = \sqrt{\frac{m(d-2) T_{\text{byte}}}{3 T_{\text{start}}}}$.
This results in a runtime of $2 m T_{\text{byte}} + T_{\text{start}} \log_{\Phi}(p) + \sqrt{2 m \log_{\Phi}(p)\, T_{\text{start}} T_{\text{byte}}}$. == Two Tree Broadcast (23-Broadcast) == === Definition === This algorithm aims to improve on some disadvantages of tree structure models with pipelines. Normally in tree structure models with pipelines (see the methods above), leaves only receive their data and cannot contribute to sending and spreading data. The algorithm concurrently uses two binary trees to communicate over. Those trees will be called tree A and tree B. Structurally, binary trees have relatively more leaf nodes than inner nodes. The basic idea of this algorithm is to make each leaf node of tree A an inner node of tree B, and likewise in the opposite direction from tree B to tree A. This means that two packets are sent and received by inner nodes and leaves in different steps. === Tree construction === The number of steps needed to construct two parallel-working binary trees depends on the number of processors. As with other structures, one processor can be the root node, which sends messages to both trees. It is not necessary to set a root node, because it is not hard to recognize that the direction of sending messages in a binary tree is normally top to bottom. There is no limitation on the number of processors used to build two binary trees. Let the height of the combined tree be $h = \lceil \log(p+2) \rceil$. Trees A and B can have a height of $h-1$. In particular, if the number of processors is $p = 2^{h} - 1$, we can build both trees together with a root node. To construct this model efficiently and easily with a fully built tree, we can use two methods called "Shifting" and "Mirroring" to obtain the second tree. Assume tree A is already modeled and tree B is to be constructed based on tree A.
We assume that we have $p$ processors ordered from $0$ to $p-1$. ==== Shifting ==== The "Shifting" method first copies tree A and moves every node one position to the left to get tree B. The node that would be located at position $-1$ becomes a child of processor $p-2$. ==== Mirroring ==== "Mirroring" is ideal for an even number of processors. With this method tree B can be constructed more easily from tree A, because no structural transformations are needed to create the new tree. In addition, the symmetric process makes this approach simple. This method can also handle an odd number of processors; in this case, we can set processor $p-1$ as the root node for both trees. For the remaining processors "Mirroring" can be used. === Coloring === We need to find a schedule that makes sure that no processor has to send or receive two messages from two trees in the same step. An edge is a communication connection between two nodes, and can be labelled as either 0 or 1 to make sure that every processor can alternate between 0-labelled and 1-labelled edges. The edges of A and B can be colored with two colors (0 and 1) such that no processor is connected to its parent nodes in A and B using edges of the same color, and no processor is connected to its children nodes in A or B using edges of the same color. In every even step the edges labelled 0 are activated, and in every odd step the edges labelled 1 are activated. === Time complexity === In this case the number of packets $k$ is divided in half for each tree. Both trees work together, so the total number of packets is $k = k/2 + k/2$ (upper tree + bottom tree). In each binary tree, sending a message to the other nodes takes $2i$ steps until a processor has at least one packet in step $i$.
Therefore, we can calculate the number of steps using $d := \log_2(p+1)$, where $\log_2(p+1) \approx \log_2(p)$. The resulting run time is $T(m,p,k) \approx \left(\frac{m}{k} T_{\text{byte}} + T_{\text{start}}\right)(2d + k - 1)$, with optimal $k = \sqrt{m(2d-1) T_{\text{byte}} / T_{\text{start}}}$. This results in a run time of $T(m,p) \approx m T_{\text{byte}} + T_{\text{start}} \cdot 2\log_2(p) + \sqrt{m \cdot 2\log_2(p)\, T_{\text{start}} T_{\text{byte}}}$. == ESBT-Broadcasting (Edge-disjoint Spanning Binomial Trees) == In this section, another broadcasting algorithm with an underlying telephone communication model will be introduced. A hypercube forms a network with $p = 2^{d}$ nodes ($d = 0, 1, 2, 3, \dots$). Every node is represented by a binary string of $0$s and $1$s, depending on the number of dimensions. Fundamentally, ESBT (Edge-disjoint Spanning Binomial Trees) is based on hypercube graphs, pipelining ($m$ messages are divided into $k$ packets) and binomial trees. The processor $0^{d}$ cyclically spreads packets to roots of ESB
