AI Analytics Data

AI Analytics Data — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Comparison of video editing software

    Comparison of video editing software

    This is a comparison of non-linear video editing software applications. See also a more complete list of video editing software. == General information == This table gives basic general information about the different editors: === Active === === Discontinued / Inactive === ==== Definition ==== professional: used for full length Hollywood movies; professional (small): mainly used for paid commercials, short films or podcasts/YouTube channels; prosumer: Mainly targeting private use, anything that can do more than just trimming a film; basic: trimming a film; == System requirements == This table lists the operating systems that different editors can run on without emulation, as well as other system requirements. Note that minimum system requirements are listed; some features (like High Definition support) may be unavailable with these specifications. "Unix" includes the similar Linux, BSD and Unix-like operating systems. == High definition/High resolution import == The table below indicates the ability of each program to import various High Definition video or High resolution video formats for editing. == Feature set == == Output options == Please note that recording to Blu-ray does not imply 1080@50p/60p . Most only support up to 1080i 25/30 frames per second recording. Also not all formats can be output.

    Read more →
  • Higuchi dimension

    Higuchi dimension

    In fractal geometry, the Higuchi dimension (or Higuchi fractal dimension (HFD)) is an approximate value for the box-counting dimension of the graph of a real-valued function or time series. This value is obtained via an algorithmic approximation so one also talks about the Higuchi method. It has many applications in science and engineering and has been applied to subjects like characterizing primary waves in seismograms, clinical neurophysiology and analyzing changes in the electroencephalogram in Alzheimer's disease. == Formulation of the method == The original formulation of the method is due to T. Higuchi. Given a time series X : { 1 , … , N } → R {\displaystyle X:\{1,\dots ,N\}\to \mathbb {R} } consisting of N {\displaystyle N} data points and a parameter k m a x ≥ 2 {\displaystyle k_{\mathrm {max} }\geq 2} the Higuchi Fractal dimension (HFD) of X {\displaystyle X} is calculated in the following way: For each k ∈ { 1 , … , k m a x } {\displaystyle k\in \{1,\dots ,k_{\mathrm {max} }}\} and m ∈ { 1 , … , k } {\displaystyle m\in \{1,\dots ,k}\} define the length L m ( k ) {\displaystyle L_{m}(k)} by L m ( k ) = N − 1 ⌊ N − m k ⌋ k 2 ∑ i = 1 ⌊ N − m k ⌋ | X N ( m + i k ) − X N ( m + ( i − 1 ) k ) | . {\displaystyle L_{m}(k)={\frac {N-1}{\lfloor {\frac {N-m}{k}}\rfloor k^{2}}}\sum _{i=1}^{\lfloor {\frac {N-m}{k}}\rfloor }|X_{N}(m+ik)-X_{N}(m+(i-1)k)|.} The length L ( k ) {\displaystyle L(k)} is defined by the average value of the k {\displaystyle k} lengths L 1 ( k ) , … , L k ( k ) {\displaystyle L_{1}(k),\dots ,L_{k}(k)} , L ( k ) = 1 k ∑ m = 1 k L m ( k ) . {\displaystyle L(k)={\frac {1}{k}}\sum _{m=1}^{k}L_{m}(k).} The slope of the best-fitting linear function through the data points { ( log ⁡ 1 k , log ⁡ L ( k ) ) } {\displaystyle \left\{\left(\log {\frac {1}{k}},\log L(k)\right)\right\}} is defined to be the Higuchi fractal dimension of the time-series X {\displaystyle X} . == Application to functions == For a real-valued function f : [ 0 , 1 ] → R {\displaystyle f:[0,1]\to \mathbb {R} } one can partition the unit interval [ 0 , 1 ] {\displaystyle [0,1]} into N {\displaystyle N} equidistantly intervals [ t j , t j + 1 ) {\displaystyle [t_{j},t_{j+1})} and apply the Higuchi algorithm to the times series X ( j ) = f ( t j ) {\displaystyle X(j)=f(t_{j})} . This results into the Higuchi fractal dimension of the function f {\displaystyle f} . It was shown that in this case the Higuchi method yields an approximation for the box-counting dimension of the graph of f {\displaystyle f} as it follows a geometrical approach (see Liehr & Massopust 2020). == Robustness and stability == Applications to fractional Brownian functions and the Weierstrass function reveal that the Higuchi fractal dimension can be close to the box-dimension. On the other hand, the method can be unstable in the case where the data X ( 1 ) , … , X ( N ) {\displaystyle X(1),\dots ,X(N)} are periodic or if subsets of it lie on a horizontal line (see Liehr & Massopust 2020).

    Read more →
  • Operational data store

    Operational data store

    An operational data store (ODS) is used for operational reporting and as a source of data for the enterprise data warehouse (EDW). It is a complementary element to an EDW in a decision support environment, and is used for operational reporting, controls, and decision making, as opposed to the EDW, which is used for tactical and strategic decision support. An ODS is a database designed to integrate data from multiple sources for additional operations on the data, for reporting, controls and operational decision support. Unlike a production master data store, the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting. An ODS should not be confused with an enterprise data hub (EDH). An operational data store will take transactional data from one or more production systems and loosely integrate it, in some respects it is still subject oriented, integrated and time variant, but without the volatility constraints. This integration is mainly achieved through the use of EDW structures and content. An ODS is not an intrinsic part of an EDH solution, although an EDH may be used to subsume some of the processing performed by an ODS and the EDW. An EDH is a broker of data. An ODS is certainly not. Because the data originates from multiple sources, the integration often involves cleaning, resolving redundancy and checking against business rules for integrity. An ODS is usually designed to contain low-level or atomic (indivisible) data (such as transactions and prices) with limited history that is captured "real time" or "near real time" as opposed to the much greater volumes of data stored in the data warehouse generally on a less-frequent basis. == General use == The general purpose of an ODS is to integrate data from disparate source systems in a single structure, using data integration technologies like data virtualization, data federation, or extract, transform, and load (ETL). This will allow operational access to the data for operational reporting, master data or reference data management. An ODS is not a replacement or substitute for a data warehouse or for a data hub but in turn could become a source.

    Read more →
  • Car–Parrinello molecular dynamics

    Car–Parrinello molecular dynamics

    Car–Parrinello molecular dynamics (CPMD) refers to either a method used in molecular dynamics (also known as the Car–Parrinello method) or the computational chemistry software package used to implement this method. The CPMD method is one of the major methods for calculating ab initio molecular dynamics (ab initio MD or AIMD). Ab initio molecular dynamics (AIMD) is a computational method that uses first principles through quantum mechanics to simulate the motion of atoms in a system. It is a type of molecular dynamics (MD) simulation that does not rely on empirical potentials or force fields to describe the interactions between atoms, but rather calculates these interactions entirely from the electronic structure of the system using quantum mechanics. In an ab initio MD simulation, the total energy of the system is calculated at each time step using density functional theory (DFT), Hartree-Fock (HF), or other electronic structure calculation methods. The forces acting on each atom are then determined from the gradient of the energy with respect to the atomic coordinates, and the equations of motion are solved to predict the trajectory of the atoms. AIMD permits chemical bond breaking and forming events to occur and accounts for electronic polarization effect. Therefore, Ab initio MD simulations can be used to study a wide range of phenomena, including the structural, thermodynamic, and dynamic properties of materials and chemical reactions. They are particularly useful for systems that are not well described by empirical potentials or force fields, such as systems with strong electronic correlation or systems with many degrees of freedom. However, ab initio MD simulations are computationally demanding and require significant computational resources. The CPMD method is related to the more common Born–Oppenheimer molecular dynamics (BOMD) method in that the quantum mechanical effect of the electrons is included in the calculation of energy and forces for the classical motion of the nuclei. CPMD and BOMD are different types of AIMD. However, whereas BOMD treats the electronic structure problem within the time-independent Schrödinger equation, CPMD explicitly includes the electrons as active degrees of freedom, via (fictitious) dynamical variables. The software is a parallelized plane wave / pseudopotential implementation of density functional theory, particularly designed for ab initio molecular dynamics. == Car–Parrinello method == The Car–Parrinello method is a type of molecular dynamics, usually employing periodic boundary conditions, planewave basis sets, and density functional theory, proposed by Roberto Car and Michele Parrinello in 1985 while working at SISSA, who were subsequently awarded the Dirac Medal by ICTP in 2009. In contrast to Born–Oppenheimer molecular dynamics wherein the nuclear (ions) degree of freedom are propagated using ionic forces which are calculated at each iteration by approximately solving the electronic problem with conventional matrix diagonalization methods, the Car–Parrinello method explicitly introduces the electronic degrees of freedom as (fictitious) dynamical variables, writing an extended Lagrangian for the system which leads to a system of coupled equations of motion for both ions and electrons. In this way, an explicit electronic minimization at each time step, as done in Born–Oppenheimer MD, is not needed: after an initial standard electronic minimization, the fictitious dynamics of the electrons keeps them on the electronic ground state corresponding to each new ionic configuration visited along the dynamics, thus yielding accurate ionic forces. In order to maintain this adiabaticity condition, it is necessary that the fictitious mass of the electrons is chosen small enough to avoid a significant energy transfer from the ionic to the electronic degrees of freedom. This small fictitious mass in turn requires that the equations of motion are integrated using a smaller time step than the one (1–10 fs) commonly used in Born–Oppenheimer molecular dynamics. Currently, the CPMD method can be applied to systems that consist of a few tens or hundreds of atoms and access timescales on the order of tens of picoseconds. == General approach == In CPMD the core electrons are usually described by a pseudopotential and the wavefunction of the valence electrons are approximated by a plane wave basis set. The ground state electronic density (for fixed nuclei) is calculated self-consistently, usually using the density functional theory method. Kohn-Sham equations are often used to calculate the electronic structure, where electronic orbitals are expanded in a plane-wave basis set. Then, using that density, forces on the nuclei can be computed, to update the trajectories (using, e.g. the Verlet integration algorithm). In addition, however, the coefficients used to obtain the electronic orbital functions can be treated as a set of extra spatial dimensions, and trajectories for the orbitals can be calculated in this context. == Fictitious dynamics == CPMD is an approximation of the Born–Oppenheimer MD (BOMD) method. In BOMD, the electrons' wave function must be minimized via matrix diagonalization at every step in the trajectory. CPMD uses fictitious dynamics to keep the electrons close to the ground state, preventing the need for a costly self-consistent iterative minimization at each time step. The fictitious dynamics relies on the use of a fictitious electron mass (usually in the range of 400 – 800 a.u.) to ensure that there is very little energy transfer from nuclei to electrons, i.e. to ensure adiabaticity. Any increase in the fictitious electron mass resulting in energy transfer would cause the system to leave the ground-state BOMD surface. === Lagrangian === L = 1 2 ( ∑ I n u c l e i M I R ˙ I 2 + μ ∑ i o r b i t a l s ∫ d r | ψ ˙ i ( r , t ) | 2 ) − E [ { ψ i } , { R I } ] + ∑ i j Λ i j ( ∫ d r ψ i ψ j − δ i j ) , {\displaystyle {\mathcal {L}}={\frac {1}{2}}\left(\sum _{I}^{\mathrm {nuclei} }\ M_{I}{\dot {\mathbf {R} }}_{I}^{2}+\mu \sum _{i}^{\mathrm {orbitals} }\int d\mathbf {r} \ |{\dot {\psi }}_{i}(\mathbf {r} ,t)|^{2}\right)-E\left[\{\psi _{i}\},\{\mathbf {R} _{I}\}\right]+\sum _{ij}\Lambda _{ij}\left(\int d\mathbf {r} \ \psi _{i}\psi _{j}-\delta _{ij}\right),} where μ {\displaystyle \mu } is the fictitious mass parameter; E[{ψi},{RI}] is the Kohn–Sham energy density functional, which outputs energy values when given Kohn–Sham orbitals and nuclear positions. === Orthogonality constraint === ∫ d r ψ i ∗ ( r , t ) ψ j ( r , t ) = δ i j , {\displaystyle \int d\mathbf {r} \ \psi _{i}^{}(\mathbf {r} ,t)\psi _{j}(\mathbf {r} ,t)=\delta _{ij},} where δij is the Kronecker delta. === Equations of motion === The equations of motion are obtained by finding the stationary point of the Lagrangian under variations of ψi and RI, with the orthogonality constraint. M I R ¨ I = − ∇ I E [ { ψ i } , { R I } ] {\displaystyle M_{I}{\ddot {\mathbf {R} }}_{I}=-\nabla _{I}\,E\left[\{\psi _{i}\},\{\mathbf {R} _{I}\}\right]} μ ψ ¨ i ( r , t ) = − δ E δ ψ i ∗ ( r , t ) + ∑ j Λ i j ψ j ( r , t ) , {\displaystyle \mu {\ddot {\psi }}_{i}(\mathbf {r} ,t)=-{\frac {\delta E}{\delta \psi _{i}^{}(\mathbf {r} ,t)}}+\sum _{j}\Lambda _{ij}\psi _{j}(\mathbf {r} ,t),} where Λij is a Lagrangian multiplier matrix to comply with the orthonormality constraint. === Born–Oppenheimer limit === In the formal limit where μ → 0, the equations of motion approach Born–Oppenheimer molecular dynamics. == Software packages == There are a number of software packages available for performing AIMD simulations. Some of the most widely used packages include: CP2K: an open-source software package for AIMD. Quantum Espresso: an open-source package for performing DFT calculations. It includes a module for AIMD. VASP: a commercial software package for performing DFT calculations. It includes a module for AIMD. Gaussian: a commercial software package that can perform AIMD. NWChem: an open-source software package for AIMD. LAMMPS: an open-source software package for performing classical and ab initio MD simulations. SIESTA: an open-source software package for AIMD. ORCA: a general-purpose quantum chemistry package. == Applications == Studying the behavior of water across different environments, such as near a hydrophobic graphene sheet. Investigating the structure and dynamics of liquid water at ambient temperature. Solving the heat transfer problems (heat conduction and thermal radiation), such as in Si/Ge superlattices. Probing the proton transfer along hydrogen-bonds in different environments, such as in 1D water chains inside carbon nanotubes. Evaluating the critical point of crystals, composites, and solid-state materials, such as aluminum. Predicting and modelling different phases and phase transitions, such as in the amorphous phase of the phase-change memory material GeSbTe. Studying the combustion of combustibles, such as lignite-water systems. Measuring th

    Read more →
  • Video editing software

    Video editing software

    Video editing software or a video editor is software used for performing the post-production video editing of digital video sequences on a non-linear editing system (NLE). It has replaced traditional flatbed celluloid film editing tools and analog video tape editing machines. Video editing software serves a lot of purposes, such as filmmaking, audio commentary, and general editing of video content. In NLE software, the user manipulates sections of video, images, and audio on a sequence. These clips can be trimmed, cut, and manipulated in many different ways. When editing is finished, the user exports the sequence as a video file. == Components == === Timeline === NLE software is typically based on a timeline interface where sections moving image video recordings, known as clips, are laid out in sequence and played back. The NLE offers a range of tools for trimming, splicing, cutting, and arranging clips across the timeline. Another kind of clip is a text clip, used to add text to a video, such as title screens or movie credits. Audio clips can additionally be mixed together, such as mixing a soundtrack with multiple sound effects. Typically, the timeline is divided into multiple rows on the y-axis for different clips playing simultaneously, whereas the x-axis represents the run time of the video. Effects such as transitions can be performed on each clip, such as a crossfade effect going from one scene to another. === Exporting === Since video editors represent a project with a file format specific to the program, one needs to export the video file in order to publish it. Once a project is complete, the editor can then export to movies in a variety of formats in a context that may range from broadcast tape formats to compressed video files for web publishing (such as on an online video platform or personal website), optical media, or saved to mobile devices. To facilitate editing, source video typically has a higher resolution than the desired output. Therefore, higher resolution video needs to be downscaled during exporting, or after exporting in a process known as transsizing. === Visual effects === As digital video editing advanced, visual effects became possible, and is part of the standard toolkit, usually found in prosumer and professional grade software. A common ability is to do compositing techniques such as chroma keying or luma keying, among others, which allow different objects to look as if they are in the same scene. A different kind of visual effects is motion capture. Software such as Blender can perform motion capture to make animated objects follow an actor's movements. === Additional features === Most professional video editors are able to do color grading, which is to manipulate visual attributes of a video such as contrast to enhance output, and improve emotional impact. Some video editors such as iMovie include stock footage available for use. == Hardware requirements == As video editing puts great demands on storage and graphics performance, especially at high resolutions such as 4K, and for videos with many visual effects, powerful hardware is often required. It is not uncommon for a computer built for video editing to have a lot of drive capacity, and a powerful graphics processing unit, which optimally has hardware accelerated video encoding. Having sufficient disk space is important since videos can take up large amounts of storage, depending on the resolution and compression format used. Each minute of a Full HD (1080p) video at 30 fps takes up 60MB of space. When visual effects are used, a server farm can be employed to speed up the rendering process. == Examples == Video editing software can be divided into consumer grade, which focuses on ease-of-use, along with professional grade software, which focuses on feature availability, and advanced editing techniques. The typical use case for the former is to edit personal videos on the go, when more advanced editing is not required. === Consumer grade === Photos (Apple) Google Photos YouTube Create === Prosumer grade === ==== Proprietary software ==== iMovie CyberLink PowerDirector === Professional grade === ==== Proprietary software ==== Final Cut Pro Adobe Premiere Pro DaVinci Resolve Vegas Pro Lightworks Camtasia Media Composer ==== Free and open source software ==== Avidemux Blender Cinelerra Flowblade Kdenlive OpenShot Shotcut While most video editing software has been separate from the operating systems, some operating systems have had a video editor installed by default, such as Windows Movie Maker in Windows XP, or as a component of the default photo viewer, such as the Photos app on iOS. Some social media platforms, such as TikTok and Instagram may include a rudimentary video editor to trim clips.

    Read more →
  • World Congress of Universal Documentation

    World Congress of Universal Documentation

    The World Congress of Universal Documentation was held from 16 to 21 August 1937 in Paris, France. Delegates from 45 countries met to discuss means by which all of the world's information, in print, in manuscript, and in other forms, could be efficiently organized and made accessible. == The Congress in the history of information science == The Congress, held at the Trocadéro under "the auspices" of the Institut International de Bibliographie, was "the apotheosis" of a general movement in the 1930s towards the classification of the growing mass of information and the improvement of access to that information. For the first time in the history of information science, technological means were beginning to catch up with theoretical ends, and the discussions at the conference reflected that fact. Its participation in the Congress was one of the first projects of the American Documentation Institute (ADI). Participants in the conference discussed what has been more recently called "a continuously updated hypertext encyclopedia." Joseph Reagle sees many of the ideas considered at the conference as forerunners of some of the key goals and norms of Wikipedia. == Microfilm == The main resolution adopted by the congress proposed that microfilm be used to make information universally available. Watson Davis, chairman of the American delegation and president of the ADI, stated that the volume of information being produced created difficult problems of access and preservation, but that these could be solved by the use of microfilm. In his address to the Congress, Davis said: Most immediate and practical to put into operation is the microfilming of material in libraries upon demand. It will become fashionable and economical to send a potential book borrower a little strip of microfilm for his permanent possession instead of the book and then badgering him to return it before he has had a chance to use it effectively. I believe that reading machines for microfilm will become as common as typewriters in studies and laboratories. If the principal libraries and information centers of the world will cooperate in such "bibliofilm services," as they are called, if they exchange orders and have essentially uniform methods, forms for ordering, standard microfilm format and production methods and comparable if not uniform prices, the resources of any library will be placed at the disposal of any scholar or scientist anywhere in the world. All the libraries cooperating will merge into one world library without loss of identity or individuality. The world's documentation will become available to even the most isolated and individualistic scholar. The Congress included two separate exhibits on microfilm. One was of the equipment used at the Bibliothèque nationale de France and the other, coordinated by Herman H. Fussler of the University of Chicago, consisting of "an entire microfilm laboratory," complete with cameras, a darkroom, and various kinds of reading machines. Emanuel Goldberg presented a paper on an early copying camera he had invented. Other resolutions passed by the Congress concerned uniform standards for the preparation of articles, for classifying books and other documents, for indexing newspapers and periodicals, and for cooperation between libraries. == H. G. Wells == In his address to the Congress, H. G. Wells said that he thought that his idea of the "world brain" was a precursor to the ideas other delegates were proposing, and explicitly linked the projects being discussed to the work of the encyclopédistes: I am speaking of a process of mental organization throughout the world which I believe to be as inevitable as anything can be in human affairs. All the distresses and horrors of the present time are fundamentally intellectual. The world has to pull its mind together, and this [Congress] is the beginning of its efforts. Civilization is a Phoenix. It perishes in flames and even as it dies it is born again. This synthesis of knowledge upon which you are working is the necessary beginning of a new world. It is good to be meeting here in Paris where the first encyclopedia of power was made. It would be impossible to overrate our debt to Diderot and his associates. == Other participants == Participants in the Congress included authors, librarians, scholars, archivists, scientists, and editors. Some of the notable people in attendance not mentioned above were:

    Read more →
  • Geospatial metadata

    Geospatial metadata

    Geospatial metadata (also geographic metadata) is a type of metadata applicable to geographic data and information. Such objects may be stored in a geographic information system (GIS) or may simply be documents, data-sets, images or other objects, services, or related items that exist in some other native environment but whose features may be appropriate to describe in a (geographic) metadata catalog (may also be known as a data directory or data inventory). == Definition == ISO 19115:2013 "Geographic Information – Metadata" from ISO/TC 211, the industry standard for geospatial metadata, describes its scope as follows: [This standard] provides information about the identification, the extent, the quality, the spatial and temporal aspects, the content, the spatial reference, the portrayal, distribution, and other properties of digital geographic data and services. ISO 19115:2013 also provides for non-digital mediums: Though this part of ISO 19115 is applicable to digital data and services, its principles can be extended to many other types of resources such as maps, charts, and textual documents as well as non-geographic data. The U.S. Federal Geographic Data Committee (FGDC) describes geospatial metadata as follows: A metadata record is a file of information, usually presented as an XML document, which captures the basic characteristics of a data or information resource. It represents the who, what, when, where, why and how of the resource. Geospatial metadata commonly document geographic digital data such as Geographic Information System (GIS) files, geospatial databases, and earth imagery but can also be used to document geospatial resources including data catalogs, mapping applications, data models and related websites. Metadata records include core library catalog elements such as Title, Abstract, and Publication Data; geographic elements such as Geographic Extent and Projection Information; and database elements such as Attribute Label Definitions and Attribute Domain Values. == History == The growing appreciation of the value of geospatial metadata through the 1980s and 1990s led to the development of a number of initiatives to collect metadata according to a variety of formats either within agencies, communities of practice, or countries/groups of countries. For example, NASA's "DIF" metadata format was developed during an Earth Science and Applications Data Systems Workshop in 1987, and formally approved for adoption in 1988. Similarly, the U.S. FGDC developed its geospatial metadata standard over the period 1992–1994. The Spatial Information Council of Australia and New Zealand (ANZLIC), a combined body representing spatial data interests in Australia and New Zealand, released version 1 of its "metadata guidelines" in 1996. ISO/TC 211 undertook the task of harmonizing the range of formal and de facto standards over the approximate period 1999–2002, resulting in the release of ISO 19115 "Geographic Information – Metadata" in 2003 and a subsequent revision in 2013. As of 2011 individual countries, communities of practice, agencies, etc. have started re-casting their previously used metadata standards as "profiles" or recommended subsets of ISO 19115, occasionally with the inclusion of additional metadata elements as formal extensions to the ISO standard. The growth in popularity of Internet technologies and data formats, such as Extensible Markup Language (XML) during the 1990s led to the development of mechanisms for exchanging geographic metadata on the web. In 2004, the Open Geospatial Consortium released the current version (3.1) of Geography Markup Language (GML), an XML grammar for expressing geospatial features and corresponding metadata. With the growth of the Semantic Web in the 2000s, the geospatial community has begun to develop ontologies for representing semantic geospatial metadata. Some examples include the Hydrology and Administrative ontologies developed by the Ordnance Survey in the United Kingdom. == ISO 19115: Geographic information – Metadata == ISO 19115 is a standard of the International Organization for Standardization (ISO). The standard is part of the ISO geographic information suite of standards (19100 series). ISO 19115 and its parts define how to describe geographical information and associated services, including contents, spatial-temporal purchases, data quality, access and rights to use. The objective of this International Standard is to provide a clear procedure for the description of digital geographic data-sets so that users will be able to determine whether the data in a holding will be of use to them and how to access the data. By establishing a common set of metadata terminology, definitions and extension procedures, this standard promotes the proper use and effective retrieval of geographic data. ISO 19115 was revised in 2013 to accommodate growing use of the internet for metadata management, as well as add many new categories of metadata elements (referred to as codelists) and the ability to limit the extent of metadata use temporally or by user. == ISO 19139 Geographic information Metadata XML schema implementation == ISO 19139:2012 provides the XML implementation schema for ISO 19115 specifying the metadata record format and may be used to describe, validate, and exchange geospatial metadata prepared in XML. The standard is part of the ISO geographic information suite of standards (19100 series), and provides a spatial metadata XML (spatial metadata eXtensible Mark-up Language (smXML)) encoding, an XML schema implementation derived from ISO 19115, Geographic information – Metadata. The metadata includes information about the identification, constraint, extent, quality, spatial and temporal reference, distribution, lineage, and maintenance of the digital geographic data-set. == Metadata directories == Also known as metadata catalogues or data directories. (need discussion of, and subsections on GCMD, FGDC metadata gateway, ASDD, European and Canadian initiatives, etc. etc.) GIS Inventory – National GIS Inventory System which is maintained by the US-based National States Geographic Information Council (NSGIC) as a tool for the entire US GIS Community. Its primary purpose is to track data availability and the status of geographic information system (GIS) implementation in state and local governments to aid the planning and building of statewide spatial data infrastructures (SSDI). The Random Access Metadata for Online Nationwide Assessment (RAMONA) database is a critical component of the GIS Inventory. RAMONA moves its FGDC-compliant metadata (CSDGM Standard) for each data layer to a web folder and a Catalog Service for the Web (CSW) that can be harvested by Federal programs and others. This provides far greater opportunities for discovery of user information. The GIS Inventory website was originally created in 2006 by NSGIC under award NA04NOS4730011 from the Coastal Services Center, National Oceanic and Atmospheric Administration, U.S. Department of Commerce. The Department of Homeland Security has been the principal funding source since 2008 and they supported the development of the Version 5 during 2011/2012 under Order Number HSHQDC-11-P-00177. The Federal Emergency Management Agency and National Oceanic and Atmospheric Administration have provided additional resources to maintain and improve the GIS Inventory. Some US Federal programs require submission of CSDGM-Compliant Metadata for data created under grants and contracts that they issue. The GIS Inventory provides a very simple interface to create the required Metadata. GCMD - Global Change Master Directory's goal is to enable users to locate and obtain access to Earth science data sets and services relevant to global change and Earth science research. The GCMD database holds more than 20,000 descriptions of Earth science data sets and services covering all aspects of Earth and environmental sciences. ECHO - The EOS Clearing House (ECHO) is a spatial and temporal metadata registry, service registry, and order broker. It allows users to more efficiently search and access data and services through the Reverb Client or Application Programmer Interfaces (APIs). ECHO stores metadata from a variety of science disciplines and domains, totalling over 3400 Earth science data sets and over 118 million granule records. GoGeo - GoGeo is a service run by EDINA (University of Edinburgh) and is supported by Jisc. GoGeo allows users to conduct geographically targeted searches to discover geospatial datasets. GoGeo searches many data portals from the HE and FE community and beyond. GoGeo also allows users to create standards compliant metadata through its Geodoc metadata editor. == Geospatial metadata tools == There are many proprietary GIS or geospatial products that support metadata viewing and editing on GIS resources. For example, ESRI's ArcGIS Desktop, SOCET GXP, Autodesk's AutoCAD Map 3D 2008, Arcitecta's Mediaflux and Intergraph's Geo

    Read more →
  • Data management plan

    Data management plan

    A data management plan or DMP is a formal document that outlines how data are to be handled both during a research project, and after the project is completed. The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation, and analysis before the project begins; this may lead to data being well-managed in the present, and prepared for preservation in the future. DMPs were originally used in 1966 to manage aeronautical and engineering projects' data collection and analysis, and expanded across engineering and scientific disciplines in the 1970s and 1980s. Up until the early 2000s, DMPs were used "for projects of great technical complexity, and for limited mid-study data collection and processing purposes". In the 2000s and later, E-research and economic policies drove the development and uptake of DMPs. == Importance == Preparing a data management plan before data are collected is claimed to ensure that data are in the correct format, organized well, and better annotated. This could arguably save time in the long term because there is no need to re-organize, re-format, or try to remember details about data. It is also claimed to increase research efficiency since both the data collector and other researchers might be able to understand and use well-annotated data in the future. One component of a data management plan is data archiving and preservation. By deciding on an archive ahead of time, the data collector can format data during collection to make its future submission to a database easier. If data are preserved, they are more relevant since they can be re-used by other researchers. It also allows the data collector to direct requests for data to the database, rather than address requests individually. A frequent argument in favor of preservation is that data that are preserved have the potential to lead to new, unanticipated discoveries, and they prevent duplication of scientific studies that have already been conducted. Data archiving also provides insurance against loss by the data collector. In the 2010s, funding agencies increasingly required data management plans as part of the proposal and evaluation process, despite little or no evidence of their efficacy. == Major components == "There is no general and definitive list of topics that should be covered in a DMP for a research project", and researchers are often left to their own devices as to how to fill out a DMP. === Information about data and data format === A description of data to be produced by the project. This might include (but is not limited to) data that are: Experimental Observational Raw or derived Physical collections Models Simulations Curriculum materials Software Images How will the data be acquired? When and where will they be acquired? After collection, how will the data be processed? Include information about Software used Algorithms Scientific workflows File formats that will be used, justify those formats, and describe the naming conventions used. Quality assurance & quality control measures that will be taken during sample collection, analysis, and processing. If existing data are used, what are their origins? How will the data collected be combined with existing data? What is the relationship between the data collected and existing data? How will the data be managed in the short-term? Consider the following: Version control for files Backing up data and data products Security & protection of data and data products Who will be responsible for management === Metadata content and format === Metadata are the contextual details, including any information important for using data. This may include descriptions of temporal and spatial details, instruments, parameters, units, files, etc. Metadata is commonly referred to as "data about data". Issues to be considered include: How detailed has the metadata to be in order to make the data meaningful? How will the metadata be created and/or captured? Examples include lab notebooks, GPS hand-held units, Auto-saved files on instruments, etc. What format will be used for the metadata? What are the metadata standards commonly used in the respective scientific discipline? There should be justification for the format chosen. === Policies for access, sharing, and re-use === Describe any obligations that exist for sharing data collected. These may include obligations from funding agencies, institutions, other professional organizations, and legal requirements. Include information about how data will be shared, including when the data will be accessible, how long the data will be available, how access can be gained, and any rights that the data collector reserves for using data. Address any ethical or privacy issues with data sharing Address intellectual property & copyright issues. Who owns the copyright? What are the institutional, publisher, and/or funding agency policies associated with intellectual property? Are there embargoes for political, commercial, or patent reasons? Describe the intended future uses/users for the data Indicate how the data should be cited by others. How will the issue of persistent citation be addressed? For example, if the data will be deposited in a public archive, will the dataset have a persistent identifier (e.g., ARK, DOI, Handle, PURL, URN) assigned to it? === Long-term storage and data management === Researchers should identify an appropriate archive for the long-term preservation of their data. By identifying the archive early in the project, the data can be formatted, transformed, and documented appropriately to meet the requirements of the archive. Researchers should consult colleagues and professional societies in their discipline to determine the most appropriate database, and include a backup archive in their data management plan in case their first choice goes out of existence. Early in the project, the primary researcher should identify what data will be preserved in an archive. Usually, preserving the data in its most raw form is desirable, although data derivatives and products can also be preserved. An individual should be identified as the primary contact person for archived data, and ensure contact information is always kept up-to-date in case there are requests for data or information about data. === Budget === Data management and preservation costs may be considerable, depending on the nature of the project. By anticipating costs ahead of time, researchers ensure that the data will be properly managed and archived. Potential expenses that should be considered are Human resources and staff as they handle data preparation, management, documentation, and preservation Hardware and/or software needed for data management, backing up, security, documentation, and preservation Costs associated with submitting the data to an archive The data management plan should include how these costs will be paid. == NSF Data Management Plan == All grant proposals submitted to National Science Foundation (NSF) must include a Data Management Plan that is no more than two pages. This is a supplement (not part of the 15-page proposal) and should describe how the proposal will conform to the Award and Administration Guide policy (see below). It may include the following: The types of data The standards to be used for data and metadata format and content Policies for access and sharing Policies and provisions for re-use Plans for archiving data Policy summarized from the NSF Award and Administration Guide, Section 4 (Dissemination and Sharing of Research Results): Promptly publish with appropriate authorship Share data, samples, physical collections, and supporting materials with others, within a reasonable time frame Share software and inventions Investigators can keep their legal rights over their intellectual property, but they still have to make their results, data, and collections available to others Policies will be implemented via Proposal review Award negotiations and conditions Support/incentives == ESRC Data Management Plan == Since 1995, the UK's Economic and Social Research Council (ESRC) have had a research data policy in place. The current ESRC Research Data Policy states that research data created as a result of ESRC-funded research should be openly available to the scientific community to the maximum extent possible, through long-term preservation and high-quality data management. ESRC requires a data management plan for all research award applications where new data are being created. Such plans are designed to promote a structured approach to data management throughout the data lifecycle, resulting in better quality data that is ready to archive for sharing and re-use. The UK Data Service, the ESRC's flagship data service, provides practical guidance on research data management planning suitable for social science researchers in the UK and around the world. ESRC has a longstanding arrangement with the UK Data A

    Read more →
  • Zeuthen strategy

    Zeuthen strategy

    The Zeuthen strategy in cognitive science is a negotiation strategy used by some artificial agents. Its purpose is to measure the willingness to risk conflict. An agent will be more willing to risk conflict if it does not have much to lose in case that the negotiation fails. In contrast, an agent is less willing to risk conflict when it has more to lose. The value of a deal is expressed in its utility. An agent has much to lose when the difference between the utility of its current proposal and the conflict deal is high. When both agents use the monotonic concession protocol, the Zeuthen strategy leads them to agree upon a deal in the negotiation set. This set consists of all conflict free deals, which are individually rational and Pareto optimal, and the conflict deal, which maximizes the Nash product. The strategy was introduced in 1930 by the Danish economist Frederik Zeuthen. == Three key questions == The Zeuthen strategy answers three open questions that arise when using the monotonic concession protocol, namely: Which deal should be proposed at first? On any given round, who should concede? In case of a concession, how much should the agent concede? The answer to the first question is that any agent should start with its most preferred deal, because that deal has the highest utility for that agent. The second answer is that the agent with the smallest value of Risk(i,t) concedes, because the agent with the lowest utility for the conflict deal profits most from avoiding conflict. To the third question, the Zeuthen strategy suggests that the conceding agent should concede just enough raise its value of Risk(i,t) just above that of the other agent. This prevents the conceding agent to have to concede again in the next round. == Risk == Risk ( i , t ) = { 1 U i ( δ ( i , t ) ) = 0 U i ( δ ( i , t ) ) − U i ( δ ( j , t ) ) U i ( δ ( i , t ) ) otherwise {\displaystyle {\text{Risk}}(i,t)={\begin{cases}1&U_{i}(\delta (i,t))=0\\{\frac {U_{i}(\delta (i,t))-U_{i}(\delta (j,t))}{U_{i}(\delta (i,t))}}&{\text{otherwise}}\end{cases}}} Risk(i,t) is a measurement of agent i's willingness to risk conflict. The risk function formalizes the notion that an agent's willingness to risk conflict is the ratio of the utility that agent would lose by accepting the other agent's proposal to the utility that agent would lose by causing a conflict. Agent i is said to be using a rational negotiation strategy if at any step t + 1 that agent i sticks to his last proposal, Risk(i,t) > Risk(j,t). == Sufficient concession == If agent i makes a sufficient concession in the next step, then, assuming that agent j is using a rational negotiation strategy, if agent j does not concede in the next step, he must do so in the step after that. The set of all sufficient concessions of agent i at step t is denoted SC(i, t). == Minimal sufficient concession == δ ′ = arg ⁡ max δ ∈ S C ( A , t ) { U A ( δ ) } {\displaystyle \delta '=\arg \max _{\delta \in {SC(A,t)}}\{U_{A}(\delta )\}} is the minimal sufficient concession of agent A in step t. Agent A begins the negotiation by proposing δ ( A , 0 ) = arg ⁡ max δ ∈ N S U A ( δ ) {\displaystyle \delta (A,0)=\arg \max _{\delta \in {NS}}U_{A}(\delta )} and will make the minimal sufficient concession in step t + 1 if and only if Risk(A,t) ≤ Risk(B,t). Theorem If both agents are using Zeuthen strategies, then they will agree on δ = arg ⁡ max δ ′ ∈ N S { π ( δ ′ ) } , {\displaystyle \delta =\arg \max _{\delta '\in {NS}}\{\pi (\delta ')\},} that is, the deal which maximizes the Nash product. Proof Let δA = δ(A,t). Let δB = δ(B,t). According to the Zeuthen strategy, agent A will concede at step t {\displaystyle t} if and only if R i s k ( A , t ) ≤ R i s k ( B , t ) . {\displaystyle Risk(A,t)\leq Risk(B,t).} That is, if and only if U A ( δ A ) − U A ( δ B ) U A ( δ A ) ≤ U B ( δ B ) − U B ( δ A ) U B ( δ B ) {\displaystyle {\frac {U_{A}(\delta _{A})-U_{A}(\delta _{B})}{U_{A}(\delta _{A})}}\leq {\frac {U_{B}(\delta _{B})-U_{B}(\delta _{A})}{U_{B}(\delta _{B})}}} U B ( δ B ) ( U A ( δ A ) − U A ( δ B ) ) ≤ U A ( δ A ) ( U B ( δ B ) − U B ( δ A ) ) {\displaystyle U_{B}(\delta _{B})(U_{A}(\delta _{A})-U_{A}(\delta _{B}))\leq U_{A}(\delta _{A})(U_{B}(\delta _{B})-U_{B}(\delta _{A}))} U A ( δ A ) U B ( δ B ) − U A ( δ B ) U B ( δ B ) ≤ U A ( δ A ) U B ( δ B ) − U A ( δ A ) U B ( δ A ) {\displaystyle U_{A}(\delta _{A})U_{B}(\delta _{B})-U_{A}(\delta _{B})U_{B}(\delta _{B})\leq U_{A}(\delta _{A})U_{B}(\delta _{B})-U_{A}(\delta _{A})U_{B}(\delta _{A})} − U A ( δ B ) U B ( δ B ) ≤ − U A ( δ A ) U B ( δ A ) {\displaystyle -U_{A}(\delta _{B})U_{B}(\delta _{B})\leq -U_{A}(\delta _{A})U_{B}(\delta _{A})} U A ( δ A ) U B ( δ A ) ≤ U A ( δ B ) U B ( δ B ) {\displaystyle U_{A}(\delta _{A})U_{B}(\delta _{A})\leq U_{A}(\delta _{B})U_{B}(\delta _{B})} π ( δ A ) ≤ π ( δ B ) {\displaystyle \pi (\delta _{A})\leq \pi (\delta _{B})} Thus, Agent A will concede if and only if δ A {\displaystyle \delta _{A}} does not yield the larger product of utilities. Therefore, the Zeuthen strategy guarantees a final agreement that maximizes the Nash Product.

    Read more →
  • CENDI

    CENDI

    CENDI (Commerce, Energy, NASA, Defense Information Managers Group) is an interagency group of senior Scientific and Technical Information (STI) managers from 14 United States federal agencies. CENDI managers cooperate by exchanging information and ideas, collaborating to address common issues, and undertaking joint initiatives. CENDI's accomplishments range from impacting federal information policy to educating a broad spectrum of stakeholders on all aspects of federal STI systems, including its value to research and the taxpayer, and to operational improvements in agency and interagency STI operations. == History == CENDI traces its roots to the Committee on Scientific and Technical Information (COSATI) of the Federal Council on Science and Technology. COSATI was established in the early 1960s to coordinate the management of the results from the U.S. government's increasing commitment to scientific research and technology development. The scientific and technical information (STI) managers of the government's major research and development (R&D) agencies worked within COSATI to standardize guidelines for cataloging and indexing technical reports. COSATI ceased formal operations in the early 1970s. To continue the cooperation begun under COSATI, managers of agency STI programs from Commerce (National Technical Information Service), Energy (Office of Scientific and Technical Information), NASA (HQ/STI Division), and Defense (Defense Technical Information Center) began meeting periodically to discuss common topics and stimulate more effective cooperation. In 1985, a Memorandum of Understanding was signed by the four charter agencies and CENDI was established. From this small core of STI managers, CENDI has grown to its current membership, which represents the major science agencies, the national libraries, and agencies involved in the dissemination and long-term management of scientific and technical information. The vision of CENDI is to facilitate cooperative enterprise where capabilities are shared and challenges are faced together so that the sum of the accomplishments is greater than each individual agency can achieve on its own amongst federal STI agencies. The abbreviation CENDI refers to the "Commerce, Energy, NASA, Defense Information Managers Group". == Membership == New members from other federal R&D information organizations may be admitted by unanimous agreement of the members. However, it is the intent of the group that membership in CENDI should remain small and focus on organizations with STI or supporting responsibilities. Each agency provides funding to CENDI. == Members == The members of CENDI are: Defense Technical Information Center (United States Department of Defense) Office of Research and Development and Office of Environmental Information (United States Environmental Protection Agency) Government Printing Office Library of Congress NASA Scientific and Technical Information Program National Agricultural Library (United States Department of Agriculture) National Archives and Records Administration National Library of Education (United States Department of Education) National Library of Medicine (United States Department of Health and Human Services) National Science Foundation National Technical Information Service (United States Department of Commerce) National Transportation Library (United States Department of Transportation) Office of Scientific and Technical Information (United States Department of Energy) USGS/Biological Resources Discipline (United States Department of the Interior) == Mission and operation == CENDI's mission is to help improve the productivity of federal science- and technology-based programs through effective scientific, technical, and related information support systems. In fulfilling its mission, CENDI agencies play an important role in addressing science- and technology-based national priorities and strengthening U.S. competitiveness. === Goals === STI Coordination and Leadership: Provide coordination and leadership for information exchange on important STI policy issues. Improvement of STI Systems: Promote the development of improved STI systems through the productive interrelationship of content and technology. STI Understanding: Promote better understanding of STI and STI management. === Principals and Alternates === CENDI is made up of senior federal STI managers and each organization appoints a Principal representative. This person is the point of contact for that organization within CENDI. Each Principal has an Alternate. The Principals and Alternates comprise the main group that meets on a regular basis, usually every other month. === Secretariat === A Tennessee-based information management company, -- Information International Associates, Inc., currently serves as the CENDI Secretariat. The Secretariat provides day-to-day operations to CENDI. The Secretariat prepares the necessary materials for the Principals' meetings, provides support for the working group and task group meetings, assists in developing papers, and maintains the CENDI files and outreach tools. === Task Groups and Working Groups === The chair(s) of a working group is appointed by the Principals and has the overall responsibility for the group's activities. The Secretariat provides support at the request of the Working Group chair(s). The Working Groups and Task Groups that are currently operating are: Copyright and Intellectual Property Working Group Distribution Markings Task Group Digital Preservation Task Group Digitization Specifications Task Group Image Metadata Task Group Science.gov (see below) STI Policy Working Group Terminology Resources Task Group === Science.gov and Worldwidescience.org === In 2001, in response to the April 2001 workshop on "Strengthening the Public Information Infrastructure for Science", and taking into consideration a request from Firstgov (now USA.gov) to develop specialized topical portals, CENDI formed an alliance to develop an interagency website for access to STI. This website, called Science.gov, is a one-stop source of STI, including both selected, authoritative government websites and deep Web databases of technical reports, journal articles, conference proceedings, and other published materials. Through the volunteer efforts of members and involving over 100 staff, content and architecture is developed for the site. The Science.gov website is hosted by the Department of Energy (DOE) Office of Scientific and Technical Information (OSTI). The site was formally launched in December 2002. As a result of the success of Science.gov, under DOE leadership and in cooperation with the International Council of Scientific and Technical Information, a worldwide coordination across national portals called WorldWideScience was launched in 2008. === Work with non-member organizations === CENDI works with several cooperating non-member organizations on a regular basis. These agencies are in academia, federal government, legal and policy analysis, international, non-governmental, and private organizations.

    Read more →
  • Record linkage

    Record linkage

    Record linkage (also known as data matching, data linkage, entity resolution, and many other terms) is the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, and databases). Record linkage is necessary when joining different data sets based on entities that may or may not share a common identifier (e.g., database key, URI, National identification number), which may be due to differences in record shape, storage location, or curator style or preference. A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. == Naming conventions == "Record linkage" is the term used by statisticians, epidemiologists, and historians, among others, to describe the process of joining records from one data source with another that describe the same entity. However, many other terms are used for this process. Unfortunately, this profusion of terminology has led to few cross-references between these research communities. Computer scientists often refer to it as "data matching" or as the "object identity problem". Commercial mail and database applications refer to it as "merge/purge processing" or "list washing". Other names used to describe the same concept include: "coreference/entity/identity/name/record resolution", "entity disambiguation/linking", "fuzzy matching", "duplicate detection", "deduplication", "record matching", "(reference) reconciliation", "object identification", "data/information integration" and "conflation". While they share similar names, record linkage and linked data are two separate approaches to processing and structuring data. Although both involve identifying matching entities across different data sets, record linkage standardly equates "entities" with human individuals; by contrast, Linked Data is based on the possibility of interlinking any web resource across data sets, using a correspondingly broader concept of identifier, namely a URI. == History == The initial idea of record linkage goes back to Halbert L. Dunn in his 1946 article titled "Record Linkage" published in the American Journal of Public Health. Howard Borden Newcombe then laid the probabilistic foundations of modern record linkage theory in a 1959 article in Science. These were formalized in 1969 by Ivan Fellegi and Alan Sunter, in their pioneering work "A Theory For Record Linkage", where they proved that the probabilistic decision rule they described was optimal when the comparison attributes were conditionally independent. In their work they recognized the growing interest in applying advances in computing and automation to large collections of administrative data, and the Fellegi-Sunter theory remains the mathematical foundation for many record linkage applications. Since the late 1990s, various machine learning techniques have been developed that can, under favorable conditions, be used to estimate the conditional probabilities required by the Fellegi-Sunter theory. Several researchers have reported that the conditional independence assumption of the Fellegi-Sunter algorithm is often violated in practice; however, published efforts to explicitly model the conditional dependencies among the comparison attributes have not resulted in an improvement in record linkage quality. On the other hand, machine learning or neural network algorithms that do not rely on these assumptions often provide far higher accuracy, when sufficient labeled training data is available. Record linkage can be done entirely without the aid of a computer, but the primary reasons computers are often used to complete record linkages are to reduce or eliminate manual review and to make results more easily reproducible. Computer matching has the advantages of allowing central supervision of processing, better quality control, speed, consistency, and better reproducibility of results. == Methods == === Data preprocessing === Record linkage is highly sensitive to the quality of the data being linked, so all data sets under consideration (particularly their key identifier fields) should ideally undergo a data quality assessment before record linkage. Many key identifiers for the same entity can be presented quite differently between (and even within) data sets, which can greatly complicate record linkage unless understood ahead of time. For example, key identifiers for a man named William J. Smith might appear in three different data sets as follows: In this example, the different formatting styles lead to records that look different but in fact all refer to the same entity with the same logical identifier values. Most, if not all, record linkage strategies would result in more accurate linkage if these values were first normalized or standardized into a consistent format (e.g., all names are "Surname, Given name", and all dates are "YYYY/MM/DD"). Standardization can be accomplished through simple rule-based data transformations or more complex procedures such as lexicon-based tokenization and probabilistic hidden Markov models. Several of the packages listed in the Software Implementations section provide some of these features to simplify the process of data standardization. === Entity resolution === Entity resolution is an operational intelligence process, typically powered by an entity resolution engine or middleware, whereby organizations can connect disparate data sources with a view to understand possible entity matches and non-obvious relationships across multiple data silos. It analyzes all of the information relating to individuals and/or entities from multiple sources of data, and then applies likelihood and probability scoring to determine which identities are a match and what, if any, non-obvious relationships exist between those identities. Entity resolution engines are typically used to uncover risk, fraud, and conflicts of interest, but are also useful tools for use within customer data integration (CDI) and master data management (MDM) requirements. Typical uses for entity resolution engines include terrorist screening, insurance fraud detection, USA Patriot Act compliance, organized retail crime ring detection and applicant screening. For example, across different data silos – employee records, vendor data, watch lists, etc. – an organization may have several variations of an entity named ABC, which may or may not be the same individual. These entries may, in fact, appear as ABC1, ABC2, or ABC3 within those data sources. By comparing similarities between underlying attributes such as address, date of birth, or social security number, the user can eliminate some possible matches and confirm others as very likely matches. Entity resolution engines then apply rules, based on common sense logic, to identify hidden relationships across the data. In the example above, perhaps ABC1 and ABC2 are not the same individual, but rather two distinct people who share common attributes such as address or phone number. ==== Data matching ==== While entity resolution solutions include data matching technology, many data matching offerings do not fit the definition of entity resolution. Here are four factors that distinguish entity resolution from data matching, according to John Talburt, director of the UALR Center for Advanced Research in Entity Resolution and Information Quality: Works with both structured and unstructured records, and it entails the process of extracting references when the sources are unstructured or semi-structured Uses elaborate business rules and concept models to deal with missing, conflicting, and corrupted information Utilizes non-matching, asserted linking (associate) information in addition to direct matching Uncovers non-obvious relationships and association networks (i.e. who's associated with whom) In contrast to data quality products, more powerful identity resolution engines also include a rules engine and workflow process, which apply business intelligence to the resolved identities and their relationships. These advanced technologies make automated decisions and impact business processes in real time, limiting the need for human intervention. === Deterministic record linkage === The simplest kind of record linkage, called deterministic or rules-based record linkage, generates links based on the number of individual identifiers that match among the available data sets. Two records are said to match via a deterministic record linkage procedure if all or some identifiers (above a certain threshold) are identical. Deterministic record linkage is a good option when the entities in the data sets are identified by a common identifier, or when there are several representative identifiers (e.g., name, date of birth, and sex when identifying a person) whose quality of data is relatively high. As an example, consider two standardized data sets, Set A and Set B, that contain different bits of information about patients in a hospital system. T

    Read more →
  • Very large database

    Very large database

    A very large database, (originally written very large data base) or VLDB, is a database that contains a very large amount of data, so much that it can require specialized architectural, management, processing and maintenance methodologies. == Definition == The vague adjectives of very and large allow for a broad and subjective interpretation, but attempts at defining a metric and threshold have been made. Early metrics were the size of the database in a canonical form via database normalization or the time for a full database operation like a backup. Technology improvements have continually changed what is considered very large. One definition has suggested that a database has become a VLDB when it is "too large to be maintained within the window of opportunity… the time when the database is quiet". == Sizes of a VLDB database == There is no absolute amount of data that can be cited. For example, one cannot say that any database with more than 1 TB of data is considered a VLDB. This absolute amount of data has varied over time as computer processing, storage and backup methods have become better able to handle larger amounts of data. That said, VLDB issues may start to appear when 1 TB is approached, and are more than likely to have appeared as 30 TB or so is exceeded. == VLDB challenges == Key areas where a VLDB may present challenges include configuration, storage, performance, maintenance, administration, availability and server resources. === Configuration === Careful configuration of databases that lie in the VLDB realm is necessary to alleviate or reduce issues raised by VLDB databases. === Administration === The complexities of managing a VLDB can increase exponentially for the database administrator as database size increases. === Availability and maintenance === When dealing with VLDB operations relating to maintenance and recovery such as database reorganizations and file copies which were quite practical on a non-VLDB take very significant amounts of time and resources for a VLDB database. In particular it typically infeasible to meet a typical recovery time objective (RTO), the maximum expected time a database is expected to be unavailable due to interruption, by methods which involve copying files from disk or other storage archives. To overcome these issues techniques such as clustering, cloned/replicated/standby databases, file-snapshots, storage snapshots or a backup manager may help achieve the RTO and availability, although individual methods may have limitations, caveats, license, and infrastructure requirements while some may risk data loss and not meet the recovery point objective (RPO). For many systems only geographically remote solutions may be acceptable. ==== Backup and recovery ==== Best practice is for backup and recovery to be architectured in terms of the overall availability and business continuity solution. === Performance === Given the same infrastructure there may typically be a decrease in performance, that is increase in response time as database size increases. Some accesses will simply have more data to process (scan) which will take proportionally longer (linear time); while the indexes used to access data may grow slightly in height requiring perhaps an extra storage access to reach the data (sub-linear time). Other effects can be caching becoming less efficient because proportionally less data can be cached and while some indexes such as the B+ automatically sustain well with growth others such as a hash table may need to be rebuilt. Should an increase in database size cause the number of accessors of the database to increase then more server and network resources may be consumed, and the risk of contention will increase. Some solutions to regaining performance include partitioning, clustering, possibly with sharding, or use of a database machine. ==== Partitioning ==== Partitioning may be able assist the performance of bulk operations on a VLDB including backup and recovery., bulk movements due to information lifecycle management (ILM), reducing contention as well as allowing optimization of some query processing. === Storage === In order to satisfy needs of a VLDB the database storage needs to have low access latency and contention, high throughput, and high availability. === Server resources === The increasing size of a VLDB may put pressure on server and network resources and a bottleneck may appear that may require infrastructure investment to resolve. == Relationship to big data == VLDB is not the same as big data, but the storage aspect of big data may involve a VLDB database. That said some of the storage solutions supporting big data were designed from the start to support large volumes of data, so database administrators may not encounter VLDB issues that older versions of traditional RDBMS's might encounter.

    Read more →
  • Networked Help Desk

    Networked Help Desk

    Networked Help Desk is an open standard initiative to provide a common API for sharing customer support tickets between separate instances of issue tracking, bug tracking, customer relationship management (CRM) and project management systems to improve customer service and reduce vendor lock-in. The initiative was created by Zendesk in June 2011 in collaboration with eight other founding member organizations including Atlassian, New Relic, OTRS, Pivotal Tracker, ServiceNow and SugarCRM. The first integration, between Zendesk and Atlassian's issue tracking product, Jira, was announced at the 2011 Atlassian Summit. By August 2011, 34 member companies had joined the initiative. A year after launching, over 50 organizations had joined. Within Zendesk instances this feature is branded as ticket sharing. == Basis == Support tools are generally built around a common paradigm that begins with a customer making a request or an incident report, these create a ticket. Each ticket has a progress status and is updated with annotations and attachments. These annotations and attachments may be visible to the customer (public), or only visible to analysts (private). Customers are notified of progress made on their ticket until it is complete. If the people necessary to complete a ticket are using separate support tools, additional overhead is introduced in maintaining the relevant information in the ticket in each tool while notifying the customer of progress made by each group in completing their ticket. For example, if a customer support issue is caused by a software bug and reported to a help desk using one system, and then the fix is documented by the developers in another, and analyzed in a customer relationship management tool, keeping the records in each system up-to-date and notifying the customer manually using a swivel chair approach is unnecessarily time-consuming and error-prone. If information is not transferred correctly, a customer may have to re-explain their problem each time their ticket is transferred. For systems with the Networked Help Desk API implemented, it is possible for several different applications related to a customer's support experience to synchronize data in one uniquely identified shared ticket. While many applications in these domains have implemented APIs that allow data to be imported, exported and modified, Network Help Desk provide a common standard for customer support information to automatically synchronize between several systems. Once implemented, two systems can quickly share tickets with just a configuration change as they both understand the same interface. Communication between two instances on a specific ticket occurs in three steps, an invitation agreement, sharing of ticket data and continued synchronization of tickets. The standard allows for "full delegation" (analysts in both systems each make public and private comments and synchronize status) as well as "partial delegation" where the instance receiving the ticket can only make private comments and status changes are not synchronized. Tickets may be shared with multiple instances. == Implementation list ==

    Read more →
  • Basic Formal Ontology

    Basic Formal Ontology

    Basic Formal Ontology (BFO) is a top-level ontology developed by Barry Smith and colleagues to promote interoperability among domain ontologies. The BFO methodology accomplishes this through a process of downward population. BFO is a formal ontology. The structure of BFO is based on a division of entities into two disjoint categories of continuant and occurrent, the former consists of objects and spatial regions, the latter contains processes conceived as extended through (or spanning) time. BFO thereby seeks to consolidate both time and space within a single framework A guide to building BFO-conformant domain ontologies was published by MIT Press in 2015. In 2021, the standard ISO/IEC 21838-2:2021 Information Technology — Top-level Ontologies (TLO) — Part 2: Basic Formal Ontology (BFO) was published by the Joint Technical Committee of the International Standards Organization and the International Electrotechnical Commission. ISO/IEC 21838 is a multi-part standard. Part 1 of the standard specifies the requirements that must be met if an ontology is to be classified as a top-level ontology by the standard. == History == BFO arose against the background of research in ontologies in the domain of geospatial information science by David Mark, Pierre Grenon, Achille Varzi and others, with a special role for the study of vagueness and of the ways sharp boundaries in the geospatial and other domains are created by fiat. BFO has passed through four major releases. 2001: release of BFO 1 2007: release of BFO 1.1 2015: release of BFO 2.0 2020: release of BFO 2020 2021: release of BFO 2020 as an ISO/IEC Standard The current revision was released in 2020, and this forms the basis of the standard ISO/IEC 21838-2, which was released by the Joint Committee of the International Standards Organization and International Electrotechnical Commission in 2021. == Applications == BFO has been adopted as a foundational ontology by over 650 ontology projects, principally in the areas of biomedical ontology, security and defense (intelligence) ontology, and industry ontologies. Example applications of BFO can be seen in the Ontology for Biomedical Investigations (OBI). In January 2024, BFO and the Common Core Ontologies (CCO), a suite of BFO-extension ontologies, were adopted as the "baseline standards for formal DOD and IC ontology" development work in the DOD and Intelligence Community. A memorandum to this effect was signed by the chief data officers of the DOD, the Office of the Director of National Intelligence and the Chief Digital and Artificial Intelligence Office.

    Read more →
  • Best arm identification

    Best arm identification

    Best arm identification (BAI) is a sequential one-player game where the player has to find the best action (arm) among a list of actions (arms) by collecting information in the most efficient way. It is a multi-armed bandit game as a player only gets information about an arm by playing it. The most common objective in multi-armed bandit games is to minimize the regret (i.e., play the best action as much as possible), but in BAI, the goal is to find the best arm as efficiently as possible. This problem naturally arises in scenarios such as adaptive clinical trials where the number of patients is limited and the quantification of the confidence in a treatment is important. It also arises in hyperparameter optimization where the goal is to find the optimal choice of hyperparameters for an algorithm with the smallest possible number of experiments, as it can be costly in terms of time, energy, or money. == Stochastic multi-armed bandit == The stochastic multi-armed bandit (MAB) is a sequential game with one player and K {\displaystyle K} actions (arms). Each arm has an unknown probability distribution associated with it. At each turn, the player has to choose one action and receive an observation from the probability distribution associated with the arm. The more you play an arm, the more you get information on its probability distribution. === Best arm identification === In BAI the goal is to find the arm that has the probability distribution with the highest mean. BAI may be either fixed confidence or fixed horizon. In a fixed-confidence game, a confidence level δ {\displaystyle \delta } is fixed at the beginning of the game and the goal is to find the best arm with this confidence level in as few turns as possible. In a fixed horizon game, the number of turns T {\displaystyle T} is fixed, and the goal is to find the best arm with the highest possible confidence in T {\displaystyle T} turns. === Math formalisation === We have one player and K {\displaystyle K} actions (arms). Behind each arm k ∈ { 1 , … , K } {\displaystyle k\in \{1,\ldots ,K\}} lies an unknown distribution ν k {\displaystyle \nu _{k}} with mean μ k {\displaystyle \mu _{k}} . Each distribution ν k {\displaystyle \nu _{k}} belongs to a known family D {\displaystyle {\mathcal {D}}} (such as the set of Gaussian distributions or Bernoulli distributions). At each time step t {\displaystyle t} , the player selects an arm a t {\displaystyle a_{t}} and observes an independent sample X t ∼ ν a t {\displaystyle X_{t}\sim \nu _{a_{t}}} from the corresponding distribution. We will note μ ∗ := max μ a {\displaystyle \mu ^{}:=\max \mu _{a}} the highest mean. An arm a {\displaystyle a} that satisfies μ a = μ ∗ {\displaystyle \mu _{a}=\mu ^{}} is called an optimal arm; otherwise it is called suboptimal arm. In best arm identification (BAI) the objective is to identify an optimal arm. Two main settings for BAI appear in the literature: Fixed confidence: In this setting, one typically assumes that there exists a unique optimal arm. A confidence level δ ∈ ( 0 , 1 ) {\displaystyle \delta \in (0,1)} is specified at the beginning. The algorithm must stop at some finite stopping time τ δ < + ∞ {\displaystyle \tau _{\delta }<+\infty } and return an arm a ^ τ δ {\displaystyle {\hat {a}}_{\tau _{\delta }}} such that the probability of error is bounded: P ( a ^ τ δ ≠ a ∗ ) ≤ δ {\displaystyle \mathbb {P} ({\hat {a}}_{\tau _{\delta }}\neq a^{})\leq \delta } . The objective is to minimize the expected sample complexity E [ τ δ ] {\displaystyle \mathbb {E} [\tau _{\delta }]} . Such a setting appears, for example, when a constraint on the confidence is required (for example, if we require a confidence level of 95%, so δ = 1 − 0.95 = 0.05 {\displaystyle \delta =1-0.95=0.05} ). Fixed horizon: In this setting, the number of samples T {\displaystyle T} is fixed in advance. The goal is to design an algorithm that minimizes the probability of misidentifying the optimal arm: P ( a ^ T ≠ a ∗ ) {\displaystyle \mathbb {P} ({\hat {a}}_{T}\neq a^{})} . This setting appears when the number of experiments is limited (for drug tests, the number of patients can be fixed in advance). === Example of simple modelling === In the case where we have K {\displaystyle K} treatments and we want to be sure with a confidence level of 95% which treatment is the best to heal a specific disease. Each treatment heals or does not heal the disease with a probability μ k {\displaystyle \mu _{k}} , which means that each distribution is a Bernoulli distribution, so D {\displaystyle {\mathcal {D}}} is the set of Bernoulli distributions. We can use a BAI algorithm to minimize E [ τ 0.05 ] {\displaystyle \mathbb {E} [\tau _{0.05}]} , the number of patients required to find the best treatment with probability 95%. == Applications == Best arm identification naturally arises in several practical domains: Adaptive clinical trials: The objective is to identify the most effective treatment based on sequentially collected patient data. Each treatment can be modeled as having an underlying distribution of outcomes. The goal is to identify the treatment with the highest expected outcome with high confidence (fixed confidence setting δ {\displaystyle \delta } ) while minimizing the number of drug test patients (minimise E [ τ δ ] {\displaystyle \mathbb {E} [\tau _{\delta }]} ), as it costs to pay patients for this and we would like to use as little as possible less effective drugs. Hyperparameter tuning: Selecting the best configuration for machine learning models efficiently by treating each hyperparameter setting as an arm. The goal is to find the best hyperparameter with as few experiments possible as experiments are costly in time and in energy == Fixed confidence level == In the fixed-confidence setting, the goal is to design an algorithm that identifies the best arm with a prescribed confidence level δ {\displaystyle \delta } while minimizing the expected number of samples. Any such algorithm requires two key components: Stopping rule: A decision criterion that determines when to stop sampling. Formally, this defines a stopping time τ δ {\displaystyle \tau _{\delta }} and returns an arm a ^ τ δ {\displaystyle {\hat {a}}_{\tau _{\delta }}} such that P ( a ^ τ δ ≠ a ⋆ ) ≤ δ {\displaystyle \mathbb {P} ({\hat {a}}_{\tau _{\delta }}\neq a^{\star })\leq \delta } and P ( τ δ < + ∞ ) = 1 {\displaystyle \mathbb {P} (\tau _{\delta }<+\infty )=1} . Sampling rule: A policy π {\displaystyle \pi } that, at each round t {\displaystyle t} , selects the next arm to sample a t {\displaystyle a_{t}} based on all previous observations ( a s , X s ) s < t {\displaystyle (a_{s},X_{s})_{s Read more →