AI Coding Neovim

AI Coding Neovim — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Oracle Cloud Platform

    Oracle Cloud Platform

    Oracle Cloud Platform refers to a Platform as a Service (PaaS) offerings by Oracle Corporation as part of Oracle Cloud Infrastructure. These offerings are used to build, deploy, integrate and extend applications in the cloud. The offerings support a variety of programming languages, databases, tools and frameworks including Oracle-specific, open source and third-party software and systems. == Deployment models == Oracle Cloud Platform offers public, private and hybrid cloud deployment models. == Architecture == Oracle Cloud Platform provides both Infrastructure as a Service (IaaS) and Platform as a Service (PaaS). The infrastructure is offered through a global network of Oracle managed data centers. Oracle deploys their cloud in Regions. Inside each Region are at least three fault-independent Availability Domains. Each of these Availability Domains contains an independent data center with power, thermal and network isolation. Oracle Cloud is generally available in North America, EMEA, APAC and Japan with announced South America and US Govt. regions coming soon.

    Read more →
  • Overcategorization

    Overcategorization

    Overcategorization or category clutter is a phenomenon during classification where too many categories or classes are assigned to a document, record, or item. Overcategorization is related to the library and information science (LIS) concepts of document classification and subject indexing. It is also related to online shopping where excessive product categories can overwhelm users with too many choices or make it more difficult for customers to find the products they need. Although these categories are intended to improve organization and ease of navigation when shipping online, too many categories can lower customer satisfaction, increase difficulty navigating the online store, and reduce future shopping intentions. In LIS, the ideal number of terms that should be assigned to classify an item are measured by the variables precision and recall. Assigning few category labels that are most closely related to the content of the item being classified will result in searches that have high precision, I.e., where a high proportion of the results are closely related to the query. Assigning more category labels to each item will reduce the precision of each search, but increase the recall, retrieving more relevant results. Related LIS concepts include exhaustivity of indexing and information overload. == Basic principles == If too many categories are assigned to a given document, the implications for users depend on how informative the links are. If the user is able to distinguish between useful and not useful links, the damage is limited: The user only wastes time selecting links. In many cases, however, the user cannot judge whether or not a given link will turn out to be fruitful. In that case he or she has to follow the link and to read or skim another document. The worst case scenario is, of course, that even after reading the new document the user is unable to decide whether or not it might be useful if its subject matter is not thoroughly investigated. Overcategorization also has another unpleasant implication: It makes the system (for example in Wikipedia) difficult to maintain in a consistent way. If the system is inconsistent, it means that when the user considers the links in a given category, he or she will not find all documents relevant to that category. Basically, the problem of overcategorization should be understood from the perspective of relevance and the traditional measures of recall and precision. If too few relevant categories are assigned to a document, recall may decrease. If too many non-relevant categories are assigned, precision becomes lower. The hard job is to say which categories are fruitful or relevant for future use of the document.

    Read more →
  • Algorithm engineering

    Algorithm engineering

    Algorithm engineering focuses on the design, analysis, implementation, optimization, profiling and experimental evaluation of computer algorithms, bridging the gap between algorithmics theory and practical applications of algorithms in software engineering. It is a general methodology for algorithmic research. == Origins == In 1995, a report from an NSF-sponsored workshop "with the purpose of assessing the current goals and directions of the Theory of Computing (TOC) community" identified the slow speed of adoption of theoretical insights by practitioners as an important issue and suggested measures to reduce the uncertainty by practitioners whether a certain theoretical breakthrough will translate into practical gains in their field of work, and tackle the lack of ready-to-use algorithm libraries, which provide stable, bug-free and well-tested implementations for algorithmic problems and expose an easy-to-use interface for library consumers. But also, promising algorithmic approaches have been neglected due to difficulties in mathematical analysis. The term "algorithm engineering" was first used with specificity in 1997, with the first Workshop on Algorithm Engineering (WAE97), organized by Giuseppe F. Italiano. == Difference from algorithm theory == Algorithm engineering does not intend to replace or compete with algorithm theory, but tries to enrich, refine and reinforce its formal approaches with experimental algorithmics (also called empirical algorithmics). This way it can provide new insights into the efficiency and performance of algorithms in cases where the algorithm at hand is less amenable to algorithm theoretic analysis, formal analysis pessimistically suggests bounds which are unlikely to appear on inputs of practical interest, the algorithm relies on the intricacies of modern hardware architectures like data locality, branch prediction, instruction stalls, instruction latencies which the machine model used in Algorithm Theory is unable to capture in the required detail, the crossover between competing algorithms with different constant costs and asymptotic behaviors needs to be determined. == Methodology == Some researchers describe algorithm engineering's methodology as a cycle consisting of algorithm design, analysis, implementation and experimental evaluation, joined by further aspects like machine models or realistic inputs. They argue that equating algorithm engineering with experimental algorithmics is too limited, because viewing design and analysis, implementation and experimentation as separate activities ignores the crucial feedback loop between those elements of algorithm engineering. === Realistic models and real inputs === While specific applications are outside the methodology of algorithm engineering, they play an important role in shaping realistic models of the problem and the underlying machine, and supply real inputs and other design parameters for experiments. === Design === Compared to algorithm theory, which usually focuses on the asymptotic behavior of algorithms, algorithm engineers need to keep further requirements in mind: Simplicity of the algorithm, implementability in programming languages on real hardware, and allowing code reuse. Additionally, constant factors of algorithms have such a considerable impact on real-world inputs that sometimes an algorithm with worse asymptotic behavior performs better in practice due to lower constant factors. === Analysis === Some problems can be solved with heuristics and randomized algorithms in a simpler and more efficient fashion than with deterministic algorithms. Unfortunately, this makes even simple randomized algorithms difficult to analyze because there are subtle dependencies to be taken into account. === Implementation === Huge semantic gaps between theoretical insights, formulated algorithms, programming languages and hardware pose a challenge to efficient implementations of even simple algorithms, because small implementation details can have rippling effects on execution behavior. The only reliable way to compare several implementations of an algorithm is to spend an considerable amount of time on tuning and profiling, running those algorithms on multiple architectures, and looking at the generated machine code. === Experiments === See: Experimental algorithmics === Application engineering === Implementations of algorithms used for experiments differ in significant ways from code usable in applications. While the former prioritizes fast prototyping, performance and instrumentation for measurements during experiments, the latter requires thorough testing, maintainability, simplicity, and tuning for particular classes of inputs. === Algorithm libraries === Stable, well-tested algorithm libraries like LEDA play an important role in technology transfer by speeding up the adoption of new algorithms in applications. Such libraries reduce the required investment and risk for practitioners, because it removes the burden of understanding and implementing the results of academic research. == Conferences == Two main conferences on Algorithm Engineering are organized annually, namely: Symposium on Experimental Algorithms (SEA), established in 1997 (formerly known as WEA). SIAM Meeting on Algorithm Engineering and Experiments (ALENEX), established in 1999. The 1997 Workshop on Algorithm Engineering (WAE'97) was held in Venice (Italy) on September 11–13, 1997. The Third International Workshop on Algorithm Engineering (WAE'99) was held in London, UK in July 1999. The first Workshop on Algorithm Engineering and Experimentation (ALENEX99) was held in Baltimore, Maryland on January 15–16, 1999. It was sponsored by DIMACS, the Center for Discrete Mathematics and Theoretical Computer Science (at Rutgers University), with additional support from SIGACT, the ACM Special Interest Group on Algorithms and Computation Theory, and SIAM, the Society for Industrial and Applied Mathematics.

    Read more →
  • Irish logarithm

    Irish logarithm

    The Irish logarithm was a system of number manipulation invented by Percy Ludgate for machine multiplication. The system used a combination of mechanical cams as lookup tables and mechanical addition to sum pseudo-logarithmic indices to produce partial products, which were then added to produce results. The technique is similar to Zech logarithms (also known as Jacobi logarithms), but uses a system of indices original to Ludgate. == Concept == Ludgate's algorithm compresses the multiplication of two single decimal numbers into two table lookups (to convert the digits into indices), the addition of the two indices to create a new index which is input to a second lookup table that generates the output product. Because both lookup tables are one-dimensional, and the addition of linear movements is simple to implement mechanically, this allows a less complex mechanism than would be needed to implement a two-dimensional 10×10 multiplication lookup table. Ludgate stated that he deliberately chose the values in his tables to be as small as he could make them; given this, Ludgate's tables can be simply constructed from first principles, either via pen-and-paper methods, or a systematic search using only a few tens of lines of program code. They do not correspond to either Zech logarithms, Remak indexes or Korn indexes. == Pseudocode == The following is an implementation of Ludgate's Irish logarithm algorithm in the Python programming language: Table 1 is taken from Ludgate's original paper; given the first table, the contents of Table 2 can be trivially derived from Table 1 and the definition of the algorithm. Note since that the last third of the second table is entirely zeros, this could be exploited to further simplify a mechanical implementation of the algorithm.

    Read more →
  • Inpainting

    Inpainting

    Inpainting is a conservation process where damaged, deteriorated, or missing parts of an artwork are filled in to present a complete image. This process is commonly used in image restoration. It can be applied to both physical and digital art mediums such as oil or acrylic paintings, chemical photographic prints, sculptures, or digital images and video. With its roots in physical artwork, such as painting and sculpture, traditional inpainting is performed by a trained art conservator who has carefully studied the artwork to determine the mediums and techniques used in the piece, potential risks of treatments, and ethical appropriateness of treatment. == History == The modern use of inpainting can be traced back to Pietro Edwards (1744–1821), Director of the Restoration of the Public Pictures in Venice, Italy. Using a scientific approach, Edwards focused his restoration efforts on the intentions of the artist. It was during the 1930 International Conference for the Study of Scientific Methods for the Examination and Preservation of Works of Art, that the modern approach to inpainting was established. Helmut Ruhemann (1891–1973), a German restorer and conservator, led the discussions on the use of inpainting in conservation. Helmut Ruhemann was a leading figure in modernizing restoration and conservation. His greatest contribution to the field of conservation "was his insistence on following the methods of the original painter exactly, and on understanding the painter's artistic intention". After his career of over 40 years as a conservator, Ruhemann published his treatise The Cleaning of Paintings: Problems & Potentialities in 1968. In describing his method, Ruhemann states that "The surface [of the fill] should be slightly lower than that of the surrounding paint to allow for the thickness of the inpainting...Inpainting medium should look and behave like the original medium, but must not darken with age." Cesare Brandi (1906–1988) developed the teoria del restauro, the inpainting approach combining aesthetics and psychology. However, this approach was used primarily by Italian restorers and conservators, with the terminology becoming widespread in the 1990s. Technological advancements led to new applications of inpainting. Widespread use of digital techniques range from entirely automatic computerized inpainting to tools used to simulate the process manually. Since the mid-1990s, the process of inpainting has evolved to include digital media. More commonly known as image or video interpolation, a form of estimation, digital inpainting includes the use of computer software that relies on sophisticated algorithms to replace lost or corrupted parts of the image data. == Ethics == In order to preserve the integrity of an original artwork, any inpainting technique or treatment applied to physical or digital work should be reversible or distinguishable from the original content of the artwork. Prior to any treatments, conservators proceed according to the American Institute of Conservation of Historical and Artistic Works. There are several ethic considerations before Inpainting can be justified. Various deliberation decisions over the ethical appropriateness of the amount and type of inpainting done, resides on many factors. As most conservation treatments, inpainting's ethical questions rest mainly with authenticity, reversibility and documentation.Any intervention to compensate for loss should be documented in treatment records and reports and should be detectable by common examination methods. Such compensation should be reversible and should not falsely modify the known aesthetic, conceptual, and physical characteristics of the cultural property, especially by removing or obscuring original material.New technologies and the aesthetic demand for perfect images without imperfections challenge conservators' ethical practices to protect the integrity of originals. == Methods == Inpainting methods and techniques depend on the desired goal and type of image being treated. Treatments to fill in the gaps are different between physical and digital art. In inpainting, detailed records of the initial state of the images can help with the treatment and replicate the original closer. === Physical inpainting === Inpainting is rooted in the conservation and restoration of paintings. Inpainting can aim to make a visual improvement to the artwork as a whole by repairing missing or damaged parts using methods and materials equivalent to the original artist's work. ==== Application techniques ==== By studying the painting methods of various artists and the composition of paints used historically, conservators are able to restore works very closely to their original visual appearance. The picture as a whole determines how to fill in the gap. Helmut Ruhemann's inpainting techniques by Jessell have procedures to "preserve" the quality of oil and tempera paintings. === Digital inpainting === Many programs are able to reconstruct missing or damaged areas of digital photographs and videos. Most widely known for use with digital images is Adobe Photoshop. Given the various abilities of the digital camera and the digitization of old photos, inpainting has become an automatic process that can be performed on digital images. The inpainting techniques can be applied to object removal, text removal, and other automatic modifications of images and videos. In video special effects, inpainting is usually performed after video matting. They can also be observed in applications like image compression and super-resolution. In photography and cinema, it is used for film restoration to reverse, repair, or mitigate deterioration (e.g., physical damage such as cracks in photographs, scratches and dust spots in film, or chemical damage resulting in image loss; performed infrared cleaning). It can also be used for removing red-eye, the stamped date from photographs, and objects for creative effect. This technique can be used to replace any lost blocks in the coding and transmission of images, for example, in a streaming video. It can also be used to remove logos or watermarks in videos. Deep learning neural network-based inpainting can be used for decensoring images. Deep image prior-based techniques can be used for digital image inpainting, where a trained deep learning model is either unavailable or infeasible. Deep models for visual content generation, like text-to-image or text-to-video, learn complex priors over the distribution of visual content, and can be used to inpaint missing parts. For example, videos can be separated into layers, using a technique called omnimatte, which either pretrain an omnimatte model or without any training using an omnimatte-zero model. Three main groups of 2D image-inpainting algorithms can be found in the literature. The first one to be noted is structural (or geometric) inpainting, the second one is texture inpainting, the last one is a combination of these two techniques. They use the information of the known or non-destroyed image areas in order to fill the gap, similar to how physical images are restored. ==== Structural ==== Structural or geometric inpainting is used for smooth images that have strong, defined borders. There are many different approaches to geometric inpainting, but they all come from the idea that geometry can be recovered from similar areas or domains. Bertalmio proposed a method of structural inpainting that mimics how conservators address painting restoration. Bertalmio proposed that by progressively transferring similar information from the borders of an inpainting domain inwards, the gap can be filled. ==== Textural ==== While structural/geometric inpainting works to repair smooth images, textural inpainting works best with images that are heavily textured. Texture has a repetitive pattern which means that a missing portion cannot be restored by continuing the level lines into the gap; level lines provide a complete, stable representation of an image. To repair texture in an image, one can combine frequency and spatial domain information to fill in a selected area with a desired texture. This method, while the most simple and very effective, works well when selecting a texture to be in-painted. For a texture that covers a wider area or a larger frame one would have to go through the image segmenting the areas to be in-painted and selecting the corresponding textures from throughout the image; there are programs that can help find the corresponding areas that work in a similar way as 'find and replace' works in a word processor. ==== Combined structural and textural ==== Combined structural and textural inpainting approaches simultaneously try to perform texture- and structure-filling in regions of missing image information. Most parts of an image consist of texture and structure and the boundaries between image regions contain a large amount of structural information. This is the result when blending differ

    Read more →
  • DIKW pyramid

    DIKW pyramid

    The DIKW pyramid (also known as the knowledge pyramid or information hierarchy) is a model describing relationships between data, information, knowledge and wisdom sometimes also stylized as a chain, refer to models of possible structural and functional relationships between a set of components—often four, data, information, knowledge, and wisdom. The concept has roots predating the 1980s. In the latter years of that decade, interest in the models grew after explicit presentations and discussions, including from Milan Zeleny, Russell Ackoff, and Robert W. Lucky. Subsequent important discussions extended along theoretical and practical lines into the coming decades. While debate continues as to actual meaning of the component terms of DIKW-type models, and the actual nature of their relationships—including occasional doubt being cast over any simple, linear, unidirectional model—even so they have become very popular visual representations in use by business, the military, and others. Among the academic and popular, not all versions of the DIKW-type models include all four components (earlier ones excluding data, later ones excluding or downplaying wisdom, and several including additional components (for instance Ackoff inserting "understanding" before and Zeleny adding "enlightenment" after the wisdom component). In addition, DIKW-type models are no longer always presented as pyramids, instead also as a chart or framework (e.g., by Zeleny), as flow diagrams (e.g., by Liew, and by Chisholm et al.), and sometimes as a continuum (e.g., by Choo et al.). == Short description == As Rowley noted in 2007, the DIKW model "is often quoted, or used implicitly, in definitions of data, information and knowledge in the information management, information systems and knowledge management literatures, but [as of that date] there ha[d] been limited direct discussion of the hierarchy". Reviews of textbooks and a survey of scholars in relevant fields indicate that there was not a consensus as to definitions used in the model as of that date, and as reviewed by Liew in that year, even less "in the description of the processes that transform components lower in the hierarchy into those above them". Zins work, published in 2007—from studies in 2003-2005 that documented "130 definitions of data, information, and knowledge formulated by 45 scholars", published in 2007—to suggest that the data–information–knowledge components of DIKW refer to a class of no less than five models, as a function of whether data, information, and knowledge are each conceived of as subjective, objective (what Zins terms, "universal" or "collective") or both. In Zins' usage, subjective and objective "are not related to arbitrariness and truthfulness, which are usually attached to the concepts of subjective knowledge and objective knowledge". Information science, Zins argues, studies data and information, but not knowledge, as knowledge is an internal (subjective) rather than an external (universal–collective) phenomenon. == Representations == === Graphical representation === DIKW is a hierarchical model often depicted as a pyramid, sometimes as a chain, with data at its base and wisdom at its apex (or chain-beginning and -end). Both Zeleny and Ackoff have been credited with originating the pyramid representation, although neither used a pyramid to present their ideas. According to Wallace, Debons and colleagues may have been the first to "present the hierarchy graphically". Many variations of the DIKW-type pyramid have been produced. One, in use by knowledge managers in the United States Department of Defense, attempts to show the DIKW progression to enable effective decisions and consequent activities supporting shared understanding throughout defense organizations, as well as supporting management of risks associated with decisions. DIKW-type hierarchical information paradigms have also been represented as two-dimensional charts, and as flow diagrams, where relationships between the components may be presented less hierarchically, with defining aspects of the relationships, feedback loops, etc. === Computational representation === Intelligent decision support systems are trying to improve decision making by introducing new technologies and methods from the domain of modeling and simulation in general, and in particular from the domain of intelligent software agents in the contexts of agent-based modeling. The following example describes a military decision support system, but the architecture and underlying conceptual idea are transferable to other application domains: The value chain starts with data quality describing the information within the underlying command and control systems. Information quality tracks the completeness, correctness, currency, consistency and precision of the data items and information statements available. Knowledge quality deals with procedural knowledge and information embedded in the command and control system such as templates for adversary forces, assumptions about entities such as ranges and weapons, and doctrinal assumptions, often coded as rules. Awareness quality measures the degree of using the information and knowledge embedded within the command and control system. Awareness is explicitly placed in the cognitive domain. By the introduction of a common operational picture, data are put into context, which leads to information instead of data. The next step, which is enabled by service-oriented web-based infrastructures (but not yet operationally used), is the use of models and simulations for decision support. Simulation systems are the prototype for procedural knowledge, which is the basis for knowledge quality. Finally, using intelligent software agents to continually observe the battle sphere, apply models and simulations to analyze what is going on, to monitor the execution of a plan, and to do all the tasks necessary to make the decision maker aware of what is going on, command and control systems could even support situational awareness, the level in the value chain traditionally limited to pure cognitive methods. == History == Danny P. Wallace, a professor of library and information science, explained that the origin of the DIKW pyramid is uncertain: The presentation of the relationships among data, information, knowledge, and sometimes wisdom in a hierarchical arrangement has been part of the language of information science for many years. Although it is uncertain when and by whom those relationships were first presented, the ubiquity of the notion of a hierarchy is embedded in the use of the acronym DIKW as a shorthand representation for the data-to-information-to-knowledge-to-wisdom transformation.Many authors think that the idea of the DIKW relationship originated from two lines in the poem "Choruses", by T. S. Eliot, that appeared in the pageant play The Rock, in 1934: === Knowledge, intelligence, and wisdom === In 1927, Clarence W. Barron addressed his employees at Dow Jones & Company on the hierarchy: "Knowledge, Intelligence and Wisdom". === Data, information, knowledge === In 1955, English-American economist and educator Kenneth Boulding presented a variation on the hierarchy consisting of "signals, messages, information, and knowledge". However, "[t]he first author to distinguish among data, information, and knowledge and to also employ the term 'knowledge management' may have been American educator Nicholas L. Henry", in a 1974 journal article. === Data, information, knowledge, wisdom === Other early versions (prior to 1982) of the hierarchy that refer to a data tier include those of Chinese-American geographer Yi-Fu Tuan and sociologist-historian Daniel Bell.. In 1980, Irish-born engineer Mike Cooley invoked the same hierarchy in his critique of automation and computerization, in his book Architect or Bee?: The Human / Technology Relationship. Thereafter, in 1987, Czechoslovakia-born educator Milan Zeleny mapped the components of the hierarchy to knowledge forms: know-nothing, know-what, know-how, and know-why. Zeleny "has frequently been credited with proposing the [representation of DIKW as a pyramid ]... although he actually made no reference to any such graphical model." The hierarchy appears again in a 1988 address to the International Society for General Systems Research, by American organizational theorist Russell Ackoff, published in 1989. Subsequent authors and textbooks cite Ackoff's as the "original articulation" of the hierarchy or otherwise credit Ackoff with its proposal. Ackoff's version of the model includes an understanding tier (as Adler had, before him), interposed between knowledge and wisdom. Although Ackoff did not present the hierarchy graphically, he has also been credited with its representation as a pyramid. In 1989, Bell Labs veteran Robert W. Lucky wrote about the four-tier "information hierarchy" in the form of a pyramid in his book Silicon Dreams. In the same year as Ackoff presented his a

    Read more →
  • Conceptualization (information science)

    Conceptualization (information science)

    In information science, a conceptualization is an abstract simplified view of some selected parts of the world, containing the objects, concepts, and other entities that are presumed of interest for some particular purpose and the relationships between them. An explicit specification of a conceptualization is an ontology, and it may occur that a conceptualization can be realized by several distinct ontologies. An ontological commitment in describing ontological comparisons is taken to refer to that subset of elements of an ontology shared with all the others. "An ontology is language-dependent", its objects and interrelations described within the language it uses, while a conceptualization is always the same, more general, its concepts existing "independently of the language used to describe it". The relation between these terms is shown in the figure to the right. Not all workers in knowledge engineering use the term "conceptualization", but instead refer to the conceptualization itself, or to the ontological commitment of all its realizations, as an overarching ontology. == Purpose and implementation == As a higher level abstraction, a conceptualization facilitates the discussion and comparison of its various ontologies, facilitating knowledge sharing and reuse. Each ontology based upon the same overarching conceptualization maps the conceptualization into specific elements and their relationships. The question then arises as to how to describe the "conceptualization" in terms that can encompass multiple ontologies. This issue has been called the Tower of Babel problem, that is, how can persons used to one ontology talk with others using a different ontology? This problem is easily grasped, but a general resolution is not at hand. It can be a "bottom-up" or a "top-down" approach, or something in between. However, in more artificial situations, such as information systems, the idea of a "conceptualization" and the "ontological commitment" of various ontologies that realize the "conceptualization" is possible. The formation of a conceptualization and its ontologies involves these steps: specification of the conceptualization ontology concepts: every definition involves the definitions of other terms relationships between the concepts: this step maps conceptual relationships onto the ontology structure groups of concepts: this step may lead to the creation of sub-ontologies formal description of ontology commitments, for example, to make them computer readable An example of moving conception into a language leading to a variety of ontologies is the expression of a process in pseudocode (a strictly structured form of ordinary language) leading to implementation in several different formal computer languages like Lisp or Fortran. The pseudocode makes it easier to understand the instructions and compare implementations, but the formal languages make possible the compilation of the ideas as computer instructions. Another example is mathematics, where a very general formulation (the analog of a conceptualization) is illustrated with "applications" that are more specialized examples. For instance, aspects of a function space can be illustrated using a vector space or a topological space that introduce interpretations of the "elements" of the conceptualization and additional relationships between them but preserve the connections required in the function space.

    Read more →
  • Certifying algorithm

    Certifying algorithm

    In theoretical computer science, a certifying algorithm is an algorithm that outputs, together with a solution to the problem it solves, a proof that the solution is correct. A certifying algorithm is said to be efficient if the combined runtime of the algorithm and a proof checker is slower by at most a constant factor than the best known non-certifying algorithm for the same problem. The proof produced by a certifying algorithm should be in some sense simpler than the algorithm itself, for otherwise any algorithm could be considered certifying (with its output verified by running the same algorithm again). Sometimes this is formalized by requiring that a verification of the proof take less time than the original algorithm, while for other problems (in particular those for which the solution can be found in linear time) simplicity of the output proof is considered in a less formal sense. For instance, the validity of the output proof may be more apparent to human users than the correctness of the algorithm, or a checker for the proof may be more amenable to formal verification. Implementations of certifying algorithms that also include a checker for the proof generated by the algorithm may be considered to be more reliable than non-certifying algorithms. For, whenever the algorithm is run, one of three things happens: it produces a correct output (the desired case), it detects a bug in the algorithm or its implication (undesired, but generally preferable to continuing without detecting the bug), or both the algorithm and the checker are faulty in a way that masks the bug and prevents it from being detected (undesired, but unlikely as it depends on the existence of two independent bugs). == Examples == Many examples of problems with checkable algorithms come from graph theory. For instance, a classical algorithm for testing whether a graph is bipartite would simply output a Boolean value: true if the graph is bipartite, false otherwise. In contrast, a certifying algorithm might output a 2-coloring of the graph in the case that it is bipartite, or a cycle of odd length if it is not. Any graph is bipartite if and only if it can be 2-colored, and non-bipartite if and only if it contains an odd cycle. Both checking whether a 2-coloring is valid and checking whether a given odd-length sequence of vertices is a cycle may be performed more simply than testing bipartiteness. Analogously, it is possible to test whether a given directed graph is acyclic by a certifying algorithm that outputs either a topological order or a directed cycle. It is possible to test whether an undirected graph is a chordal graph by a certifying algorithm that outputs either an elimination ordering (an ordering of all vertices such that, for every vertex, the neighbors that are later in the ordering form a clique) or a chordless cycle. And it is possible to test whether a graph is planar by a certifying algorithm that outputs either a planar embedding or a Kuratowski subgraph. The extended Euclidean algorithm for the greatest common divisor of two integers x and y is certifying: it outputs three integers g (the divisor), a, and b, such that ax + by = g. This equation can only be true of multiples of the greatest common divisor, so testing that g is the greatest common divisor may be performed by checking that g divides both x and y and that this equation is correct.

    Read more →
  • Production (computer science)

    Production (computer science)

    In computer science, a production or production rule is a rewrite rule that replaces some symbols with other symbols. A finite set of productions P {\displaystyle P} is the main component in the specification of a formal grammar (specifically a generative grammar). In such grammars, a set of productions is a special case of relation on the set of strings V ∗ {\displaystyle V^{}} (where ∗ {\displaystyle {}^{}} is the Kleene star operator) over a finite set of symbols V {\displaystyle V} called a vocabulary that defines which non-empty strings can be substituted with others. The set of productions is thus a special kind subset P ⊂ V ∗ × V ∗ {\displaystyle P\subset V^{}\times V^{}} and productions are then written in the form u → v {\displaystyle u\to v} to mean that ( u , v ) ∈ P {\displaystyle (u,v)\in P} (not to be confused with → {\displaystyle \to } being used as function notation, since there may be multiple rules for the same u {\displaystyle u} ). Given two subsets A , B ⊂ V ∗ {\displaystyle A,B\subset V^{}} , productions can be restricted to satisfy P ⊂ A × B {\displaystyle P\subset A\times B} , in which case productions are said "to be of the form A → B {\displaystyle A\to B} . Different choices and constructions of A , B {\displaystyle A,B} lead to different types of grammars. In general, any production of the form u → ϵ , {\displaystyle u\to \epsilon ,} where ϵ {\displaystyle \epsilon } is the empty string (sometimes also denoted λ {\displaystyle \lambda } ), is called an erasing rule, while productions that would produce strings out of nowhere, namely of the form ϵ → v , {\displaystyle \epsilon \to v,} are never allowed. In order to allow the production rules to create meaningful sentences, the vocabulary is partitioned into (disjoint) sets Σ {\displaystyle \Sigma } and N {\displaystyle N} providing two different roles: Σ {\displaystyle \Sigma } denotes the terminal symbols known as an alphabet containing the symbols allowed in a sentence; N {\displaystyle N} denotes nonterminal symbols, containing a distinguished start symbol S ∈ N {\displaystyle S\in N} , that are needed together with the production rules to define how to build the sentences. In the most general case of an unrestricted grammar, a production u → v {\displaystyle u\to v} , is allowed to map arbitrary strings u {\displaystyle u} and v {\displaystyle v} in V {\displaystyle V} (terminals and nonterminals), as long as u {\displaystyle u} is not empty. So unrestricted grammars have productions of the form V ∗ ∖ { ϵ } → V ∗ {\displaystyle V^{}\setminus \{\epsilon \}\to V^{}} or if we want to disallow changing finished sentences V ∗ N V ∗ = ( V ∗ ∖ Σ ∗ ) → V ∗ {\displaystyle V^{}NV^{}=(V^{}\setminus \Sigma ^{})\to V^{}} , where V ∗ N V ∗ {\displaystyle V^{}NV^{}} indicates concatenation and forces a non-terminal symbol to always be present on the left-hand side of the productions, and ∖ {\displaystyle \setminus } denotes set minus or set difference. If we do not allow the start symbol to occur in v {\displaystyle v} (the word on the right side), we have to replace V ∗ {\displaystyle V^{}} with ( V ∖ { S } ) ∗ {\displaystyle (V\setminus \{S\})^{}} on the right-hand side. The other types of formal grammar in the Chomsky hierarchy impose additional restrictions on what constitutes a production. Notably in a context-free grammar, the left-hand side of a production must be a single nonterminal symbol. So productions are of the form: N → V ∗ {\displaystyle N\to V^{}} == Grammar generation == To generate a string in the language, one begins with a string consisting of only a single start symbol, and then successively applies the rules (any number of times, in any order) to rewrite this string. This stops when a string containing only terminals is obtained. The language consists of all the strings that can be generated in this manner. Any particular sequence of legal choices taken during this rewriting process yields one particular string in the language. If there are multiple different ways of generating this single string, then the grammar is said to be ambiguous. For example, assume the alphabet consists of a {\displaystyle a} and b {\displaystyle b} , with the start symbol S {\displaystyle S} , and we have the following rules: 1. S → a S b {\displaystyle S\rightarrow aSb} 2. S → b a {\displaystyle S\rightarrow ba} then we start with S {\displaystyle S} , and can choose a rule to apply to it. If we choose rule 1, we replace S {\displaystyle S} with a S b {\displaystyle aSb} and obtain the string a S b {\displaystyle aSb} . If we choose rule 1 again, we replace S {\displaystyle S} with a S b {\displaystyle aSb} and obtain the string a a S b b {\displaystyle aaSbb} . This process is repeated until we only have symbols from the alphabet (i.e., a {\displaystyle a} and b {\displaystyle b} ). If we now choose rule 2, we replace S {\displaystyle S} with b a {\displaystyle ba} and obtain the string a a b a b b {\displaystyle aababb} , and are done. We can write this series of choices more briefly, using symbols: S ⇒ a S b ⇒ a a S b b ⇒ a a b a b b {\displaystyle S\Rightarrow aSb\Rightarrow aaSbb\Rightarrow aababb} . The language of the grammar is the set of all the strings that can be generated using this process: { b a , a b a b , a a b a b b , a a a b a b b b , … } {\displaystyle \{ba,abab,aababb,aaababbb,\dotsc \}} .

    Read more →
  • Enterprise information system

    Enterprise information system

    An Enterprise Information System (EIS) is any kind of information system which improves the functions of enterprise business processes through integration. This means typically offering high quality service, dealing with large volumes of data and capable of supporting some large and possibly complex organization or enterprise. An EIS must be able to be used by all parts and all levels of an enterprise. The word enterprise can have various connotations. Frequently the term is used only to refer to very large organizations such as multi-national companies or public-sector organizations. However, the term may be used to mean virtually anything, by virtue of it having become a corporate-speak buzzword. == Purpose == Enterprise information systems provide a technology platform that enables organizations to integrate and coordinate their business processes on a robust foundation. An EIS is currently used in conjunction with customer relationship management and supply chain management to automate business processes. An enterprise information system provides a single system that is central to the organization that ensuring information can be shared across all functional levels and management hierarchies. An EIS can be used to increase business productivity and reduce service cycles, product development cycles and marketing life cycles. It may be used to amalgamate existing applications. Other outcomes include higher operational efficiency and cost savings. Financial value is not usually a direct outcome from the implementation of an enterprise information system. == Design stage == At the design stage the main characteristic of EIS efficiency evaluation is the probability of timely delivery of various messages such as command, service, etc. == Information systems == Enterprise systems create a standard data structure and are invaluable in eliminating the problem of information fragmentation caused by multiple information systems within an organization. An EIS differentiates itself from legacy systems in that it is self-transactional, self-helping and adaptable to general and specialist conditions. Unlike an enterprise information system, legacy systems are limited to department-wide communications. A typical enterprise information system would be housed in one or more data centers, would run enterprise software, and could include applications that typically cross organizational borders such as content management systems.

    Read more →
  • Flajolet–Martin algorithm

    Flajolet–Martin algorithm

    The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements in a stream with a single pass and space-consumption logarithmic in the maximal number of possible distinct elements in the stream (the count-distinct problem). The algorithm was introduced by Philippe Flajolet and G. Nigel Martin in their 1984 article "Probabilistic Counting Algorithms for Data Base Applications". Later it has been refined in "LogLog counting of large cardinalities" by Marianne Durand and Philippe Flajolet, and "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm" by Philippe Flajolet et al. In their 2010 article "An optimal algorithm for the distinct elements problem", Daniel M. Kane, Jelani Nelson and David P. Woodruff give an improved algorithm, which uses nearly optimal space and has optimal O(1) update and reporting times. == The algorithm == Assume that we are given a hash function h a s h ( x ) {\displaystyle \mathrm {hash} (x)} that maps input x {\displaystyle x} to integers in the range [ 0 ; 2 L − 1 ] {\displaystyle [0;2^{L}-1]} , and where the outputs are sufficiently uniformly distributed. Note that the set of integers from 0 to 2 L − 1 {\displaystyle 2^{L}-1} corresponds to the set of binary strings of length L {\displaystyle L} . For any non-negative integer y {\displaystyle y} , define b i t ( y , k ) {\displaystyle \mathrm {bit} (y,k)} to be the k {\displaystyle k} -th bit in the binary representation of y {\displaystyle y} , such that: y = ∑ k ≥ 0 b i t ( y , k ) 2 k . {\displaystyle y=\sum _{k\geq 0}\mathrm {bit} (y,k)2^{k}.} We then define a function ρ ( y ) {\displaystyle \rho (y)} that outputs the position of the least-significant set bit in the binary representation of y {\displaystyle y} , and L {\displaystyle L} if no such set bit can be found as all bits are zero: ρ ( y ) = { min { k ≥ 0 ∣ b i t ( y , k ) ≠ 0 } y > 0 L y = 0 {\displaystyle \rho (y)={\begin{cases}\min\{k\geq 0\mid \mathrm {bit} (y,k)\neq 0\}&y>0\\L&y=0\end{cases}}} Note that with the above definition we are using 0-indexing for the positions, starting from the least significant bit. For example, ρ ( 13 ) = ρ ( 1101 2 ) = 0 {\displaystyle \rho (13)=\rho (1101_{2})=0} , since the least significant bit is a 1 (0th position), and ρ ( 8 ) = ρ ( 1000 2 ) = 3 {\displaystyle \rho (8)=\rho (1000_{2})=3} , since the least significant set bit is at the 3rd position. At this point, note that under the assumption that the output of our hash function is uniformly distributed, then the probability of observing a hash output ending with 2 k {\displaystyle 2^{k}} (a one, followed by k {\displaystyle k} zeroes) is 2 − ( k + 1 ) {\displaystyle 2^{-(k+1)}} , since this corresponds to flipping k {\displaystyle k} heads and then a tail with a fair coin. Now the Flajolet–Martin algorithm for estimating the cardinality of a multiset M {\displaystyle M} is as follows: Initialize a bit-vector BITMAP to be of length L {\displaystyle L} and contain all 0s. For each element x {\displaystyle x} in M {\displaystyle M} : Calculate the index i = ρ ( h a s h ( x ) ) {\displaystyle i=\rho (\mathrm {hash} (x))} . Set B I T M A P [ i ] = 1 {\displaystyle \mathrm {BITMAP} [i]=1} . Let R {\displaystyle R} denote the smallest index i {\displaystyle i} such that B I T M A P [ i ] = 0 {\displaystyle \mathrm {BITMAP} [i]=0} . Estimate the cardinality of M {\displaystyle M} as 2 R / ϕ {\displaystyle 2^{R}/\phi } , where ϕ ≈ 0.77351 {\displaystyle \phi \approx 0.77351} . The idea is that if n {\displaystyle n} is the number of distinct elements in the multiset M {\displaystyle M} , then B I T M A P [ 0 ] {\displaystyle \mathrm {BITMAP} [0]} is accessed approximately n / 2 {\displaystyle n/2} times, B I T M A P [ 1 ] {\displaystyle \mathrm {BITMAP} [1]} is accessed approximately n / 4 {\displaystyle n/4} times and so on. Consequently, if i ≫ log 2 ⁡ n {\displaystyle i\gg \log _{2}n} , then B I T M A P [ i ] {\displaystyle \mathrm {BITMAP} [i]} is almost certainly 0, and if i ≪ log 2 ⁡ n {\displaystyle i\ll \log _{2}n} , then B I T M A P [ i ] {\displaystyle \mathrm {BITMAP} [i]} is almost certainly 1. If i ≈ log 2 ⁡ n {\displaystyle i\approx \log _{2}n} , then B I T M A P [ i ] {\displaystyle \mathrm {BITMAP} [i]} can be expected to be either 1 or 0. The correction factor ϕ ≈ 0.77351 {\displaystyle \phi \approx 0.77351} (OEIS: A244256) is found by calculations, which can be found in the original article. == Improving accuracy == A problem with the Flajolet–Martin algorithm in the above form is that the results vary significantly. A common solution has been to run the algorithm multiple times with k {\displaystyle k} different hash functions and combine the results from the different runs. One idea is to take the mean of the k {\displaystyle k} results together from each hash function, obtaining a single estimate of the cardinality. The problem with this is that averaging is very susceptible to outliers (which are likely here). A different idea is to use the median, which is less prone to be influences by outliers. The problem with this is that the results can only take form 2 R / ϕ {\displaystyle 2^{R}/\phi } , where R {\displaystyle R} is integer. A common solution is to combine both the mean and the median: Create k ⋅ l {\displaystyle k\cdot l} hash functions and split them into k {\displaystyle k} distinct groups (each of size l {\displaystyle l} ). Within each group use the mean for aggregating together the l {\displaystyle l} results, and finally take the median of the k {\displaystyle k} group estimates as the final estimate. The 2007 HyperLogLog algorithm splits the multiset into subsets and estimates their cardinalities, then it uses the harmonic mean to combine them into an estimate for the original cardinality.

    Read more →
  • Universal Data Element Framework

    Universal Data Element Framework

    The Universal Data Element Framework (UDEF) was a controlled vocabulary developed by The Open Group. It provided a framework for categorizing, naming, and indexing data. It assigned to every item of data a structured alphanumeric tag plus a controlled vocabulary name that describes the meaning of the data. This allowed relating data elements to similar elements defined by other organizations. UDEF defined a Dewey-decimal like code for each concept. For example, an "employee number" is often used in human resource management. It has a UDEF tag a.5_12.35.8 and a controlled vocabulary description "Employee.PERSON_Employer.Assigned.IDENTIFIER". UDEF has been superseded by the Open Data Element Framework (ODEF). == Examples == In an application used by a hospital, the last name and first name of several people could include the following example concepts: Patient Person Family Name – find the word “Patient” under the UDEF object “Person” and find the word “Family” under the UDEF property “Name” Patient Person Given Name – find the word “Patient” under the UDEF object “Person” and find the word “Given” under the UDEF property “Name” Doctor Person Family Name – find the word “Doctor” under the UDEF object “Person” and find the word “Family” under the UDEF property “Name” Doctor Person Given Name – find the word “Doctor” under the UDEF object “Person” and find the word “Given” under the UDEF property “Name” For the examples above, the following UDEF IDs are available: “Patient Person Family Name” the UDEF ID is “au.5_11.10” “Patient Person Given Name” the UDEF ID is “au.5_12.10” “Doctor Person Family Name” the UDEF ID is “aq.5_11.10” “Doctor Person Given Name” the UDEF ID is “aq.5_12.10”

    Read more →
  • Error level analysis

    Error level analysis

    Error level analysis (ELA) is the analysis of compression artifacts in digital data with lossy compression such as JPEG. == Principles == When used, lossy compression is normally applied uniformly to a set of data, such as an image, resulting in a uniform level of compression artifacts. Alternatively, the data may consist of parts with different levels of compression artifacts. This difference may arise from the different parts having been repeatedly subjected to the same lossy compression a different number of times, or the different parts having been subjected to different kinds of lossy compression. A difference in the level of compression artifacts in different parts of the data may therefore indicate that the data has been edited. In the case of JPEG, even a composite with parts subjected to matching compressions will have a difference in the compression artifacts. In order to make the typically faint compression artifacts more readily visible, the data to be analyzed is subjected to an additional round of lossy compression, this time at a known, uniform level, and the result is subtracted from the original data under investigation. The resulting difference image is then inspected manually for any variation in the level of compression artifacts. In 2007, N. Krawetz denoted this method "error level analysis". Additionally, digital data formats such as JPEG sometimes include metadata describing the specific lossy compression used. If in such data the observed compression artifacts differ from those expected from the given metadata description, then the metadata may not describe the actual compressed data, and thus indicate that the data have been edited. == Limitations == By its nature, data without lossy compression, such as a PNG image, cannot be subjected to error level analysis. Consequently, since editing could have been performed on data without lossy compression with lossy compression applied uniformly to the edited, composite data, the presence of a uniform level of compression artifacts does not rule out editing of the data. Additionally, any non-uniform compression artifacts in a composite may be removed by subjecting the composite to repeated, uniform lossy compression. Also, if the image color space is reduced to 256 colors or less, for example, by conversion to GIF, then error level analysis will generate useless results. More significant, the actual interpretation of the level of compression artifacts in a given segment of the data is subjective, and the determination of whether editing has occurred is therefore not robust. == Controversy == In May 2013, Dr Neal Krawetz used error level analysis on the 2012 World Press Photo of the Year and concluded on his Hacker Factor blog that it was "a composite" with modifications that "fail to adhere to the acceptable journalism standards used by Reuters, Associated Press, Getty Images, National Press Photographer's Association, and other media outlets". The World Press Photo organizers responded by letting two independent experts analyze the image files of the winning photographer and subsequently confirmed the integrity of the files. One of the experts, Hany Farid, said about error level analysis that "It incorrectly labels altered images as original and incorrectly labels original images as altered with the same likelihood". Krawetz responded by clarifying that "It is up to the user to interpret the results. Any errors in identification rest solely on the viewer". In May 2015, the citizen journalism team Bellingcat wrote that error level analysis revealed that the Russian Ministry of Defense had edited satellite images related to the Malaysia Airlines Flight 17 disaster. In a reaction to this, image forensics expert Jens Kriese said about error level analysis: "The method is subjective and not based entirely on science", and that it is "a method used by hobbyists". On his Hacker Factor Blog, the inventor of error level analysis Neal Krawetz criticized both Bellingcat's use of error level analysis as "misinterpreting the results" but also on several points Jens Kriese's "ignorance" regarding error level analysis.

    Read more →
  • Research data archiving

    Research data archiving

    Research data archiving is the long-term storage of scholarly research data, including the natural sciences, social sciences, and life sciences. The various academic journals have differing policies regarding how much of their data and methods researchers are required to store in a public archive, and what is actually archived varies widely between different disciplines. Similarly, the major grant-giving institutions have varying attitudes towards public archiving of data. In general, the tradition of science has been for publications to contain sufficient information to allow fellow researchers to replicate and therefore test the research. In recent years this approach has become increasingly strained as research in some areas depends on large datasets which cannot easily be replicated independently. Data archiving is more important in some fields than others. In a few fields, all of the data necessary to replicate the work is already available in the journal article. In drug development, a great deal of data is generated and must be archived so researchers can verify that the reports the drug companies publish accurately reflect the data. Often used interchangeably, Data preservation and data archiving are both about protecting data for the long term, but they serve different purposes. Data preservation focuses on preventing data from being lost, damaged, or destroyed by creating backups, storing data in secure locations, and ensuring it remains accessible when needed. Data archiving, on the other hand, involves moving data that is no longer actively used to a separate storage location for long-term keeping. Archived data is often combined and compressed, and while it can still be accessed, it is not intended for regular use or frequent updates. The requirement of data archiving is a recent development in the history of science. It was made possible by advances in information technology allowing large amounts of data to be stored and accessed from central locations. For example, the American Geophysical Union (AGU) adopted their first policy on data archiving in 1993, about three years after the beginning of the WWW. This policy mandates that datasets cited in AGU papers must be archived by a recognised data center; it permits the creation of "data papers"; and it establishes AGU's role in maintaining data archives. But it makes no requirements on paper authors to archive their data. Prior to organized data archiving, researchers wanting to evaluate or replicate a paper would have to request data and methods information from the author. The academic community expects authors to share supplemental data. This process was recognized as wasteful of time and energy and obtained mixed results. Information could become lost or corrupted over the years. In some cases, authors simply refuse to provide the information. The need for data archiving and due diligence is greatly increased when the research deals with health issues or public policy formation. == Selected policies by journals == === Biotropica === Biotropica requires, as a condition for publication, that the data supporting the results in the paper and metadata describing them must be archived in an appropriate public archive such as Dryad, Figshare, GenBank, TreeBASE, or NCBI. Authors may elect to make the data publicly available as soon as the article is published or, if the technology of the archive allows, embargo access to the data up to three years after article publication. A statement describing Data Availability will be included in the manuscript as described in the instructions to authors. Exceptions to the required archiving of data may be granted at the discretion of the Editor-in-Chief for studies that include sensitive information (e.g., the location of endangered species). Our Editorial explaining the motivation for this policy can be found here. A more comprehensive list of data repositories is available here. Promoting a culture of collaboration with researchers who collect and archive data: The data collected by tropical biologists are often long-term, complex, and expensive to collect. The Board of Editors of Biotropica strongly encourages authors who re-use data archives archived data sets to include as fully engaged collaborators the scientists who originally collected them. We feel this will greatly enhance the quality and impact of the resulting research by drawing on the data collector’s profound insights into the natural history of the study system, reducing the risk of errors in novel analyses, and stimulating the cross-disciplinary and cross-cultural collaboration and training for which the ATBC and Biotropica are widely recognized. NB: Biotropica is one of only two journals that pays the fees for authors depositing data at Dryad. === The American Naturalist === The American Naturalist requires authors to deposit the data associated with accepted papers in a public archive. For gene sequence data and phylogenetic trees, deposition in GenBank or TreeBASE, respectively, is required. There are many possible archives that may suit a particular data set, including the Dryad repository for ecological and evolutionary biology data. All accession numbers for GenBank, TreeBASE, and Dryad must be included in accepted manuscripts before they go to Production. If the data is deposited somewhere else, please provide a link. If the data is culled from published literature, please deposit the collated data in Dryad for the convenience of your readers. Any impediments to data sharing should be brought to the attention of the editors at the time of submission so that appropriate arrangements can be worked out. === Journal of Heredity === The primary data underlying the conclusions of an article are critical to the verifiability and transparency of the scientific enterprise, and should be preserved in usable form for decades in the future. For this reason, Journal of Heredity requires that newly reported nucleotide or amino acid sequences, and structural coordinates, be submitted to appropriate public databases (e.g., GenBank; the EMBL Nucleotide Sequence Database; DNA Database of Japan; the Protein Data Bank; and Swiss-Prot). Accession numbers must be included in the final version of the manuscript. For other forms of data (e.g., microsatellite genotypes, linkage maps, images), the Journal endorses the principles of the Joint Data Archiving Policy (JDAP) in encouraging all authors to archive primary datasets in an appropriate public archive, such as Dryad, TreeBASE, or the Knowledge Network for Biocomplexity. Authors are encouraged to make data publicly available at time of publication or, if the technology of the archive allows, opt to embargo access to the data for a period up to a year after publication. The American Genetic Association also recognizes the vast investment of individual researchers in generating and curating large datasets. Consequently, we recommend that this investment be respected in secondary analyses or meta-analyses in a gracious collaborative spirit. === Molecular Ecology === Molecular Ecology expects that data supporting the results in the paper should be archived in an appropriate public archive, such as GenBank, Gene Expression Omnibus, TreeBASE, Dryad, the Knowledge Network for Biocomplexity, your own institutional or funder repository, or as Supporting Information on the Molecular Ecology web site. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species. === Nature === Such material must be hosted on an accredited independent site (URL and accession numbers to be provided by the author), or sent to the Nature journal at submission, either uploaded via the journal's online submission service, or if the files are too large or in an unsuitable format for this purpose, on CD/DVD (five copies). Such material cannot solely be hosted on an author's personal or institutional web site. Nature requires the reviewer to determine if all of the supplementary data and methods have been archived. The policy advises reviewers to consider several questions, including: "Should the authors be asked to provide supplementary methods or data to accompany the paper online? (Such data might include source code for modelling studies, detailed experimental protocols or mathematical derivations.) === Science === Science supports the efforts of databases that aggregate published data for the use of the scientific community. Therefore, before publication, large data sets (including microarray data, protein or DNA sequences, and atomic c

    Read more →
  • Car–Parrinello molecular dynamics

    Car–Parrinello molecular dynamics

    Car–Parrinello molecular dynamics (CPMD) refers to either a method used in molecular dynamics (also known as the Car–Parrinello method) or the computational chemistry software package used to implement this method. The CPMD method is one of the major methods for calculating ab initio molecular dynamics (ab initio MD or AIMD). Ab initio molecular dynamics (AIMD) is a computational method that uses first principles through quantum mechanics to simulate the motion of atoms in a system. It is a type of molecular dynamics (MD) simulation that does not rely on empirical potentials or force fields to describe the interactions between atoms, but rather calculates these interactions entirely from the electronic structure of the system using quantum mechanics. In an ab initio MD simulation, the total energy of the system is calculated at each time step using density functional theory (DFT), Hartree-Fock (HF), or other electronic structure calculation methods. The forces acting on each atom are then determined from the gradient of the energy with respect to the atomic coordinates, and the equations of motion are solved to predict the trajectory of the atoms. AIMD permits chemical bond breaking and forming events to occur and accounts for electronic polarization effect. Therefore, Ab initio MD simulations can be used to study a wide range of phenomena, including the structural, thermodynamic, and dynamic properties of materials and chemical reactions. They are particularly useful for systems that are not well described by empirical potentials or force fields, such as systems with strong electronic correlation or systems with many degrees of freedom. However, ab initio MD simulations are computationally demanding and require significant computational resources. The CPMD method is related to the more common Born–Oppenheimer molecular dynamics (BOMD) method in that the quantum mechanical effect of the electrons is included in the calculation of energy and forces for the classical motion of the nuclei. CPMD and BOMD are different types of AIMD. However, whereas BOMD treats the electronic structure problem within the time-independent Schrödinger equation, CPMD explicitly includes the electrons as active degrees of freedom, via (fictitious) dynamical variables. The software is a parallelized plane wave / pseudopotential implementation of density functional theory, particularly designed for ab initio molecular dynamics. == Car–Parrinello method == The Car–Parrinello method is a type of molecular dynamics, usually employing periodic boundary conditions, planewave basis sets, and density functional theory, proposed by Roberto Car and Michele Parrinello in 1985 while working at SISSA, who were subsequently awarded the Dirac Medal by ICTP in 2009. In contrast to Born–Oppenheimer molecular dynamics wherein the nuclear (ions) degree of freedom are propagated using ionic forces which are calculated at each iteration by approximately solving the electronic problem with conventional matrix diagonalization methods, the Car–Parrinello method explicitly introduces the electronic degrees of freedom as (fictitious) dynamical variables, writing an extended Lagrangian for the system which leads to a system of coupled equations of motion for both ions and electrons. In this way, an explicit electronic minimization at each time step, as done in Born–Oppenheimer MD, is not needed: after an initial standard electronic minimization, the fictitious dynamics of the electrons keeps them on the electronic ground state corresponding to each new ionic configuration visited along the dynamics, thus yielding accurate ionic forces. In order to maintain this adiabaticity condition, it is necessary that the fictitious mass of the electrons is chosen small enough to avoid a significant energy transfer from the ionic to the electronic degrees of freedom. This small fictitious mass in turn requires that the equations of motion are integrated using a smaller time step than the one (1–10 fs) commonly used in Born–Oppenheimer molecular dynamics. Currently, the CPMD method can be applied to systems that consist of a few tens or hundreds of atoms and access timescales on the order of tens of picoseconds. == General approach == In CPMD the core electrons are usually described by a pseudopotential and the wavefunction of the valence electrons are approximated by a plane wave basis set. The ground state electronic density (for fixed nuclei) is calculated self-consistently, usually using the density functional theory method. Kohn-Sham equations are often used to calculate the electronic structure, where electronic orbitals are expanded in a plane-wave basis set. Then, using that density, forces on the nuclei can be computed, to update the trajectories (using, e.g. the Verlet integration algorithm). In addition, however, the coefficients used to obtain the electronic orbital functions can be treated as a set of extra spatial dimensions, and trajectories for the orbitals can be calculated in this context. == Fictitious dynamics == CPMD is an approximation of the Born–Oppenheimer MD (BOMD) method. In BOMD, the electrons' wave function must be minimized via matrix diagonalization at every step in the trajectory. CPMD uses fictitious dynamics to keep the electrons close to the ground state, preventing the need for a costly self-consistent iterative minimization at each time step. The fictitious dynamics relies on the use of a fictitious electron mass (usually in the range of 400 – 800 a.u.) to ensure that there is very little energy transfer from nuclei to electrons, i.e. to ensure adiabaticity. Any increase in the fictitious electron mass resulting in energy transfer would cause the system to leave the ground-state BOMD surface. === Lagrangian === L = 1 2 ( ∑ I n u c l e i M I R ˙ I 2 + μ ∑ i o r b i t a l s ∫ d r | ψ ˙ i ( r , t ) | 2 ) − E [ { ψ i } , { R I } ] + ∑ i j Λ i j ( ∫ d r ψ i ψ j − δ i j ) , {\displaystyle {\mathcal {L}}={\frac {1}{2}}\left(\sum _{I}^{\mathrm {nuclei} }\ M_{I}{\dot {\mathbf {R} }}_{I}^{2}+\mu \sum _{i}^{\mathrm {orbitals} }\int d\mathbf {r} \ |{\dot {\psi }}_{i}(\mathbf {r} ,t)|^{2}\right)-E\left[\{\psi _{i}\},\{\mathbf {R} _{I}\}\right]+\sum _{ij}\Lambda _{ij}\left(\int d\mathbf {r} \ \psi _{i}\psi _{j}-\delta _{ij}\right),} where μ {\displaystyle \mu } is the fictitious mass parameter; E[{ψi},{RI}] is the Kohn–Sham energy density functional, which outputs energy values when given Kohn–Sham orbitals and nuclear positions. === Orthogonality constraint === ∫ d r ψ i ∗ ( r , t ) ψ j ( r , t ) = δ i j , {\displaystyle \int d\mathbf {r} \ \psi _{i}^{}(\mathbf {r} ,t)\psi _{j}(\mathbf {r} ,t)=\delta _{ij},} where δij is the Kronecker delta. === Equations of motion === The equations of motion are obtained by finding the stationary point of the Lagrangian under variations of ψi and RI, with the orthogonality constraint. M I R ¨ I = − ∇ I E [ { ψ i } , { R I } ] {\displaystyle M_{I}{\ddot {\mathbf {R} }}_{I}=-\nabla _{I}\,E\left[\{\psi _{i}\},\{\mathbf {R} _{I}\}\right]} μ ψ ¨ i ( r , t ) = − δ E δ ψ i ∗ ( r , t ) + ∑ j Λ i j ψ j ( r , t ) , {\displaystyle \mu {\ddot {\psi }}_{i}(\mathbf {r} ,t)=-{\frac {\delta E}{\delta \psi _{i}^{}(\mathbf {r} ,t)}}+\sum _{j}\Lambda _{ij}\psi _{j}(\mathbf {r} ,t),} where Λij is a Lagrangian multiplier matrix to comply with the orthonormality constraint. === Born–Oppenheimer limit === In the formal limit where μ → 0, the equations of motion approach Born–Oppenheimer molecular dynamics. == Software packages == There are a number of software packages available for performing AIMD simulations. Some of the most widely used packages include: CP2K: an open-source software package for AIMD. Quantum Espresso: an open-source package for performing DFT calculations. It includes a module for AIMD. VASP: a commercial software package for performing DFT calculations. It includes a module for AIMD. Gaussian: a commercial software package that can perform AIMD. NWChem: an open-source software package for AIMD. LAMMPS: an open-source software package for performing classical and ab initio MD simulations. SIESTA: an open-source software package for AIMD. ORCA: a general-purpose quantum chemistry package. == Applications == Studying the behavior of water across different environments, such as near a hydrophobic graphene sheet. Investigating the structure and dynamics of liquid water at ambient temperature. Solving the heat transfer problems (heat conduction and thermal radiation), such as in Si/Ge superlattices. Probing the proton transfer along hydrogen-bonds in different environments, such as in 1D water chains inside carbon nanotubes. Evaluating the critical point of crystals, composites, and solid-state materials, such as aluminum. Predicting and modelling different phases and phase transitions, such as in the amorphous phase of the phase-change memory material GeSbTe. Studying the combustion of combustibles, such as lignite-water systems. Measuring th

    Read more →