MarkLogic Server

MarkLogic Server

MarkLogic Server is a document-oriented database developed by MarkLogic. It is a NoSQL multi-model database that evolved from an XML database to natively store JSON documents and RDF triples, the data model for semantics. MarkLogic is designed to be a data hub for operational and analytical data. == History == MarkLogic Server was built to address shortcomings with existing search and data products. The product first focused on using XML as the document markup standard and XQuery as the query standard for accessing collections of documents up to hundreds of terabytes in size. Currently the MarkLogic platform is widely used in publishing, government, finance and other sectors. MarkLogic's customers are mostly Global 2000 companies. == Technology == MarkLogic uses documents without upfront schemas to maintain a flexible data model. In addition to having a flexible data model, MarkLogic uses a distributed, scale-out architecture that can handle hundreds of billions of documents and hundreds of terabytes of data. It has received Common Criteria certification, and has high availability and disaster recovery. MarkLogic is designed to run on-premises and within public or private cloud environments like Amazon Web Services. == Features == Indexing MarkLogic indexes the content and structure of documents including words, phrases, relationships, and values in over 200 languages with tokenization, collation, and stemming for core languages. Functionality includes the ability to toggle range indexes, geospatial indexes, the RDF triple index, and reverse indexes on or off based on your data, the kinds of queries that you will run, and your desired performance. Full-text search MarkLogic supports search across its data and metadata using a word or phrase and incorporates Boolean logic, stemming, wildcards, case sensitivity, punctuation sensitivity, diacritic sensitivity, and search term weighting. Data can be searched using JavaScript, XQuery, SPARQL, and SQL. Semantics MarkLogic uses RDF triples to provide semantics for ease of storing metadata and querying. ACID Unlike other NoSQL databases, MarkLogic maintains ACID consistency for transactions. Replication MarkLogic provides high availability with replica sets. Scalability MarkLogic scales horizontally using sharding. MarkLogic can run over multiple servers, balancing the load or replicating data to keep the system up and running in the event of hardware failure. Security MarkLogic has built in security features such as element-level permissions and data redaction. Optic API for Relational Operations An API that lets developers view their data as documents, graphs or rows. Security MarkLogic provides redaction, encryption, and element-level security (allowing for control on read and write rights on parts of a document). == Applications == Banking Big Data Fraud prevention Insurance Claims Management and Underwriting Master data management Recommendation engines == Licensing == MarkLogic is available under various licensing and delivery models, namely a free Developer or an Essential Enterprise license.[3] Licenses are available from MarkLogic or directly from cloud marketplaces such as Amazon Web Services and Microsoft Azure. == Releases == 2001 – Cerisent XQE 1: ACID transactions, Full-text search, XML Storage, XQuery, Role-based security 2004 – Cerisent XQE 2: Scale-out architecture, Enhanced search (stemming, thesaurus, wildcard), Backup and restore 2005 – MarkLogic Server 3: Continuing search improvements, Content Processing Framework (including PDF, Word, Excel, PPT), Failover 2008 – MarkLogic Server 4: Geospatial search, entity extraction, advanced XQuery, performance, scalability enhancements, auditing 2011 – MarkLogic Server 5: Flexible replication / DDIL, real-time indexing, advanced search, improved analytics, concurrency enhancements 2012 – MarkLogic Server 6: REST and Java APIs, App Builder, enhanced UI, improved search 2013 – MarkLogic Server 7: Semantic graph, bitemporal data, tiered storage, improved search, better management 2015 – MarkLogic Server 8: A Native JSON storage, Server-side JavaScript, Bitemporal, Node.js client API, Incremental backup, Flexible replication[16] 2017 – MarkLogic Server 9: Data integration across Relational and Non-Relational data, Advanced Encryption, Element Level Security, Redaction 2019 – MarkLogic Server 10: Enhanced Data Hub, improved SQL, security, analytics performance, cloud support 2022 – MarkLogic Server 11: MarkLogic Ops Director (Monitoring and Administration Improvements), expanded PKI 2025 – MarkLogic Server 12: Generative AI and Native Vector Search, Graph Algorithm Support, Virtual TDEs (relational views on the fly)

ACL Data Collection Initiative

The ACL Data Collection Initiative (ACL/DCI) was a project established in 1989 by the Association for Computational Linguistics (ACL) to create and distribute large text and speech corpora for computational linguistics research. The initiative aimed to address the growing need for substantial text databases that could support research in areas such as natural language processing, speech recognition, and computational linguistics. By 1993, the initiative’s activities had effectively ceased, with its functions and datasets absorbed by the Linguistic Data Consortium (LDC), which was founded in 1992. == Objectives == The ACL/DCI had several key objectives: To acquire a large and diverse text corpus from various sources To transform the collected texts into a common format based on the Standard Generalized Markup Language (SGML) To make the corpus available for scientific research at low cost with minimal restrictions To provide a common database that would allow researchers to replicate or extend published results To reduce duplication of effort among researchers in obtaining and preparing text data These objectives were designed to address the growing demand for very large amounts of text arising from applications in recognition and analysis of text and speech. Its core objective was to "oversee the acquisition and preparation of a large text corpus to be made available for scientific research at cost and without royalties". == History == By the late 1980s, researchers in computational linguistics and speech recognition faced a significant problem: the lack of large-scale, accessible text corpora for developing statistical models and testing algorithms. Existing generally available text databases were too small to meet the needs of developing applications in text and speech recognition. The initiative was formed to meet this need by collecting, standardizing, and distributing large quantities of text data with minimal restrictions for scientific research. As stated by Liberman (1990), "research workers have been severely hampered by the lack of appropriate materials, and specially by the lack of a large enough body of text on which published results can be replicated or extended by others." The ACL/DCI committee was established in February 1989. The committee included members from academic and industrial research laboratories in the United States and Europe. The initiative was chaired by Mark Liberman from the University of Pennsylvania (formerly of AT&T Bell Laboratories). Other committee members included representatives from organizations such as Bellcore, IBM T.J. Watson Research Center, Cambridge University, Virginia Polytechnic Institute & State University, Northeastern University, University of Pennsylvania, SRI International, MCC, Xerox PARC, ISSCO, and University of Pisa. The project operated initially without dedicated funding, relying on volunteer efforts from committee members and their affiliated institutions. Key supporters included AT&T Bell Labs, Bellcore, IBM, Xerox, and the University of Pennsylvania, which allowed the use of their computing facilities for ACL/DCI-related work. Previously running on volunteer effort pro bono, in 1991, it obtained funding from General Electric and the National Science Foundation (IRI-9113530). == Data == As of 1990, the ACL/DCI had collected hundreds of millions of words of diverse text. The collection included: Wall Street Journal articles (25 to 50 million words); Canadian Hansard (parliamentary records) in parallel English and French versions: cleaned-up English Hansard donated by the IBM alignment models group (100 million words), and original Bilingual Hansard (from a different time period) obtained directly (200 million words). Collins English Dictionary (1979 edition), both as fulltext (3 million words) and as various "database" versions, constructed using "typographers' tape" donated by Collins, which were computer tapes containing the structured digital data used to typeset and print the 1979 edition of the dictionary; Emails from ARPANET newsletters for the ACM Special Interest Group on Information Retrieval Forum (IRLIST) and AIList Digest issues distributed over the ARPANET (AILIST) (5 million words), both collected by Edward A. Fox at VIPSU; Articles on networking (2 million words); U.S. Department of Agriculture Extension Service Fact Sheets (>1 million words); 200,000 scientific abstracts of about 1,500 words each from the Department of Energy (25 million words); Archives of the Challenger Investigation Commission, including transcripts of depositions and hearings (2.5 million words); Books from the Library of America, including works by Mark Twain, Eugene O'Neill, Ralph Waldo Emerson, Herman Melville, W.E.B. DuBois, Willa Cather, and Benjamin Franklin (130 books, 20 million words); Public domain books like the King James Bible, Tristram Shandy, The Federalist Papers; Several million words of transcribed radiologists' reports, donated by Francis Ganong at Kurzweil Applied Intelligence Inc (about 5 million words); The Child Language Data Exchange corpus of child language acquisition transcripts; U.S. Department of Justice Justice Retrieval and Inquiry System (JURIS) materials; The Swiss Civil Code in parallel German, French and Italian; Economic reports from the Union Bank of Switzerland, in parallel English, German, French and Italian; About 12K words of administrative policy manuals and 14K words of administrative memos, contributed by Geoff Pullum of U.C.S.C.; Material from various ACM journals and the ACL journal Computational Linguistics; The CSLI publications series: 50-100 reports (8K words each) and 5-10 books (80K words each). The initiative started with North American English text but expanded to include Canadian French and planned to include Japanese, Chinese, and other Asian languages. At least 5 million words from the collection were tagged under the Penn Treebank project, and those tags were distributed by DCI as well. After DCI was absorbed by the LDC, the datasets were curated under LDC. == Format == The ACL/DCI corpus was coded in a standard form based on SGML (Standard Generalized Markup Language, ISO 8879), consistent with the recommendations of the Text Encoding Initiative (TEI), of which the DCI was an affiliated project. The TEI was a joint project of the ACL, the Association for Computers and the Humanities, and the Association for Literary and Linguistic Computing, aiming to provide a common interchange format for literary and linguistic data. The initiative planned to add annotations reflecting consensually approved linguistic features like part of speech and various aspects of syntactic and semantic structure over time. == Examples == As an example of the use of ACL/DCI, consider the Wall Street Journal (WSJ) corpus for speech recognition research. The WSJ corpus was used as the basis for the DARPA Spoken Language System (SLS) community's Continuous Speech Recognition (CSR) Corpus. The WSJ corpus became a standard benchmark for evaluating speech recognition systems and has been used in numerous research papers. The WSJ CSR Corpus provided DARPA with its first general-purpose English, large vocabulary, natural language, high perplexity corpus containing speech (400 hours) and text (47 million words) during 1987–89. The text corpus was 313 MB in size. The text was preprocessed to remove ambiguity in the word sequence that a reader might choose, ensuring that the unread text used to train language models was representative of the spoken test material. The preprocessing included converting numbers into orthographics, expanding abbreviations, resolving apostrophes and quotation marks, and marking punctuation. As another example, the Yarowsky algorithm used bitext data from DCI to train a simple word-sense disambiguation model that was competitive with advanced models trained on smaller datasets. == Distribution == Materials from the ACL/DCI collection were distributed to research groups on a non-commercial basis. By 1990, about 25 research groups and individual researchers had received tapes containing various portions of the collected material. To obtain the data, researchers had to sign an agreement not to redistribute the data or make direct commercial use of it. However, commercial application of "analytical materials" derived from the text, such as statistical tables or grammar rules, was explicitly permitted. The initiative first distributed data via 12-inch reels of 9-track tape, then via CD-ROMs. Each such tape could contain 30 million words compressed via the Lempel-Ziv algorithms. The first CD-ROM distribution was in 1991, funded by Dragon Systems Inc. It contained Collins English Dictionary, WSJ, scientific abstracts provided by the U.S. Department of Energy, and the Penn Treebank.

Common Voice

Common Voice is a crowdsourcing project started by Mozilla to create a free and open speech corpus. The project is supported by volunteers who record sample sentences with a microphone and review recordings of other users. The transcribed sentences are collected in a voice database available under the public domain license CC0. This license ensures that developers can use the database for voice-to-text and text-to-voice applications without restrictions or costs. == Aims == Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents. == Voice database == The first dataset was released in November 2017. More than 20,000 users worldwide had recorded 500 hours of English sentences. In February 2019, the first batch of languages was released for use. This included 18 languages such as English, French, German and Mandarin Chinese, but also less prevalent languages like Welsh and Kabyle. In total, this included almost 1,400 hours of recorded voice data from more than 42,000 contributors. By July 2020 the database had amassed 7,226 hours of voice recordings in 54 languages, 5,591 hours of which had been verified by volunteers. In May 2021, following the work to add Kinyarwanda, the project received a grant to add Kiswahili. At the beginning of 2022, Bengali.AI partnered with Common Voice to launch the "Bangla Speech Recognition" project that aims to make machines understand the Bangla language. 2000 hours of voice was collected. In September 2022, it was announced that the Twi language of Ghana was the 100th language to be added to the database. As of December 2025, Mozilla Common Voice collects voice data for over 250 languages, with the most hours having been collected in English, Catalan, Kinyarwanda, Belarusian and Esperanto.

Common Voice

Common Voice is a crowdsourcing project started by Mozilla to create a free and open speech corpus. The project is supported by volunteers who record sample sentences with a microphone and review recordings of other users. The transcribed sentences are collected in a voice database available under the public domain license CC0. This license ensures that developers can use the database for voice-to-text and text-to-voice applications without restrictions or costs. == Aims == Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents. == Voice database == The first dataset was released in November 2017. More than 20,000 users worldwide had recorded 500 hours of English sentences. In February 2019, the first batch of languages was released for use. This included 18 languages such as English, French, German and Mandarin Chinese, but also less prevalent languages like Welsh and Kabyle. In total, this included almost 1,400 hours of recorded voice data from more than 42,000 contributors. By July 2020 the database had amassed 7,226 hours of voice recordings in 54 languages, 5,591 hours of which had been verified by volunteers. In May 2021, following the work to add Kinyarwanda, the project received a grant to add Kiswahili. At the beginning of 2022, Bengali.AI partnered with Common Voice to launch the "Bangla Speech Recognition" project that aims to make machines understand the Bangla language. 2000 hours of voice was collected. In September 2022, it was announced that the Twi language of Ghana was the 100th language to be added to the database. As of December 2025, Mozilla Common Voice collects voice data for over 250 languages, with the most hours having been collected in English, Catalan, Kinyarwanda, Belarusian and Esperanto.

Synaptic transistor

A synaptic transistor is an electrical device that can learn in ways similar to a neural synapse. It optimizes its own properties for the functions it has carried out in the past. The device mimics the behavior of the property of neurons called spike-timing-dependent plasticity, or STDP. == Structure == Its structure is similar to that of a field effect transistor, where an ionic liquid takes the place of the gate insulating layer between the gate electrode and the conducting channel. That channel is composed of samarium nickelate (SmNiO3, or SNO) rather than the field effect transistor's doped silicon. == Function == A synaptic transistor has a traditional immediate response whose amount of current that passes between the source and drain contacts varies with voltage applied to the gate electrode. It also produces a much slower learned response such that the conductivity of the SNO layer varies in response to the transistor's STDP history, essentially by shuttling oxygen ions between the SNO and the ionic liquid. The analog of strengthening a synapse is to increase the SNO's conductivity, which essentially increases gain. Similarly, weakening a synapse is analogous to decreasing the SNO's conductivity, lowering the gain. The input and output of the synaptic transistor are continuous analog values, rather than digital on-off signals. While the physical structure of the device has the potential to learn from history, it contains no way to bias the transistor to control the memory effect. An external supervisory circuit converts the time delay between input and output into a voltage applied to the ionic liquid that either drives ions into the SNO or removes them. A network of such devices can learn particular responses to "sensory inputs", with those responses being learned through experience rather than explicitly programmed.

Database dump

A database dump contains a record of the table structure and/or the data from a database and is usually in the form of a list of SQL statements ("SQL dump"). A database dump is most often used for backing up a database so that its contents can be restored in the event of data loss. Corrupted databases can often be recovered by analysis of the dump. Database dumps are often published by free content projects, to facilitate reuse, forking, offline use, and long-term digital preservation. Dumps can be transported into environments with Internet blackouts or otherwise restricted Internet access, as well as facilitate local searching of the database using sophisticated tools such as grep.

Absorbing Markov chain

In the mathematical theory of probability, an absorbing Markov chain is a Markov chain in which every state can reach an absorbing state. An absorbing state is a state that, once entered, cannot be left. Like general Markov chains, there can be continuous-time absorbing Markov chains with an infinite state space. However, this article concentrates on the discrete-time discrete-state-space case. == Formal definition == A Markov chain is an absorbing chain if there is at least one absorbing state and it is possible to go from any state to at least one absorbing state in a finite number of steps. In an absorbing Markov chain, a state that is not absorbing is called transient. === Canonical form === Let an absorbing Markov chain with transition matrix P have t transient states and r absorbing states. The rows of P represent sources, while columns represent destinations. By ordering the transient states before the absorbing states, it can be assumed that P has the form P = [ Q R 0 I r ] , {\displaystyle P={\begin{bmatrix}Q&R\\\mathbf {0} &I_{r}\end{bmatrix}},} where Q is a t-by-t matrix, R is a nonzero t-by-r matrix, 0 is an r-by-t zero matrix, and Ir is the r-by-r identity matrix. Thus, Q describes the probability of transitioning from some transient state to another while R describes the probability of transitioning from some transient state to some absorbing state. The probability of transitioning from i to j in exactly k steps is the (i,j)-entry of Pk, further computed below. When considering only transient states, the probability is found in the upper left of Pk, the (i,j)-entry of Qk. == Fundamental matrix == === Expected number of visits to a transient state === A basic property about an absorbing Markov chain is the expected number of visits to a transient state j starting from a transient state i (before being absorbed). This can be established to be given by the (i, j) entry of so-called fundamental matrix N, obtained by summing Qk for all k (from 0 to ∞). It can be proven that N := ∑ k = 0 ∞ Q k = ( I t − Q ) − 1 , {\displaystyle N:=\sum _{k=0}^{\infty }Q^{k}=(I_{t}-Q)^{-1},} where It is the t-by-t identity matrix. The computation of this formula is the matrix equivalent of the geometric series of scalars, ∑ k = 0 ∞ q k = 1 1 − q {\displaystyle {\textstyle \sum }_{k=0}^{\infty }q^{k}={\tfrac {1}{1-q}}} . With the matrix N in hand, also other properties of the Markov chain are easy to obtain. === Expected number of steps before being absorbed === The expected number of steps before being absorbed in any absorbing state, when starting in transient state i can be computed via a sum over transient states. The value is given by the ith entry of the vector t := N 1 , {\displaystyle \mathbf {t} :=N\mathbf {1} ,} where 1 is a length-t column vector whose entries are all 1. === Absorbing probabilities === By induction, P k = [ Q k ( I t − Q k ) N R 0 I r ] . {\displaystyle P^{k}={\begin{bmatrix}Q^{k}&(I_{t}-Q^{k})NR\\\mathbf {0} &I_{r}\end{bmatrix}}.} The probability of eventually being absorbed in the absorbing state j when starting from transient state i is given by the (i,j)-entry of the matrix B := N R {\displaystyle B:=NR} . The number of columns of this matrix equals the number of absorbing states r. An approximation of those probabilities can also be obtained directly from the (i,j)-entry of P k {\displaystyle P^{k}} for a large enough value of k, when i is the index of a transient, and j the index of an absorbing state. This is because ( lim k → ∞ P k ) i , t + j = B i , j {\displaystyle \left(\lim _{k\to \infty }P^{k}\right)_{i,t+j}=B_{i,j}} . === Transient visiting probabilities === The probability of visiting transient state j when starting at a transient state i is the (i,j)-entry of the matrix H := ( N − I t ) ( N dg ) − 1 , {\displaystyle H:=(N-I_{t})(N_{\operatorname {dg} })^{-1},} where Ndg is the diagonal matrix with the same diagonal as N. === Variance on number of transient visits === The variance on the number of visits to a transient state j with starting at a transient state i (before being absorbed) is the (i,j)-entry of the matrix N 2 := N ( 2 N dg − I t ) − N sq , {\displaystyle N_{2}:=N(2N_{\operatorname {dg} }-I_{t})-N_{\operatorname {sq} },} where Nsq is the Hadamard product of N with itself (i.e. each entry of N is squared). === Variance on number of steps === The variance on the number of steps before being absorbed when starting in transient state i is the ith entry of the vector ( 2 N − I t ) t − t sq , {\displaystyle (2N-I_{t})\mathbf {t} -\mathbf {t} _{\operatorname {sq} },} where tsq is the Hadamard product of t with itself (i.e., as with Nsq, each entry of t is squared). == Examples == === String generation === Consider the process of repeatedly flipping a fair coin until the sequence (heads, tails, heads) appears. This process is modeled by an absorbing Markov chain with transition matrix P = [ 1 / 2 1 / 2 0 0 0 1 / 2 1 / 2 0 1 / 2 0 0 1 / 2 0 0 0 1 ] . {\displaystyle P={\begin{bmatrix}1/2&1/2&0&0\\0&1/2&1/2&0\\1/2&0&0&1/2\\0&0&0&1\end{bmatrix}}.} The first state represents the empty string, the second state the string "H", the third state the string "HT", and the fourth state the string "HTH". Although in reality, the coin flips cease after the string "HTH" is generated, the perspective of the absorbing Markov chain is that the process has transitioned into the absorbing state representing the string "HTH" and, therefore, cannot leave. For this absorbing Markov chain, the fundamental matrix is N = ( I − Q ) − 1 = ( [ 1 0 0 0 1 0 0 0 1 ] − [ 1 / 2 1 / 2 0 0 1 / 2 1 / 2 1 / 2 0 0 ] ) − 1 = [ 1 / 2 − 1 / 2 0 0 1 / 2 − 1 / 2 − 1 / 2 0 1 ] − 1 = [ 4 4 2 2 4 2 2 2 2 ] . {\displaystyle {\begin{aligned}N&=(I-Q)^{-1}=\left({\begin{bmatrix}1&0&0\\0&1&0\\0&0&1\end{bmatrix}}-{\begin{bmatrix}1/2&1/2&0\\0&1/2&1/2\\1/2&0&0\end{bmatrix}}\right)^{-1}\\[4pt]&={\begin{bmatrix}1/2&-1/2&0\\0&1/2&-1/2\\-1/2&0&1\end{bmatrix}}^{-1}={\begin{bmatrix}4&4&2\\2&4&2\\2&2&2\end{bmatrix}}.\end{aligned}}} The expected number of steps starting from each of the transient states is t = N 1 = [ 4 4 2 2 4 2 2 2 2 ] [ 1 1 1 ] = [ 10 8 6 ] . {\displaystyle \mathbf {t} =N\mathbf {1} ={\begin{bmatrix}4&4&2\\2&4&2\\2&2&2\end{bmatrix}}{\begin{bmatrix}1\\1\\1\end{bmatrix}}={\begin{bmatrix}10\\8\\6\end{bmatrix}}.} Therefore, the expected number of coin flips before observing the sequence (heads, tails, heads) is 10, the entry for the state representing the empty string. === Games of chance === Games based entirely on chance can be modeled by an absorbing Markov chain. A classic example of this is the ancient Indian board game Snakes and Ladders. The graph on the left plots the probability mass in the lone absorbing state that represents the final square as the transition matrix is raised to larger and larger powers. To determine the expected number of turns to complete the game, compute the vector t as described above and examine tstart, which is approximately 39.2. === Infectious disease testing === Infectious disease testing, either of blood products or in medical clinics, is often taught as an example of an absorbing Markov chain. The public U.S. Centers for Disease Control and Prevention (CDC) model for HIV and for hepatitis B, for example, illustrates the property that absorbing Markov chains can lead to the detection of disease, versus the loss of detection through other means. In the standard CDC model, the Markov chain has five states, a state in which the individual is uninfected, then a state with infected but undetectable virus, a state with detectable virus, and absorbing states of having quit/been lost from the clinic, or of having been detected (the goal). The typical rates of transition between the Markov states are the probability p per unit time of being infected with the virus, w for the rate of window period removal (time until virus is detectable), q for quit/loss rate from the system, and d for detection, assuming a typical rate λ {\displaystyle \lambda } at which the health system administers tests of the blood product or patients in question. It follows that we can "walk along" the Markov model to identify the overall probability of detection for a person starting as undetected, by multiplying the probabilities of transition to each next state of the model as: p ( p + q ) w ( w + q ) d ( d + q ) {\displaystyle {\frac {p}{(p+q)}}{\frac {w}{(w+q)}}{\frac {d}{(d+q)}}} . The subsequent total absolute number of false negative tests—the primary CDC concern—would then be the rate of tests, multiplied by the probability of reaching the infected but undetectable state, times the duration of staying in the infected undetectable state: p ( p + q ) 1 ( w + q ) λ {\displaystyle {\frac {p}{(p+q)}}{\frac {1}{(w+q)}}\lambda } .