AI Assistant Vs AI Agent

AI Assistant Vs AI Agent — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Lost Art-Database

    Lost Art-Database

    The Lost Art-Datenbank is an online database published by the German Lost Art Foundation (Deutsches Zentrum Kulturgutverluste. It contains information on cultural objects looted from Jewish collectors or transferred due to Nazi persecution during the Nazi era. Until 2015, it was managed by the Koordinierungsstelle für Kulturgutverluste (Magdeburg Coordination Office). == Creation == Following the Washington Conference of 1998, and the commitments to provide more transparency regarding looted art, Germany launched the Lost Art Database in 2000 order to help Holocaust victims and their families track down artworks that had been looted from them or lost due to Nazi persecution. == Functionality == The Lost Art Database lists art and books and other cultural objects that were lost, seized, stolen or forceably sold during the Nazi era. The database is divided into search requests from victims' families, heirs or institutions and "found" reports from cultural institutions on items with unresolved provenance gaps from the Nazi periods. The section on reports of finds lists objects that are known to have been unlawfully seized or relocated as a result of the war. In addition, reports are published here on cultural objects for which an uncertain or incomplete provenance may indicate a possible unlawful seizure or war-related relocation. The publication of reports in the Lost Art Internet Database is carried out on behalf of and with the consent of the reporting persons and institutions. The responsibility for the content of the reports lies with these legal or natural persons. There have been controversies over which items should be included in the database. Lost Art is based on the Washington Principles adopted in 1998, which Germany has committed itself to implementing (Joint Declaration, 1999). The Lost Art Database is considered a key resource in the search for looted art and the victims of persecution. Every item in the Lost Art Database has an identifier, known as a Lost Art ID. Proveana is the linked research database. == Other lost art databases == Other countries have launched databases to help identify Nazi looted art. Each database has its own area of focus. The German Lost Art Database allows families or heirs to submit information. Other countries have databases that focus on looted artworks that have not been found or artworks that were repatriated to the national authorities after the defeat of the Nazis but were never returned to their original owners. Other databases have been created for stolen antiquities, looted art from colonial era, art stolen from Syria, Iraq, Ukraine, or from museums or collectors.

    Read more →
  • Key (cryptography)

    Key (cryptography)

    A key in cryptography is a piece of information, usually a string of numbers or letters that are stored in a file, which, when processed through a cryptographic algorithm, can encode or decode cryptographic data. Based on the used method, the key can be different sizes and varieties, but in all cases, the strength of the encryption relies on the security of the key being maintained. A key's security strength is dependent on its algorithm, the size of the key, the generation of the key, and the process of key exchange. == Scope == The key is what is used to encrypt data from plaintext to ciphertext. There are different methods for utilizing keys and encryption. === Symmetric cryptography === Symmetric cryptography refers to the practice of the same key being used for both encryption and decryption. === Asymmetric cryptography === Asymmetric cryptography has separate keys for encrypting and decrypting. These keys are known as the public and private keys, respectively. == Purpose == Since the key protects the confidentiality and integrity of the system, it is important to be kept secret from unauthorized parties. With public key cryptography, only the private key must be kept secret, but with symmetric cryptography, it is important to maintain the confidentiality of the key. Kerckhoff's principle states that the entire security of the cryptographic system relies on the secrecy of the key. == Key sizes == Key size is the number of bits in the key defined by the algorithm. This size defines the upper bound of the cryptographic algorithm's security. The larger the key size, the longer it will take before the key is compromised by a brute force attack. Since perfect secrecy is not feasible for key algorithms, researches are now more focused on computational security. In the past, keys were required to be a minimum of 40 bits in length, however, as technology advanced, these keys were being broken quicker and quicker. As a response, restrictions on symmetric keys were enhanced to be greater in size. Currently, 2048 bit RSA is commonly used, which is sufficient for current systems. However, current RSA key sizes would all be cracked quickly with a powerful quantum computer. "The keys used in public key cryptography have some mathematical structure. For example, public keys used in the RSA system are the product of two prime numbers. Thus public key systems require longer key lengths than symmetric systems for an equivalent level of security. 3072 bits is the suggested key length for systems based on factoring and integer discrete logarithms which aim to have security equivalent to a 128 bit symmetric cipher." == Key generation == To prevent a key from being guessed, keys need to be generated randomly and contain sufficient entropy. The problem of how to safely generate random keys is difficult and has been addressed in many ways by various cryptographic systems. A key can directly be generated by using the output of a Random Bit Generator (RBG), a system that generates a sequence of unpredictable and unbiased bits. A RBG can be used to directly produce either a symmetric key or the random output for an asymmetric key pair generation. Alternatively, a key can also be indirectly created during a key-agreement transaction, from another key or from a password. Some operating systems include tools for "collecting" entropy from the timing of unpredictable operations such as disk drive head movements. For the production of small amounts of keying material, ordinary dice provide a good source of high-quality randomness. == Establishment scheme == The security of a key is dependent on how a key is exchanged between parties. Establishing a secured communication channel is necessary so that outsiders cannot obtain the key. A key establishment scheme (or key exchange) is used to transfer an encryption key among entities. Key agreement and key transport are the two types of a key exchange scheme that are used to be remotely exchanged between entities . In a key agreement scheme, a secret key, which is used between the sender and the receiver to encrypt and decrypt information, is set up to be sent indirectly. All parties exchange information (the shared secret) that permits each party to derive the secret key material. In a key transport scheme, encrypted keying material that is chosen by the sender is transported to the receiver. Either symmetric key or asymmetric key techniques can be used in both schemes. The Diffie–Hellman key exchange and Rivest-Shamir-Adleman (RSA) are the most two widely used key exchange algorithms. In 1976, Whitfield Diffie and Martin Hellman constructed the Diffie–Hellman algorithm, which was the first public key algorithm. The Diffie–Hellman key exchange protocol allows key exchange over an insecure channel by electronically generating a shared key between two parties. On the other hand, RSA is a form of the asymmetric key system which consists of three steps: key generation, encryption, and decryption. Key confirmation delivers an assurance between the key confirmation recipient and provider that the shared keying materials are correct and established. The National Institute of Standards and Technology recommends key confirmation to be integrated into a key establishment scheme to validate its implementations. == Management == Key management concerns the generation, establishment, storage, usage and replacement of cryptographic keys. A key management system (KMS) typically includes three steps of establishing, storing and using keys. The base of security for the generation, storage, distribution, use and destruction of keys depends on successful key management protocols. == Key vs password == A password is a memorized series of characters including letters, digits, and other special symbols that are used to verify identity. It is often produced by a human user or a password management software to protect personal and sensitive information or generate cryptographic keys. Passwords are often created to be memorized by users and may contain non-random information such as dictionary words. On the other hand, a key can help strengthen password protection by implementing a cryptographic algorithm which is difficult to guess or replace the password altogether. A key is generated based on random or pseudo-random data and can often be unreadable to humans. A password is less safe than a cryptographic key due to its low entropy, randomness, and human-readable properties. However, the password may be the only secret data that is accessible to the cryptographic algorithm for information security in some applications such as securing information in storage devices. Thus, a deterministic algorithm called a key derivation function (KDF) uses a password to generate the secure cryptographic keying material to compensate for the password's weakness. Various methods such as adding a salt or key stretching may be used in the generation.

    Read more →
  • Strong secrecy

    Strong secrecy

    Strong secrecy is a term used in formal proof-based cryptography for making propositions about the security of cryptographic protocols. It is a stronger notion of security than syntactic (or weak) secrecy. Strong secrecy is related with the concept of semantic security or indistinguishability used in the computational proof-based approach. Bruno Blanchet provides the following definition for strong secrecy: Strong secrecy means that an adversary cannot see any difference when the value of the secret changes For example, if a process encrypts a message m an attacker can differentiate between different messages, since their ciphertexts will be different. Thus m is not a strong secret. If however, probabilistic encryption were used, m would be a strong secret. The randomness incorporated into the encryption algorithm will yield different ciphertexts for the same value of m.

    Read more →
  • Cryptosystem

    Cryptosystem

    In cryptography, a cryptosystem is a suite of cryptographic algorithms needed to implement a particular security service, such as confidentiality (encryption). Typically, a cryptosystem consists of three algorithms: one for key generation, one for encryption, and one for decryption. The term cipher (sometimes cypher) is often used to refer to a pair of algorithms, one for encryption and one for decryption. Therefore, the term cryptosystem is most often used when the key generation algorithm is important. For this reason, the term cryptosystem is commonly used to refer to public key techniques; however both "cipher" and "cryptosystem" are used for symmetric key techniques. == Formal definition == Mathematically, a cryptosystem or encryption scheme can be defined as a tuple ( P , C , K , E , D ) {\displaystyle ({\mathcal {P}},{\mathcal {C}},{\mathcal {K}},{\mathcal {E}},{\mathcal {D}})} with the following properties. P {\displaystyle {\mathcal {P}}} is a set called the "plaintext space". Its elements are called plaintexts. C {\displaystyle {\mathcal {C}}} is a set called the "ciphertext space". Its elements are called ciphertexts. K {\displaystyle {\mathcal {K}}} is a set called the "key space". Its elements are called keys. E = { E k : k ∈ K } {\displaystyle {\mathcal {E}}=\{E_{k}:k\in {\mathcal {K}}\}} is a set of functions E k : P → C {\displaystyle E_{k}:{\mathcal {P}}\rightarrow {\mathcal {C}}} . Its elements are called "encryption functions". D = { D k : k ∈ K } {\displaystyle {\mathcal {D}}=\{D_{k}:k\in {\mathcal {K}}\}} is a set of functions D k : C → P {\displaystyle D_{k}:{\mathcal {C}}\rightarrow {\mathcal {P}}} . Its elements are called "decryption functions". For each e ∈ K {\displaystyle e\in {\mathcal {K}}} , there is d ∈ K {\displaystyle d\in {\mathcal {K}}} such that D d ( E e ( p ) ) = p {\displaystyle D_{d}(E_{e}(p))=p} for all p ∈ P {\displaystyle p\in {\mathcal {P}}} . Note; typically this definition is modified in order to distinguish an encryption scheme as being either a symmetric-key or public-key type of cryptosystem. == Examples == A classical example of a cryptosystem is the Caesar cipher. A more contemporary example is the RSA cryptosystem. Another example of a cryptosystem is the Advanced Encryption Standard (AES). AES is a widely used symmetric encryption algorithm that has become the standard for securing data in various applications. Paillier cryptosystem is another example used to preserve and maintain privacy and sensitive information. It is featured in electronic voting, electronic lotteries and electronic auctions.

    Read more →
  • Natural language understanding

    Natural language understanding

    Natural language understanding (NLU) or natural language interpretation (NLI) is a subset of natural language processing in artificial intelligence that deals with machine reading comprehension. NLU has been considered an AI-hard problem. There is considerable commercial interest in the field because of its application to automated reasoning, machine translation, question answering, news-gathering, text categorization, voice-activation, archiving, and large-scale content analysis. == History == The program STUDENT, written in 1964 by Daniel Bobrow for his PhD dissertation at MIT, is one of the earliest known attempts at NLU by a computer. Eight years after John McCarthy coined the term artificial intelligence, Bobrow's dissertation (titled Natural Language Input for a Computer Problem Solving System) showed how a computer could understand simple natural language input to solve algebra word problems. A year later, in 1965, Joseph Weizenbaum at MIT wrote ELIZA, an interactive program that carried on a dialogue in English on any topic, the most popular being psychotherapy. ELIZA worked by simple parsing and substitution of key words into canned phrases and Weizenbaum sidestepped the problem of giving the program a database of real-world knowledge or a rich lexicon. Yet ELIZA gained surprising popularity as a toy project and can be seen as a very early precursor to current commercial systems such as those used by Ask.com. In 1969, Roger Schank at Stanford University introduced the conceptual dependency theory for NLU. This model, partially influenced by the work of Sydney Lamb, was extensively used by Schank's students at Yale University, such as Robert Wilensky, Wendy Lehnert, and Janet Kolodner. In 1970, William A. Woods introduced the augmented transition network (ATN) to represent natural language input. Instead of phrase structure rules ATNs used an equivalent set of finite-state automata that were called recursively. ATNs and their more general format called "generalized ATNs" continued to be used for a number of years. In 1971, Terry Winograd finished writing SHRDLU for his PhD thesis at MIT. SHRDLU could understand simple English sentences in a restricted world of children's blocks to direct a robotic arm to move items. The successful demonstration of SHRDLU provided significant momentum for continued research in the field. Winograd continued to be a major influence in the field with the publication of his book Language as a Cognitive Process. At Stanford, Winograd would later advise Larry Page, who co-founded Google. In the 1970s and 1980s, the natural language processing group at SRI International continued research and development in the field. A number of commercial efforts based on the research were undertaken, e.g., in 1982 Gary Hendrix formed Symantec Corporation originally as a company for developing a natural language interface for database queries on personal computers. However, with the advent of mouse-driven graphical user interfaces, Symantec changed direction. A number of other commercial efforts were started around the same time, e.g., Larry R. Harris at the Artificial Intelligence Corporation and Roger Schank and his students at Cognitive Systems Corp. In 1983, Michael Dyer developed the BORIS system at Yale which bore similarities to the work of Roger Schank and W. G. Lehnert. The third millennium saw the introduction of systems using machine learning for text classification, such as the IBM Watson. However, experts debate how much "understanding" such systems demonstrate: e.g., according to John Searle, Watson did not even understand the questions. John Ball, cognitive scientist and inventor of the Patom Theory, supports this assessment. Natural language processing has made inroads for applications to support human productivity in service and e-commerce, but this has largely been made possible by narrowing the scope of the application. There are thousands of ways to request something in a human language that still defies conventional natural language processing. According to Wibe Wagemans, "To have a meaningful conversation with machines is only possible when we match every word to the correct meaning based on the meanings of the other words in the sentence – just like a 3-year-old does without guesswork." == Scope and context == The umbrella term "natural language understanding" can be applied to a diverse set of computer applications, ranging from small, relatively simple tasks such as short commands issued to robots, to highly complex endeavors such as the full comprehension of newspaper articles or poetry passages. Many real-world applications fall between the two extremes, for instance text classification for the automatic analysis of emails and their routing to a suitable department in a corporation does not require an in-depth understanding of the text, but needs to deal with a much larger vocabulary and more diverse syntax than the management of simple queries to database tables with fixed schemata. Throughout the years various attempts at processing natural language or English-like sentences presented to computers have taken place at varying degrees of complexity. Some attempts have not resulted in systems with deep understanding, but have helped overall system usability. For example, Wayne Ratliff originally developed the Vulcan program with an English-like syntax to mimic the English speaking computer in Star Trek. Vulcan later became the dBase system whose easy-to-use syntax effectively launched the personal computer database industry. Systems with an easy-to-use or English-like syntax are, however, quite distinct from systems that use a rich lexicon and include an internal representation (often as first order logic) of the semantics of natural language sentences. Hence the breadth and depth of "understanding" aimed at by a system determine both the complexity of the system (and the implied challenges) and the types of applications it can deal with. The "breadth" of a system is measured by the sizes of its vocabulary and grammar. The "depth" is measured by the degree to which its understanding approximates that of a fluent native speaker. At the narrowest and shallowest, English-like command interpreters require minimal complexity, but have a small range of applications. Narrow but deep systems explore and model mechanisms of understanding, but they still have limited application. Systems that attempt to understand the contents of a document such as a news release beyond simple keyword matching and to judge its suitability for a user are broader and require significant complexity, but they are still somewhat shallow. Systems that are both very broad and very deep are beyond the current state of the art. == Components and architecture == Regardless of the approach used, most NLU systems share some common components. The system needs a lexicon of the language and a parser and grammar rules to break sentences into an internal representation. The construction of a rich lexicon with a suitable ontology requires significant effort, e.g., the Wordnet lexicon required many person-years of effort. The system also needs theory from semantics to guide the comprehension. The interpretation capabilities of a language-understanding system depend on the semantic theory it uses. Competing semantic theories of language have specific trade-offs in their suitability as the basis of computer-automated semantic interpretation. These range from naive semantics or stochastic semantic analysis to the use of pragmatics to derive meaning from context. Semantic parsers convert natural-language texts into formal meaning representations. Advanced applications of NLU also attempt to incorporate logical inference within their framework. This is generally achieved by mapping the derived meaning into a set of assertions in predicate logic, then using logical deduction to arrive at conclusions. Therefore, systems based on functional languages such as Lisp need to include a subsystem to represent logical assertions, while logic-oriented systems such as those using the language Prolog generally rely on an extension of the built-in logical representation framework. The management of context in NLU can present special challenges. A large variety of examples and counter examples have resulted in multiple approaches to the formal modeling of context, each with specific strengths and weaknesses.

    Read more →
  • Trace zero cryptography

    Trace zero cryptography

    First proposed by Gerhard Frey in 1998, trace zero cryptography refers to the use of trace zero varieties (TZV) for cryptographic purpose. Trace zero varieties are subgroups of the divisor class group on a low genus hyperelliptic curve defined over a finite field. These groups can be used to establish asymmetric cryptography using the discrete logarithm problem as cryptographic primitive. Trace zero varieties feature a better scalar multiplication performance than elliptic curves. This allows fast arithmetic in these groups, which can speed up the calculations with a factor 3 compared with elliptic curves and hence speed up the cryptosystem. Another advantage is that for groups of cryptographically relevant size, the order of the group can simply be calculated using the characteristic polynomial of the Frobenius endomorphism. This is not the case, for example, in elliptic curve cryptography when the group of points of an elliptic curve over a prime field is used for cryptographic purpose. However, to represent an element of the trace zero variety more bits are needed compared with elements of elliptic or hyperelliptic curves. Another disadvantage is the fact that it is possible to reduce the security of the TZV of 1/6th of the bit length using cover attack. == Mathematical background == A hyperelliptic curve C of genus g over a prime field F q {\displaystyle \mathbb {F} _{q}} where q = pn (p prime) of odd characteristic is defined as C : y 2 + h ( x ) y = f ( x ) , {\displaystyle C:~y^{2}+h(x)y=f(x),} where f monic, deg(f) = 2g + 1 and deg(h) ≤ g. The curve has at least one F q {\displaystyle \mathbb {F} _{q}} -rational Weierstraßpoint. The Jacobian variety J C ( F q n ) {\displaystyle J_{C}(\mathbb {F} _{q^{n}})} of C is for all finite extension F q n {\displaystyle \mathbb {F} _{q^{n}}} isomorphic to the ideal class group Cl ⁡ ( C / F q n ) {\displaystyle \operatorname {Cl} (C/\mathbb {F} _{q^{n}})} . With the Mumford's representation it is possible to represent the elements of J C ( F q n ) {\displaystyle J_{C}(\mathbb {F} _{q^{n}})} with a pair of polynomials [u, v], where u, v ∈ F q n [ x ] {\displaystyle \mathbb {F} _{q^{n}}[x]} . The Frobenius endomorphism σ is used on an element [u, v] of J C ( F q n ) {\displaystyle J_{C}(\mathbb {F} _{q^{n}})} to raise the power of each coefficient of that element to q: σ([u, v]) = [uq(x), vq(x)]. The characteristic polynomial of this endomorphism has the following form: χ ( T ) = T 2 g + a 1 T 2 g − 1 + ⋯ + a g T g + ⋯ + a 1 q g − 1 T + q g , {\displaystyle \chi (T)=T^{2g}+a_{1}T^{2g-1}+\cdots +a_{g}T^{g}+\cdots +a_{1}q^{g-1}T+q^{g},} where ai in Z {\displaystyle \mathbb {Z} } With the Hasse–Weil theorem it is possible to receive the group order of any extension field F q n {\displaystyle \mathbb {F} _{q^{n}}} by using the complex roots τi of χ(T): | J C ( F q n ) | = ∏ i = 1 2 g ( 1 − τ i n ) {\displaystyle |J_{C}(\mathbb {F} _{q^{n}})|=\prod _{i=1}^{2g}(1-\tau _{i}^{n})} Let D be an element of the J C ( F q n ) {\displaystyle J_{C}(\mathbb {F} _{q^{n}})} of C, then it is possible to define an endomorphism of J C ( F q n ) {\displaystyle J_{C}(\mathbb {F} _{q^{n}})} , the so-called trace of D: Tr ⁡ ( D ) = ∑ i = 0 n − 1 σ i ( D ) = D + σ ( D ) + ⋯ + σ n − 1 ( D ) {\displaystyle \operatorname {Tr} (D)=\sum _{i=0}^{n-1}\sigma ^{i}(D)=D+\sigma (D)+\cdots +\sigma ^{n-1}(D)} Based on this endomorphism one can reduce the Jacobian variety to a subgroup G with the property, that every element is of trace zero: G = { D ∈ J C ( F q n ) | Tr ( D ) = 0 } , ( 0 neutral element in J C ( F q n ) {\displaystyle G=\{D\in J_{C}(\mathbb {F} _{q^{n}})~|~{\text{Tr}}(D)={\textbf {0}}\},~~~({\textbf {0}}{\text{ neutral element in }}J_{C}(\mathbb {F} _{q^{n}})} G is the kernel of the trace endomorphism and thus G is a group, the so-called trace zero (sub)variety (TZV) of J C ( F q n ) {\displaystyle J_{C}(\mathbb {F} _{q^{n}})} . The intersection of G and J C ( F q ) {\displaystyle J_{C}(\mathbb {F} _{q})} is produced by the n-torsion elements of J C ( F q ) {\displaystyle J_{C}(\mathbb {F} _{q})} . If the greatest common divisor gcd ( n , | J C ( F q ) | ) = 1 {\displaystyle \gcd(n,|J_{C}(\mathbb {F} _{q})|)=1} the intersection is empty and one can compute the group order of G: | G | = | J C ( F q n ) | | J C ( F q ) | = ∏ i = 1 2 g ( 1 − τ i n ) ∏ i = 1 2 g ( 1 − τ i ) {\displaystyle |G|={\dfrac {|J_{C}(\mathbb {F} _{q^{n}})|}{|J_{C}(\mathbb {F} _{q})|}}={\dfrac {\prod _{i=1}^{2g}(1-\tau _{i}^{n})}{\prod _{i=1}^{2g}(1-\tau _{i})}}} The actual group used in cryptographic applications is a subgroup G0 of G of a large prime order l. This group may be G itself. There exist three different cases of cryptographical relevance for TZV: g = 1, n = 3 g = 1, n = 5 g = 2, n = 3 == Arithmetic == The arithmetic used in the TZV group G0 based on the arithmetic for the whole group J C ( F q n ) {\displaystyle J_{C}(\mathbb {F} _{q^{n}})} , But it is possible to use the Frobenius endomorphism σ to speed up the scalar multiplication. This can be archived if G0 is generated by D of order l then σ(D) = sD, for some integers s. For the given cases of TZV s can be computed as follows, where ai come from the characteristic polynomial of the Frobenius endomorphism : For g = 1, n = 3: s = q − 1 1 − a 1 mod ℓ {\displaystyle s={\dfrac {q-1}{1-a_{1}}}{\bmod {\ell }}} For g = 1, n = 5: s = q 2 − q − a 1 2 q + a 1 q + 1 q − 2 a 1 q + a 1 3 − a 1 2 + a 1 − 1 mod ℓ {\displaystyle s={\dfrac {q^{2}-q-a_{1}^{2}q+a_{1}q+1}{q-2a_{1}q+a_{1}^{3}-a_{1}^{2}+a_{1}-1}}{\bmod {\ell }}} For g = 2, n = 3: s = − q 2 − a 2 + a 1 a 1 q − a 2 + 1 mod ℓ {\displaystyle s=-{\dfrac {q^{2}-a_{2}+a_{1}}{a_{1}q-a_{2}+1}}{\bmod {\ell }}} Knowing this, it is possible to replace any scalar multiplication mD (|m| ≤ l/2) with: m 0 D + m 1 σ ( D ) + ⋯ + m n − 1 σ n − 1 ( D ) , where m i = O ( ℓ 1 / ( n − 1 ) ) = O ( q g ) {\displaystyle m_{0}D+m_{1}\sigma (D)+\cdots +m_{n-1}\sigma ^{n-1}(D),~~~~{\text{where }}m_{i}=O(\ell ^{1/(n-1)})=O(q^{g})} With this trick the multiple scalar product can be reduced to about 1/(n − 1)th of doublings necessary for calculating mD, if the implied constants are small enough. == Security == The security of cryptographic systems based on trace zero subvarieties according to the results of the papers comparable to the security of hyper-elliptic curves of low genus g' over F p ′ {\displaystyle \mathbb {F} _{p'}} , where p' ~ (n − 1)(g/g' ) for |G| ~128 bits. For the cases where n = 3, g = 2 and n = 5, g = 1 it is possible to reduce the security for at most 6 bits, where |G| ~ 2256, because one can not be sure that G is contained in a Jacobian of a curve of genus 6. The security of curves of genus 4 for similar fields are far less secure. == Cover attack on a trace zero crypto-system == The attack published in shows, that the DLP in trace zero groups of genus 2 over finite fields of characteristic diverse than 2 or 3 and a field extension of degree 3 can be transformed into a DLP in a class group of degree 0 with genus of at most 6 over the base field. In this new class group the DLP can be attacked with the index calculus methods. This leads to a reduction of the bit length 1/6th.

    Read more →
  • InfiniBand

    InfiniBand

    InfiniBand (IB) is a computer networking standard used in high-performance computing that features very high throughput and very low latency. It is used for data interconnect both among and within computers. InfiniBand is also used as either a direct or switched interconnect between servers and storage systems, as well as an interconnect between storage systems. It is designed to be scalable and uses a switched fabric network topology. Between 2014 and June 2016, it was the most commonly used interconnect in the TOP500 list of supercomputers. Mellanox (acquired by Nvidia) manufactures InfiniBand host bus adapters and network switches, which are used by large computer system and database vendors in their product lines. As a computer cluster interconnect, IB competes with Ethernet, Fibre Channel, and Intel Omni-Path. The technology is promoted by the InfiniBand Trade Association. == History == InfiniBand originated in 1999 from the merger of two competing designs: Future I/O and Next Generation I/O (NGIO). NGIO was led by Intel, with a specification released in 1998, and joined by Sun Microsystems and Dell. Future I/O was backed by Compaq, IBM, and Hewlett-Packard. This led to the formation of the InfiniBand Trade Association (IBTA), which included both sets of hardware vendors as well as software vendors such as Microsoft. At the time it was thought some of the more powerful computers were approaching the interconnect bottleneck of the PCI bus, in spite of upgrades like PCI-X. Version 1.0 of the InfiniBand Architecture Specification was released in 2000. Initially the IBTA vision for IB was simultaneously a replacement for PCI in I/O, Ethernet in the machine room, cluster interconnect and Fibre Channel. IBTA also envisaged decomposing server hardware on an IB fabric. Mellanox had been founded in 1999 to develop NGIO technology, but by 2001 shipped an InfiniBand product line called InfiniBridge at 10 Gbit/second speeds. Following the burst of the dot-com bubble there was hesitation in the industry to invest in such a far-reaching technology jump. By 2002, Intel announced that instead of shipping IB integrated circuits ("chips"), it would focus on developing PCI Express, and Microsoft discontinued IB development in favor of extending Ethernet. Sun Microsystems and Hitachi continued to support IB. In 2003, the System X supercomputer built at Virginia Tech used InfiniBand in what was estimated to be the third largest computer in the world at the time. The OpenIB Alliance (later renamed OpenFabrics Alliance) was founded in 2004 to develop an open set of software for the Linux kernel. By February, 2005, the support was accepted into the 2.6.11 Linux kernel. In November 2005 storage devices finally were released using InfiniBand from vendors such as Engenio. Cisco, desiring to keep technology superior to Ethernet off the market, adopted a "buy to kill" strategy. Cisco successfully killed InfiniBand switching companies such as Topspin via acquisition. Of the top 500 supercomputers in 2009, Gigabit Ethernet was the internal interconnect technology in 259 installations, compared with 181 using InfiniBand. In 2010, market leaders Mellanox and Voltaire merged, leaving just one other IB vendor, QLogic, primarily a Fibre Channel vendor. At the 2011 International Supercomputing Conference, links running at about 56 gigabits per second (known as FDR, see below), were announced and demonstrated by connecting booths in the trade show. In 2012, Intel acquired QLogic's InfiniBand technology, leaving only one independent supplier. By 2014, InfiniBand was the most popular internal connection technology for supercomputers, although within two years, 10 Gigabit Ethernet started displacing it. In 2016, it was reported that Oracle Corporation (an investor in Mellanox) might engineer its own InfiniBand hardware. In 2019 Nvidia acquired Mellanox, the last independent supplier of InfiniBand products. == Specification == Specifications are published by the InfiniBand trade association. === Performance === Original names for speeds were single-data rate (SDR), double-data rate (DDR) and quad-data rate (QDR) as given below. Subsequently, other three-letter initialisms were added for even higher data rates. Notes Each link is duplex. Links can be aggregated: most systems use a 4 link/lane connector (QSFP). HDR often makes use of 2x links (aka HDR100, 100 Gb link using 2 lanes of HDR, while still using a QSFP connector). NDR introduced OSFP connectors which host one or two links at 2x (NDR200) or 4x (NDR400). They are not logically configured as a single 8x link, even when connecting switches together with an OSFP cable. InfiniBand provides remote direct memory access (RDMA) capabilities for low CPU overhead. === Topology === InfiniBand uses a switched fabric topology, as opposed to early shared medium Ethernet. All transmissions begin or end at a channel adapter. Each processor contains a host channel adapter (HCA) and each peripheral has a target channel adapter (TCA). These adapters can also exchange information for security or quality of service (QoS). === Messages === InfiniBand transmits data in packets of up to 4 KB that are taken together to form a message. A message can be: a remote direct memory access read or write a channel send or receive a transaction-based operation (that can be reversed) a multicast transmission an atomic operation === Physical interconnection === In addition to a board form factor connection, it can use both active and passive copper (up to 10 meters) and optical fiber cable (up to 10 km). QSFP connectors are used. The InfiniBand Association also specified the CXP connector system for speeds up to 120 Gbit/s over copper, active optical cables, and optical transceivers using parallel multi-mode fiber cables with 24-fiber MPO connectors. === Software interfaces === Mellanox operating system support is available for Solaris, FreeBSD, Red Hat Enterprise Linux, SUSE Linux Enterprise Server (SLES), Windows, HP-UX, VMware ESX, and AIX. InfiniBand has no specific standard application programming interface (API). The standard only lists a set of verbs such as ibv_open_device or ibv_post_send, which are abstract representations of functions or methods that must exist. The syntax of these functions is left to the vendors. Sometimes for reference this is called the verbs API. The de facto standard software is developed by OpenFabrics Alliance and called the Open Fabrics Enterprise Distribution (OFED). It is released under two licenses GPL2 or BSD license for Linux and FreeBSD, and as Mellanox OFED for Windows (product names: WinOF / WinOF-2; attributed as host controller driver for matching specific ConnectX 3 to 5 devices) under a choice of BSD license for Windows. It has been adopted by most of the InfiniBand vendors, for Linux, FreeBSD, and Microsoft Windows. IBM refers to a software library called libibverbs, for its AIX operating system, as well as "AIX InfiniBand verbs". The Linux kernel support was integrated in 2005 into the kernel version 2.6.11. === Ethernet over InfiniBand === Ethernet over InfiniBand, abbreviated to EoIB, is an Ethernet implementation over the InfiniBand protocol and connector technology. EoIB enables multiple Ethernet bandwidths varying on the InfiniBand (IB) version. Ethernet's implementation of the Internet Protocol Suite, usually referred to as TCP/IP, is different in some details compared to the direct InfiniBand protocol in IP over IB (IPoIB).

    Read more →
  • Tumblr

    Tumblr

    Tumblr ( TUM-blər) is a microblogging and social media platform founded by David Karp in 2007 and operated by American company Tumblr, Inc., a subsidiary of Automattic. The service allows users to post multimedia and other content to a short-form blog. It has attracted significant attention and controversy for hosting a wide range of progressive user-generated content. == History == === Beginnings (2006–2012) === Development of Tumblr began in 2006 during a two-week gap between contracts at David Karp's software consulting company, Davidville. Karp had been interested in tumblelogs (short-form blogs, hence the name Tumblr) for some time and was waiting for one of the established blogging platforms to introduce their own tumblelogging platform. As none had done so after a year of waiting, Karp and developer Marco Arment began working on their own platform. Tumblr was launched in February 2007, and within two weeks had gained 75,000 users. Arment left the company in September 2010 to work on Instapaper. In June 2012, Tumblr featured its first major brand advertising campaign in collaboration with Adidas, who launched an official soccer Tumblr blog and bought ad placements on the user dashboard. This launch came only two months after Tumblr announced it would be moving towards paid advertising on its site. === Ownership by Yahoo! (2013–2018) === On May 20, 2013, it was announced that Yahoo and Tumblr had reached an agreement for Yahoo! Inc. to acquire Tumblr for $1.1 billion in cash. Many of Tumblr's users were unhappy with the news, causing some to start a petition, achieving nearly 170,000 signatures. David Karp remained CEO and the deal was finalized on June 20, 2013. Advertising sales goals were not met and in 2016 Yahoo wrote down $712 million of Tumblr's value. Verizon Communications acquired Yahoo in June 2017, and placed Yahoo and Tumblr under its Oath subsidiary. Karp announced in November 2017 that he would be leaving Tumblr by the end of the year. Jeff D'Onofrio, Tumblr's president and COO, took over leading the company. The site, along with the rest of the Oath division (renamed Verizon Media Group in 2019), continued to struggle under Verizon. In March 2019, Similarweb estimated Tumblr had lost 30% of its user traffic since December 2018, when the site had introduced a stricter content policy with heavier restrictions on adult content (which had been a notable draw to the service). In May 2019, it was reported that Verizon was considering selling the site due to its continued struggles since the purchase (as it had done with another Yahoo property, Flickr, via its sale to SmugMug). Following this news, Pornhub's vice president publicly expressed interest in purchasing Tumblr, with a promise to reinstate the previous adult content policies. === Automattic (2019–present) === On August 12, 2019, Verizon Media announced that it would sell Tumblr to Automattic, the operator of blog service WordPress.com and corporate backer of the open source blog software of the same name. The sale was for an undisclosed amount, but Axios reported that the sale price was less than $3 million, less than 0.3% of Yahoo's original purchase price. Automattic CEO Matt Mullenweg stated that the site will operate as a complementary service to WordPress.com, and that there were no plans to reverse the content policy decisions made during Verizon ownership. In November 2022, Mullenweg stated that Tumblr will add support for the decentralized social networking protocol ActivityPub. In November 2023, most of Tumblr's product development and marketing teams were transferred to other groups within Automattic. Mullenweg stated that focus would shift to core functionality and streamlining existing features. In February 2024, Automattic announced that it would begin selling user data from Tumblr and WordPress.com to Midjourney and OpenAI. Tumblr users are opted-in by default, with an option to opt out. In August 2024, Automattic announced that it would migrate Tumblr's backend to an architecture derived from WordPress, in order to ease development and code sharing between the platforms. The company stated that this migration would not impact the service's user experience and content, and that users "won't even notice a difference from the outside". In January 2025, Mullenweg stated that the migration, once completed, would also "unlock" ActivityPub access for Tumblr, including native support for the company's official ActivityPub plugin for WordPress. In April 2025, Automattic announced layoffs for 16% of its workforce, reducing a large portion of Tumblr staff. On March 16, 2026, Tumblr implemented a change to how notes were assigned to reblogs, making it more similar to sites like Twitter and Bluesky. The change was rolled back the next day after heavy user backlash. == Features == === Blog management === Dashboard: The dashboard is the primary tool for the typical Tumblr user. It is a live feed of recent posts from blogs that they follow. Through the dashboard, users are able to comment, reblog, and like posts from other blogs that appear on their dashboard. The dashboard allows the user to upload text posts, images, videos, quotes, or links to their blog with a click of a button displayed at the top of the dashboard. Users are also able to connect their blogs to their Twitter and Facebook accounts, so that whenever they make a post, it will also be sent as a tweet and a status update. As of June 2022, users can also turn off reblogs on specific posts through the dashboard. Queue: Users are able to set up a schedule to delay posts that they make. They can spread their posts over several hours or even days. Tags: Users can help their audience find posts about certain topics by adding tags. If someone were to upload a picture to their blog and wanted their viewers to find pictures, they would add the tag #picture, and their viewers could use that word to search for posts with the tag #picture. HTML editing: Tumblr allows users to edit their blog's theme using HTML to control the appearance of their blog. Custom themes are able to be shared and used by other users, or sold. Custom domains: Tumblr allows users to use custom domains for their blogs. Users must purchase a domain from Tumblr Domains, an in-house registrar that provides domains that can only be used with Tumblr unless removed from the user's blog and transferred to another registrar. Blogs previously were able to be linked with any domain/subdomain from any registrar, however following the introduction of the Tumblr Domains service, now requires you to purchase a domain directly from Tumblr to be used with a blog. Users who kept their blogs connected to a domain after the introduction got to keep their custom domain, as long as they do not disconnect it from Tumblr or let the domain expire. === Tags === The tagging system on the website operates on a hybrid tagging system, involving both self-tagging (user write their own tags on their posts) and an auto-manual function (the website will recommend popular tags and ones that the user has used before.) Only the first 20 tags added to any post will be indexed by the site. The tags are prefaced by a hashtag and separated by commas, and spaces and special characters are allowed, but only up to 140 characters total per tag. There are two main types used by Tumblr users: descriptive tagging, and opinion or commentary tagging. Descriptive tags are usually introduced by the original poster, and describe what is in the post (e.g. #art, #sky). These are important for the original poster to use, so their post will be indexed and searchable by others wishing to view that subject of content. Tags used as a form of communication are unique to Tumblr, and are typically more personal, expressing opinions, reactions, meta-commentary, background information, and more. Instead of adding onto the reblogged post (with their comments becoming an addition to each subsequent reblog from them) a user may add their comments in the tags, not changing the content or appearance of the original post in any way. Not all users choose to use tags this way, but those who do use tags for commentary may prefer it over adding a comment on the actual post. === Mobile === With Tumblr's 2009 acquisition of Tumblerette, an iOS application created by Jeff Rock and Garrett Ross, the service launched its official iPhone app. The site became available to BlackBerry smartphones on April 17, 2010, via a Mobelux application in BlackBerry World. In June 2012, Tumblr released a new version of its iOS app, Tumblr 3.0, allowing support for Spotify integration, hi-res images and offline access. An app for Android is also available. A Windows Phone app was released on April 23, 2013. An app for Google Glass was released on May 16, 2013. === Inbox and messaging === Tumblr blogs have the option to allow users to submit questions, either as themselves or anonymously, to the blog for a response. Tumblr

    Read more →
  • Hybrid machine translation

    Hybrid machine translation

    Hybrid machine translation is a method of machine translation that is characterized by the use of multiple machine translation approaches within a single machine translation system. The motivation for developing hybrid machine translation systems stems from the failure of any single technique to achieve a satisfactory level of accuracy. Many hybrid machine translation systems have been successful in improving the accuracy of the translations, and there are several popular machine translation systems which employ hybrid methods. == Approaches == === Multi-engine === This approach to hybrid machine translation involves running multiple machine translation systems in parallel. The final output is generated by combining the output of all the sub-systems. Most commonly, these systems use statistical and rule-based translation subsystems, but other combinations have been explored. For example, researchers at Carnegie Mellon University have had some success combining example-based, transfer-based, knowledge-based and statistical translation sub-systems into one machine translation system. === Statistical rule generation === This approach involves using statistical data to generate lexical and syntactic rules. The input is then processed with these rules as if it were a rule-based translator. This approach attempts to avoid the difficult and time-consuming task of creating a set of comprehensive, fine-grained linguistic rules by extracting those rules from the training corpus. This approach still suffers from many problems of normal statistical machine translation, namely that the accuracy of the translation will depend heavily on the similarity of the input text to the text of the training corpus. As a result, this technique has had the most success in domain-specific applications, and has the same difficulties with domain adaptation as many statistical machine translation systems. === Multi-Pass === This approach involves serially processing the input multiple times. The most common technique used in multi-pass machine translation systems is to pre-process the input with a rule-based machine translation system. The output of the rule-based pre-processor is passed to a statistical machine translation system, which produces the final output. This technique is used to limit the amount of information a statistical system need consider, significantly reducing the processing power required. It also removes the need for the rule-based system to be a complete translation system for the language, significantly reducing the amount of human effort and labor necessary to build the system. === Confidence-Based === This approach differs from the other hybrid approaches in that in most cases only one translation technology is used. A confidence metric is produced for each translated sentence from which a decision can be made whether to try a secondary translation technology or to proceed with the initial translation output. SMT is also used when common error patterns such as multiple repeat words appear in sequence, as is common with NMT when the attention mechanism is confused.

    Read more →
  • Viber

    Viber

    Rakuten Viber, commonly known as Viber, is a cross-platform voice over IP (VoIP) and instant messaging (IM) software application owned by the Japanese technology company Rakuten Group. The service is available as freeware for Android, iOS, Microsoft Windows, macOS and Linux. Users are registered and identified through a mobile phone number, although the service can also be accessed on desktop platforms without mobile connectivity. In addition to instant messaging, the platform allows users to exchange media such as images, videos and files, and provides a paid international calling service called Viber Out. The software was launched in 2010 by the company Viber Media, founded by Talmon Marco and Igor Magazinnik. Rakuten acquired Viber Media in 2014 and later renamed the company Rakuten Viber. The company is headquartered in Cyprus and maintains offices in London, Manila, Paris, San Francisco, Singapore, Tokyo and Beijing. == History == === Founding (2010) === Viber Media was founded in Tel Aviv, Israel, in 2010 by Talmon Marco and Igor Magazinnik. Marco and Magazinnik are also co-founders of the peer-to-peer media and file-sharing client iMesh. The company was run from Israel and was registered in Cyprus. Sani Maroli and Ofer Smocha soon joined the company as well. Marco said Viber allows instant calling and synchronization with contacts because the ID is the user's cell number. In its early days, Viber relied on a patchwork of outsourcing partners from different countries, commissioning specific solutions from external vendors — including teams based in Cyprus and Belarus. According to the company's statements, development of Viber's core functionality historically originated from its Tel Aviv office — a testament to its roots — even though the legal entity was registered elsewhere. === Early monetisation (2011) === In its first two years of availability, Viber did not generate revenues. It began doing so in 2013, via user payments for Viber Out voice calling and the Viber graphical messaging "sticker market". The company was originally funded by individual investors, described by Marco as "friends and family". They invested $20 million in the company, which had 120 employees as of May 2013. On 24 July 2013, Viber's support system was defaced by the Syrian Electronic Army. According to Viber, no sensitive user information was accessed. By the time Rakuten came forward with its acquisition deal in 2014, Viber had already stopped working with external vendors, choosing instead to consolidate development under its own offices. === Rakuten acquires Viber (2014) === On 13 February 2014, Rakuten announced they had acquired Viber Media for $900 million, and since then Viber has been owned by Rakuten, Inc., an e-commerce conglomerate headquartered in Tokyo. The sale of Viber earned the Shabtai family (Benny, his brother Gilad, and Gilad's son Ofer) some $500 million from their 55.2% stake in the company. At that sale price, the founders each realized over 30 times return on their investments. Later that year, the company established a UK presence with the incorporation of Viber UK Limited in London. Djamel Agaoua became Viber Media CEO in February 2017, replacing co-founder Marco who left in 2015. In July 2017 the corporate name of Viber Media was changed to Rakuten Viber and a new wordmark logo was introduced. Its legal name remains Viber Media, S.à r.l. based in Luxembourg. === Post-acquisition === In August 2015 Viber opened a regional office for Central and Eastern Europe in Sofia to support growth in the region. In 2017, Rakuten Viber and the World Wildlife Fund engaged in a commercial transaction aimed at raising awareness and protecting wildlife. After first using Viber to spread its message in June 2020, the International Federation of the Red Cross launched an official chatbot and community on the messaging app to combat the spread of false information, which they termed an infodemic, about COVID-19. The chatbot is still active as of June 2022, with over 1.4 million subscribers. In 2020, Rakuten Viber and the World Health Organization (the WHO) engaged in a commercial transaction for a chatbot to inform users of issues such as women's health. and an anti-smoking campaign. In the wake of the July–August 2020 Belarusian election protests, to avoid sanctions and harassment from monopolies the company closed its office in Minsk. In 2022, Ofir Eyal became Viber CEO, replacing Djamel Agaoua. Eyal is a Viber veteran; he worked as Vice President of Product in 2014 before his promotion to Chief Operating Officer in 2019. Shortly after the appointment of a new CEO, Viber continued its international expansion. In March 2022, Rakuten announced the opening of a development center in Tbilisi, Georgia, intended to support work on mobile applications and technology projects in the region. In July 2022, Rakuten Viber partnered with Rapyd to launch instant cross-border P2P payments. The company launched payments on the Viber app first in Greece and Germany, and then in other countries. In August, Mineski teamed up with Viber to develop a social minigame platform that can play off Viber's application. In May 2022, Rakuten Viber launched the premium chat service Viber Plus that offers exclusive features, including sticker market privileges, ad-free use, priority Viber support, exclusive badge, unique Viber icon, large file sharing, and more. In 2022, Viber joined the European Union’s Code of Conduct on countering illegal hate speech online. As part of this framework, the company undertook to review reported content and remove material identified as hate speech in accordance with the Code and its platform rules. In January 2024 Rakuten (the company behind Viber) established an office in Kyiv to bring together engineering and marketing departments. Alongside launching its Kyiv office the company joined Diia.City as a resident. Subsequently in October 2024 Rakuten Viber inaugurated an office in Manila to broaden its operations, in the Philippines. The company’s legal entity remains Viber Media S.à r.l., registered in Luxembourg. Viber’s engineering work has been carried out across multiple countries and through external partners, including outsourcing and near-shore vendors. As a result, its development operations are distributed internationally rather than concentrated in a single location. In December 2024, Viber was blocked in Russia. Roskomnadzor announced the nationwide blocking of the messaging app due to non-compliance with local legal requirements. == Security audit == On 4 November 2014, Viber scored 1 out of 7 points on the Electronic Frontier Foundation's "Secure Messaging Scorecard". Viber received a point for encryption during transit but lost points because communications were not encrypted with keys that the provider did not have access to (i.e. the communications were not end-to-end encrypted), users could not verify contacts' identities, past messages were not secure if the encryption keys were stolen (i.e. the service did not provide forward secrecy), the code was not open to independent review (i.e. the code was not open-source), the security design was not properly documented, and there had not been a recent independent security audit. On 14 November 2014, the EFF changed Viber's score to 2 out of 7 after it had received an external security audit from Ernst & Young's Advanced Security Centre. On 19 April 2016, with the announcement of Viber version 6.0, Rakuten added end-to-end encryption to their service. The company said that the encryption protocol had only been audited internally, and promised to commission external audits "in the coming weeks". In May 2016, Viber published an overview of their encryption protocol, saying that it is a custom implementation that "uses the same concepts" as the Signal Protocol. In 2022, Rakuten Viber won a Security Award, by test.de, a tech firm based in Germany where there are over 3 million Viber users. In 2024, Rakuten Viber received SOC certification following an audit conducted by Ernst & Young. The certification relates to the company’s controls for data protection and information security. == Market share == As of December 2016, Viber had 800 million registered users. According to Statista, there are 260 million monthly active users as of January 2019. The Viber messenger is very popular in the Philippines, Greece, Eastern Europe, Russia, the Middle East, and some Asian markets. India was the largest market for Viber as of December 2014 with 33 million registered users, the fifth most popular instant messenger in the country. At the same time there were 30 million users in the United States, 28 million in Russia and 18 million in Brazil. Viber is particularly popular in Eastern Europe, being the most downloaded messaging app on Android in Belarus, Moldova and Ukraine as of 2016. It is also popular in Iraq, Libya and Nepal. Viber is translated in 44 languages and used in more than 190 co

    Read more →
  • Data lineage

    Data lineage

    Data lineage refers to the process of tracking how data is generated, transformed, transmitted and used across systems over time. It documents data's origins, transformations and movements, providing detailed visibility into its life cycle. This process simplifies the identification of errors in data analytics workflows, by enabling users to trace issues back to their root causes. Data lineage facilitates the ability to replay specific segments or inputs of the dataflow. This can be used in debugging or regenerating lost outputs. In database systems, this concept is closely related to data provenance, which involves maintaining records of inputs, entities, systems and processes that influence data. Data provenance provides a historical record of data origins and transformations. It supports forensic activities such as data-dependency analysis, error/compromise detection, recovery, auditing and compliance analysis: "Lineage is a simple type of why provenance." Data governance plays a critical role in managing metadata by establishing guidelines, strategies and policies. Enhancing data lineage with data quality measures and master data management adds business value. Although data lineage is typically represented through a graphical user interface (GUI), the methods for gathering and exposing metadata to this interface can vary. Based on the metadata collection approach, data lineage can be categorized into three types: Those involving software packages for structured data, programming languages and Big data systems. Data lineage information includes technical metadata about data transformations. Enriched data lineage may include additional elements such as data quality test results, reference data, data models, business terminology, data stewardship information, program management details and enterprise systems associated with data points and transformations. Data lineage visualization tools often include masking features that allow users to focus on information relevant to specific use cases. To unify representations across disparate systems, metadata normalization or standardization may be required. == Representation of data lineage == Representation broadly depends on the scope of the metadata management and reference point of interest. Data lineage provides sources of the data and intermediate data flow hops from the reference point with backward data lineage, leading to the final destination's data points and its intermediate data flows with forward data lineage. These views can be combined with end-to-end lineage for a reference point that provides a complete audit trail of that data point of interest from sources to their final destinations. As the data points or hops increase, the complexity of such representation becomes incomprehensible. Thus, the best feature of the data lineage view is the ability to simplify the view by temporarily masking unwanted peripheral data points. Tools with the masking feature enable scalability of the view and enhance analysis with the best user experience for both technical and business users. Data lineage also enables companies to trace sources of specific business data to track errors, implement changes in processes and implementing system migrations to save significant amounts of time and resources. Data lineage can improve efficiency in business intelligence BI processes. Data lineage can be represented visually to discover the data flow and movement from its source to destination via various changes and hops on its way in the enterprise environment. This includes how the data is transformed along the way, how the representation and parameters change and how the data splits or converges after each hop. A simple representation of the Data Lineage can be shown with dots and lines, where dots represent data containers for data points, and lines connecting them represent transformations the data undergoes between the data containers. Data lineage can be visualized at various levels based on the granularity of the view. At a very high-level, data lineage is visualized as systems that the data interacts with before it reaches its destination. At its most granular, visualizations at the data point level can provide the details of the data point and its historical behavior, attribute properties and trends and data quality of the data passed through that specific data point in the data lineage. The scope of the data lineage determines the volume of metadata required to represent its data lineage. Usually, data governance and data management of an organization determine the scope of the data lineage based on their regulations, enterprise data management strategy, data impact, reporting attributes and critical data elements of the organization. == Rationale == Distributed systems like Google Map Reduce, Microsoft Dryad, Apache Hadoop (an open-source project) and Google Pregel provide such platforms for businesses and users. However, even with these systems, Big Data analytics can take several hours, days or weeks to run, simply due to the data volumes involved. For example, a ratings prediction algorithm for the Netflix Prize challenge took nearly 20 hours to execute on 50 cores, and a large-scale image processing task to estimate geographic information took 3 days to complete using 400 cores. "The Large Synoptic Survey Telescope is expected to generate terabytes of data every night and eventually store more than 50 petabytes, while in the bioinformatics sector, the 12 largest genome sequencing houses in the world now store petabytes of data apiece. It is very difficult for a data scientist to trace an unknown or an unanticipated result. === Big data debugging === Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Machine learning, among other algorithms, is used to transform and analyze the data. Due to the large size of the data, there could be unknown features in the data. The massive scale and unstructured nature of data, the complexity of these analytics pipelines, and long runtimes pose significant manageability and debugging challenges. Even a single error in these analytics can be extremely difficult to identify and remove. While one may debug them by re-running the entire analytics through a debugger for stepwise debugging, this can be expensive due to the amount of time and resources needed. Auditing and data validation are other major problems due to the growing ease of access to relevant data sources for use in experiments, the sharing of data between scientific communities and use of third-party data in business enterprises. As such, more cost-efficient ways of analyzing data intensive scale-able computing (DISC) are crucial to their continued effective use. === Challenges in Big Data debugging === ==== Massive scale ==== According to an EMC/IDC study, 2.8 ZB of data were created and replicated in 2012. Furthermore, the same study states that the digital universe will double every two years between now and 2020, and that there will be approximately 5.2 TB of data for every person in 2020. Based on current technology, the storage of this much data will mean greater energy usage by data centers. ==== Unstructured data ==== Unstructured data usually refers to information that doesn't reside in a traditional row-column database. Unstructured data files often include text and multimedia content, such as e-mail messages, word processing documents, videos, photos, audio files, presentations, web pages and many other kinds of business documents. While these types of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly into a database. The amount of unstructured data in enterprises is growing many times faster than structured databases are growing. Big data can include both structured and unstructured data, but IDC estimates that 90 percent of Big Data is unstructured data. The fundamental challenge of unstructured data sources is that they are difficult for non-technical business users and data analysts alike to unbox, understand and prepare for analytic use. Beyond issues of structure, the sheer volume of this type of data contributes to such difficulty. Because of this, current data mining techniques often leave out valuable information and make analyzing unstructured data laborious and expensive. In today's competitive business environment, companies have to find and analyze the relevant data they need quickly. The challenge is going through the volumes of data and accessing the level of detail needed, all at a high speed. The challenge only grows as the degree of granularity increases. One possible solution is hardware. Some vendors are using increased memory and parallel processing to crunch large volumes of data quickly. Another method is putting data in-memory but using a grid

    Read more →
  • Data room

    Data room

    Data rooms are secure spaces used for housing data, usually of a privileged or confidential nature. They can be physical data rooms, virtual data rooms (VDRs), or data centers. They are primarily used for a variety of corporate purposes, including data storage, document exchange, file sharing, financial transactions, and legal proceedings. Today, data rooms are central to workflows in mergers and acquisitions, venture capital, and corporate restructuring, increasingly utilizing artificial intelligence to securely manage and review large datasets. Historically, data rooms were strictly physical locations heavily guarded and monitored. Today, the vast majority of corporate data rooms are hosted virtually on secure cloud platforms, though physical rooms are still occasionally used for highly sensitive government or proprietary intelligence. == Physical Data Rooms == In mergers and acquisitions (M&A), the traditional data room genuinely consists of a physically secured and continually monitored room, normally in the vendor's offices or those of their legal counsel. Bidders and their advisers visit this room in order to inspect and report on various documents, legal contracts, and financial statements made available during the due diligence process. Historically, physical data rooms presented significant logistical challenges. Often, only one bidder at a time was allowed to enter to maintain document integrity and confidentiality. If new documents or new versions of documents were required, they had to be brought in by courier as hardcopies. Teams involved in large due diligence processes typically had to be flown in from many regions or countries and remain available throughout the process. Because these teams comprised a number of experts in different fields—such as legal counsel, forensic accountants, and industry specialists—the overall cost of keeping such groups on call near the physical data room was often extremely high. == Virtual Data Rooms (VDRs) == To address the costs and logistical bottlenecks of physical data rooms, virtual data rooms (VDRs) were developed to provide secure, online dissemination of confidential information. A VDR is essentially a secure cloud repository with strictly controlled access. Access is managed through secure log-ons supplied by the vendor or authority, which can be disabled at any time if a bidder withdraws from a transaction. Because much of the information released during corporate transactions is highly confidential, VDRs utilize digital rights management (DRM) to control information. Restrictions are applied to the viewers' ability to release data to third parties by disabling forwarding, copying, or printing capabilities. Modern VDRs also employ dynamic watermarking and detailed auditing capabilities. Detailed auditing is required for legal reasons so that a precise digital footprint is kept of who has viewed which version of each document, and for how long. Furthermore, modern VDR platforms are typically built to comply with stringent information security standards such as ISO 27001 and SOC 2. Transitioning from sequential physical data rooms to parallel virtual data rooms has been shown to significantly reduce the duration of M&A transactions while allowing sellers to field multiple bidders simultaneously. == Key Applications == Data rooms are commonly used by legal, accounting, investment banking, and private equity firms. Primary applications include: Mergers and Acquisitions (M&A): VDRs are central to the sell-side M&A process. After potential buyers sign a Non-Disclosure Agreement (NDA) and review a Confidential Information Memorandum (CIM), they are granted data room access to perform deep financial due diligence, such as Quality of Earnings (QoE) analysis and legal liability assessments. Venture Capital and Startups: Startups use data rooms as a centralized location for key operational data, capitalization tables, and financial projections to streamline due diligence for angel investors and venture capital firms during fundraising rounds. Initial Public Offerings (IPOs): Taking a company public requires intense regulatory scrutiny. Data rooms are used to securely share company histories and financial audits with investment bankers, legal teams, and regulatory bodies. Corporate Restructuring and Insolvency: During bankruptcies or corporate carve-outs, data rooms are used to organize outstanding debt profiles, creditor agreements, and operational liabilities. == Emerging Technologies == In recent years, the management of virtual data rooms has increasingly incorporated Artificial Intelligence (AI) and Machine Learning (ML). Generative AI and Natural Language Processing (NLP) tools are now integrated into VDRs to automatically index thousands of documents, perform auto-redaction of personally identifiable information (PII), and assist buy-side analysts in identifying hidden liabilities within unstructured text data during the due diligence phase. Modern AI algorithms can extract line items from financial statements to instantly populate structured databases.

    Read more →
  • Packed pixel

    Packed pixel

    In packed pixel or chunky framebuffer organization, the bits defining each pixel are clustered and stored consecutively. For example, if there are 16 bits per pixel, each pixel is represented in two consecutive (contiguous) 8-bit bytes in the framebuffer. If there are 4 bits per pixel, each framebuffer byte defines two pixels, one in each nibble. The latter example is as opposed to storing a single 4-bit pixel in a byte, leaving 4 bits of the byte unused. If a pixel has more than one channel, the channels are interleaved when using packed pixel organization. Packed pixel displays were common on early microcomputer system that shared a single main memory for both the central processing unit (CPU) and display driver. In such systems, memory was normally accessed a byte at a time, so by packing the pixels, the display system could read out several pixels worth of data in a single read operation. Packed pixel is one of two major ways to organize graphics data in memory, the other being planar organization, where each pixel is made of individual bits stored in their own plane. For a 4-bit color value, memory would be organized as four screen-sized planes of one bit each and a single pixel's value built up by selecting the appropriate bit from each plane. Planar organization has the advantage that the data can be accessed in parallel, and is used when memory bandwidth is an issue.

    Read more →
  • Backup

    Backup

    In information technology, a backup, or data backup is a copy of computer data taken and stored elsewhere so that it may be used to restore the original after a data loss event. The verb form, referring to the process of doing so, is "back up", whereas the noun and adjective form is "backup". Backups can be used to recover data after its loss from data deletion or corruption, or to recover data from an earlier time. Backups provide a simple form of IT disaster recovery; however not all backup systems are able to reconstitute a computer system or other complex configuration such as a computer cluster, active directory server, or database server. A backup system contains at least one copy of all data considered worth saving. The data storage requirements can be large. An information repository model may be used to provide structure to this storage. There are different types of data storage devices used for copying backups of data that is already in secondary storage onto archive files. There are also different ways these devices can be arranged to provide geographic dispersion, data security, and portability. Data is selected, extracted, and manipulated for storage. The process can include methods for dealing with live data, including open files, as well as compression, encryption, and de-duplication. Additional techniques apply to enterprise client-server backup. Backup schemes may include dry runs that validate the reliability of the data being backed up. There are limitations and human factors involved in any backup scheme. == Storage == A backup strategy requires an information repository, "a secondary storage space for data" that aggregates backups of data "sources". The repository could be as simple as a list of all backup media (DVDs, etc.) and the dates produced, or could include a computerized index, catalog, or relational database. === 3-2-1 Backup Rule === The backup data needs to be stored, requiring a backup rotation scheme, which is a system of backing up data to computer media that limits the number of backups of different dates retained separately, by appropriate re-use of the data storage media by overwriting of backups no longer needed. The scheme determines how and when each piece of removable storage is used for a backup operation and how long it is retained once it has backup data stored on it. The 3-2-1 rule can aid in the backup process. It states that there should be at least 3 copies of the data, stored on 2 different types of storage media, and one copy should be kept offsite, in a remote location (this can include cloud storage). 2 or more different media should be used to eliminate data loss due to similar reasons (for example, optical discs may tolerate being underwater while LTO tapes may not, and SSDs cannot fail due to head crashes or damaged spindle motors since they do not have any moving parts, unlike hard drives). An offsite copy protects against fire, theft of physical media (such as tapes or discs) and natural disasters like floods and earthquakes. Physically protected hard drives are an alternative to an offsite copy, but they have limitations like only being able to resist fire for a limited period of time, so an offsite copy still remains as the ideal choice. Because there is no perfect storage, many backup experts recommend maintaining a second copy on a local physical device, even if the data is also backed up offsite. === Backup methods === ==== Unstructured ==== An unstructured repository may simply be a stack of tapes, DVD-Rs or external HDDs with minimal information about what was backed up and when. This method is the easiest to implement, but unlikely to achieve a high level of recoverability as it lacks automation. ==== Full only/System imaging ==== A repository using this backup method contains complete source data copies taken at one or more specific points in time. Copying system images, this method is frequently used by computer technicians to record known good configurations. However, imaging is generally more useful as a way of deploying a standard configuration to many systems rather than as a tool for making ongoing backups of diverse systems. ==== Incremental ==== An incremental backup stores data changed since a reference point in time. Duplicate copies of unchanged data are not copied. Typically a full backup of all files is made once or at infrequent intervals, serving as the reference point for an incremental repository. Subsequently, a number of incremental backups are made after successive time periods. Restores begin with the last full backup and then apply the incrementals. Some backup systems can create a synthetic full backup from a series of incrementals, thus providing the equivalent of frequently doing a full backup. When done to modify a single archive file, this speeds restores of recent versions of files. ==== Near-CDP ==== Continuous Data Protection (CDP) refers to a backup that instantly saves a copy of every change made to the data. This allows restoration of data to any point in time and is the most comprehensive and advanced data protection. Near-CDP backup applications—often marketed as "CDP"—automatically take incremental backups at a specific interval, for example every 15 minutes, one hour, or 24 hours. They can therefore only allow restores to an interval boundary. Near-CDP backup applications use journaling and are typically based on periodic "snapshots", read-only copies of the data frozen at a particular point in time. Near-CDP (except for Apple Time Machine) intent-logs every change on the host system, often by saving byte or block-level differences rather than file-level differences. This backup method differs from simple disk mirroring in that it enables a roll-back of the log and thus a restoration of old images of data. Intent-logging allows precautions for the consistency of live data, protecting self-consistent files but requiring applications "be quiesced and made ready for backup." Near-CDP is more practicable for ordinary personal backup applications, as opposed to true CDP, which must be run in conjunction with a virtual machine or equivalent and is therefore generally used in enterprise client-server backups. Software may create copies of individual files such as written documents, multimedia projects, or user preferences, to prevent failed write events caused by power outages, operating system crashes, or exhausted disk space, from causing data loss. A common implementation is an appended ".bak" extension to the file name. ==== Reverse incremental ==== A Reverse incremental backup method stores a recent archive file "mirror" of the source data and a series of differences between the "mirror" in its current state and its previous states. A reverse incremental backup method starts with a non-image full backup. After the full backup is performed, the system periodically synchronizes the full backup with the live copy, while storing the data necessary to reconstruct older versions. This can either be done using hard links—as Apple Time Machine does, or using binary diffs. ==== Differential ==== A differential backup saves only the data that has changed since the last full backup. This means a maximum of two backups from the repository are used to restore the data. However, as time from the last full backup (and thus the accumulated changes in data) increases, so does the time to perform the differential backup. Restoring an entire system requires starting from the most recent full backup and then applying just the last differential backup. A differential backup copies files that have been created or changed since the last full backup, regardless of whether any other differential backups have been made since, whereas an incremental backup copies files that have been created or changed since the most recent backup of any type (full or incremental). Changes in files may be detected through a more recent date/time of last modification file attribute, and/or changes in file size. Other variations of incremental backup include multi-level incrementals and block-level incrementals that compare parts of files instead of just entire files. === Storage media === Regardless of the repository model that is used, the data has to be copied onto an archive file data storage medium. The medium used is also referred to as the type of backup destination. ==== Magnetic tape ==== Magnetic tape was for a long time the most commonly used medium for bulk data storage, backup, archiving, and interchange. It was previously a less expensive option, but this is no longer the case for smaller amounts of data. Tape is a sequential access medium, so the rate of continuously writing or reading data can be very fast. While tape media itself has a low cost per space, tape drives are typically dozens of times as expensive as hard disk drives and optical drives. Tape media are generally rotated on a schedule so at least one set is off-site in case something should happe

    Read more →
  • Media contacts database

    Media contacts database

    In public relations (PR) and marketing, a media contacts database is a resource which catalogs the names, contact information, and other details about people who work in various media professions. These include journalists, reporters, editors, publishers, contributors, freelance journalists, opinion writers, social media personalities/ influencers, TV show anchors, radio show hosts, DJs, and others. A media contacts database usually contains the following information: Full name of the media contact, The publication or channel they work for Designations (past and present) Topics they cover, or their beat Contact information found in public domains Online presence like blogs and other social networking sites Education Information == Overview == A media contacts database is a public relations tool that is maintained and used by PR professionals to pitch stories on a particular topic, product, or company to a specific group of journalists. These journalists would then write or speak about the particular topic in a relevant issue or episode of their shows. A media contacts database allows a PR professional to gain easy access to hundreds of journalists within a short span of time. Media contacts database are created and sold by many media research companies that offer such PR software for professionals.

    Read more →