Data integration

Data integration is the process of combining, sharing, or synchronizing data from multiple sources to provide users with a unified view. There are a wide range of possible applications for data integration, from commercial (such as when a business merges multiple databases) to scientific (combining research data from different bioinformatics repositories). The decision to integrate data tends to arise when the volume, complexity (that is, big data) and need to share existing data explodes. It has become the focus of extensive theoretical work, and numerous open problems remain unsolved. Data integration encourages collaboration between internal as well as external users. The data being integrated must be received from a heterogeneous database system and transformed to a single coherent data store that provides synchronous data across a network of files for clients. A common use of data integration is in data mining when analyzing and extracting information from existing databases that can be useful for Business information. == History == Issues with combining heterogeneous data sources, often referred to as information silos, under a single query interface have existed for some time. In the early 1980s, computer scientists began designing systems for interoperability of heterogeneous databases. The first data integration system driven by structured metadata was designed in 1991 at the University of Minnesota for the Integrated Public Use Microdata Series (IPUMS). IPUMS used a data warehousing approach, which extracts, transforms, and loads data from heterogeneous sources into a unique view schema so data from different sources become compatible. By making thousands of population databases interoperable, IPUMS demonstrated the feasibility of large-scale data integration. The data warehouse approach offers a tightly coupled architecture because the data are already physically reconciled in a single queryable repository, so it usually takes little time to resolve queries. The data warehouse approach is less feasible for data sets that are frequently updated, requiring the extract, transform, load (ETL) process to be continuously re-executed for synchronization. Difficulties also arise in constructing data warehouses when one has only a query interface to summary data sources and no access to the full data. This problem frequently emerges when integrating several commercial query services like travel or classified advertisement web applications. A trend began in 2009 favoring the loose coupling of data and providing a unified query-interface to access real time data over a mediated schema (see Figure 2), which allows information to be retrieved directly from original databases. This is consistent with the SOA approach popular in that era. This approach relies on mappings between the mediated schema and the schema of original sources, and translating a query into decomposed queries to match the schema of the original databases. Such mappings can be specified in two ways: as a mapping from entities in the mediated schema to entities in the original sources (the "Global-as-View" (GAV) approach), or as a mapping from entities in the original sources to the mediated schema (the "Local-as-View" (LAV) approach). The latter approach requires more sophisticated inferences to resolve a query on the mediated schema, but makes it easier to add new data sources to a (stable) mediated schema. As of 2010, some of the work in data integration research concerns the semantic integration problem. This problem addresses not the structuring of the architecture of the integration, but how to resolve semantic conflicts between heterogeneous data sources. For example, if two companies merge their databases, certain concepts and definitions in their respective schemas like "earnings" inevitably have different meanings. In one database it may mean profits in dollars (a floating-point number), while in the other it might represent the number of sales (an integer). A common strategy for the resolution of such problems involves the use of ontologies which explicitly define schema terms and thus help to resolve semantic conflicts. This approach represents ontology-based data integration. On the other hand, the problem of combining research results from different bioinformatics repositories requires bench-marking of the similarities, computed from different data sources, on a single criterion such as positive predictive value. This enables the data sources to be directly comparable and can be integrated even when the natures of experiments are distinct. As of 2011, it was determined that current data modeling methods were imparting data isolation into every data architecture in the form of islands of disparate data and information silos. This data isolation is an unintended artifact of the data modeling methodology that results in the development of disparate data models. Disparate data models, when instantiated as databases, form disparate databases. Enhanced data model methodologies have been developed to eliminate the data isolation artifact and to promote the development of integrated data models. One enhanced data modeling method recasts data models by augmenting them with structural metadata in the form of standardized data entities. As a result of recasting multiple data models, the set of recast data models will now share one or more commonality relationships that relate the structural metadata now common to these data models. Commonality relationships are a peer-to-peer type of entity relationships that relate the standardized data entities of multiple data models. Multiple data models that contain the same standard data entity may participate in the same commonality relationship. When integrated data models are instantiated as databases and are properly populated from a common set of master data, then these databases are integrated. Since 2011, data hub approaches have been of greater interest than fully structured (typically relational) Enterprise Data Warehouses. Since 2013, data lake approaches have risen to the level of Data Hubs. (See all three search terms popularity on Google Trends.) These approaches combine unstructured or varied data into one location, but do not necessarily require an (often complex) master relational schema to structure and define all data in the Hub. In recent times, as the number of applications being used have increased many fold and application to application integration have become critical and this has given rise to [Unified APIs] that help application developers integrate their apps with other apps and more recently with [MCP - Model Context Protocol] taking it a step further for AI Agents. Data integration plays a big role in business regarding data collection used for studying the market. Converting the raw data retrieved from consumers into coherent data is something businesses try to do when considering what steps they should take next. Organizations are more frequently using data mining for collecting information and patterns from their databases, and this process helps them develop new business strategies to increase business performance and perform economic analyses more efficiently. Compiling the large amount of data they collect to be stored in their system is a form of data integration adapted for Business intelligence to improve their chances of success. == Example == Consider a web application where a user can query a variety of information about cities (such as crime statistics, weather, hotels, demographics, etc.). Traditionally, the information must be stored in a single database with a single schema. But any single enterprise would find information of this breadth somewhat difficult and expensive to collect. Even if the resources exist to gather the data, it would likely duplicate data in existing crime databases, weather websites, and census data. A data-integration solution may address this problem by considering these external resources as materialized views over a virtual mediated schema, resulting in "virtual data integration". This means application-developers construct a virtual schema—the mediated schema—to best model the kinds of answers their users want. Next, they design "wrappers" or adapters for each data source, such as the crime database and weather website. These adapters simply transform the local query results (those returned by the respective websites or databases) into an easily processed form for the data integration solution (see figure 2). When an application-user queries the mediated schema, the data-integration solution transforms this query into appropriate queries over the respective data sources. Finally, the virtual database combines the results of these queries into the answer to the user's query. This solution offers the convenience of adding new sources by simply constructing an adapter or an application software blade for them. It contrasts with ETL systems or with a si

Lost Art-Database

The Lost Art-Datenbank is an online database published by the German Lost Art Foundation (Deutsches Zentrum Kulturgutverluste. It contains information on cultural objects looted from Jewish collectors or transferred due to Nazi persecution during the Nazi era. Until 2015, it was managed by the Koordinierungsstelle für Kulturgutverluste (Magdeburg Coordination Office). == Creation == Following the Washington Conference of 1998, and the commitments to provide more transparency regarding looted art, Germany launched the Lost Art Database in 2000 order to help Holocaust victims and their families track down artworks that had been looted from them or lost due to Nazi persecution. == Functionality == The Lost Art Database lists art and books and other cultural objects that were lost, seized, stolen or forceably sold during the Nazi era. The database is divided into search requests from victims' families, heirs or institutions and "found" reports from cultural institutions on items with unresolved provenance gaps from the Nazi periods. The section on reports of finds lists objects that are known to have been unlawfully seized or relocated as a result of the war. In addition, reports are published here on cultural objects for which an uncertain or incomplete provenance may indicate a possible unlawful seizure or war-related relocation. The publication of reports in the Lost Art Internet Database is carried out on behalf of and with the consent of the reporting persons and institutions. The responsibility for the content of the reports lies with these legal or natural persons. There have been controversies over which items should be included in the database. Lost Art is based on the Washington Principles adopted in 1998, which Germany has committed itself to implementing (Joint Declaration, 1999). The Lost Art Database is considered a key resource in the search for looted art and the victims of persecution. Every item in the Lost Art Database has an identifier, known as a Lost Art ID. Proveana is the linked research database. == Other lost art databases == Other countries have launched databases to help identify Nazi looted art. Each database has its own area of focus. The German Lost Art Database allows families or heirs to submit information. Other countries have databases that focus on looted artworks that have not been found or artworks that were repatriated to the national authorities after the defeat of the Nazis but were never returned to their original owners. Other databases have been created for stolen antiquities, looted art from colonial era, art stolen from Syria, Iraq, Ukraine, or from museums or collectors.

General time- and transfer constant analysis

The general time- and transfer-constants (TTC) analysis is the generalized version of the Cochran-Grabel (CG) method, which itself is the generalized version of zero-value time-constants (ZVT), which in turn is the generalization of the open-circuit time constant method (OCT). While the other methods mentioned provide varying terms of only the denominator of an arbitrary transfer function, TTC can be used to determine every term both in the numerator and the denominator. Its denominator terms are the same as that of Cochran-Grabel method, when stated in terms of time constants (when expressed in Rosenstark notation). however, the numerator terms are determined using a combination of transfer constants and time constants, where the time constants are the same as those in CG method. Transfer constants are low-frequency ratios of the output variable to input variable under different open- and short-circuited active elements. In general, a transfer function (which can characterize gain, admittance, impedance, trans-impedance, etc., based on the choice of the input and output variables) can be written as: H ( s ) = a 0 + a 1 s + a 2 s 2 + … + a m s m 1 + b 1 s + b 2 s 2 + … + b n s n {\displaystyle H(s)={\frac {a_{0}+a_{1}s+a_{2}s^{2}+\ldots +a_{m}s^{m}}{1+b_{1}s+b_{2}s^{2}+\ldots +b_{n}s^{n}}}} == The denominator terms == The first denominator term b 1 {\textstyle b_{1}} can be expressed as the sum of zero value time constants (ZVTs): b 1 = ∑ i = 1 N τ i 0 {\displaystyle b_{1}=\sum _{i=1}^{N}\tau _{i}^{0}} where τ i 0 {\textstyle \tau _{i}^{0}} is the time constant associated with the reactive element i {\textstyle i} when all the other sources are zero-valued (hence the superscript '0'). Setting a capacitor value to zero corresponds to an open circuit, while a zero-valued inductor is a short circuit. So for calculation of the τ i 0 {\textstyle \tau _{i}^{0}} , all other capacitors are open-circuited and all other inductors are short-circuited. This is the essence of the ZVT method, which reduces to OCT when only capacitors are involved. All independent sources are also zero-valued during the time constant calculations (voltage sources short-circuited and current source open-circuited). In this case, if the element in question (element i {\textstyle i} ) is a capacitor, the time constant is given by τ i 0 = R i 0 C i {\displaystyle \tau _{i}^{0}=R_{i}^{0}C_{i}} and when element i {\textstyle i} is an inductor is it given by: τ i 0 = L i / R i 0 {\displaystyle \tau _{i}^{0}=L_{i}/R_{i}^{0}} . where in both cases, the resistance R i 0 {\textstyle R_{i}^{0}} , is the resistance seen by elements i {\textstyle i} (denoted by subscript), when all the other elements are zero-valued (denoted by the zero superscript). The second-order denominator term is equal to: b 2 = ∑ i = 1 N − 1 ∑ j = i + 1 N τ i 0 τ j i = ∑ i 1 ⩽ i ∑ j < j ⩽ N τ i 0 τ j i {\displaystyle b_{2}=\sum _{i=1}^{N-1}\sum _{j=i+1}^{N}\tau _{i}^{0}\tau _{j}^{i}=\sum _{i}^{1\leqslant i}\sum _{j}^{

Web worker

A web worker, as defined by the World Wide Web Consortium (W3C) and the Web Hypertext Application Technology Working Group (WHATWG), is a JavaScript script executed from an HTML page that runs in the background, independently of scripts that may also have been executed from the same HTML page. Web workers are often able to utilize multi-core CPUs more effectively. The W3C and WHATWG envision web workers as long-running scripts that are not interrupted by scripts that respond to clicks or other user interactions. Keeping such workers from being interrupted by user activities should allow Web pages to remain responsive at the same time as they are running long tasks in the background. The web worker specification is part of the HTML Living Standard. == Overview == As envisioned by WHATWG, web workers are relatively heavy-weight and are not intended to be used in large numbers. They are expected to be long-lived, with a high start-up performance cost, and a high per-instance memory cost. Web workers run outside the context of an HTML document's scripts. Consequently, while they do not have access to the DOM, they can facilitate concurrent execution of JavaScript programs. == Features == Web workers interact with the main document via message passing. The following code creates a Worker that will execute the JavaScript in the given file. To send a message to the worker, the postMessage method of the worker object is used as shown below. The onmessage property uses an event handler to retrieve information from a worker. Once a worker is terminated, it goes out of scope and the variable referencing it becomes undefined; at this point a new worker has to be created if needed. == Example == The simplest use of web workers is for performing a computationally expensive task without interrupting the user interface. In this example, the main document spawns a web worker to compute prime numbers, and progressively displays the most recently found prime number. The main page is as follows: The Worker() constructor call creates a web worker and returns a worker object representing that web worker, which is used to communicate with the web worker. That object's onmessage event handler allows the code to receive messages from the web worker. The Web Worker itself is as follows: To send a message back to the page, the postMessage() method is used to post a message when a prime is found. == Support == If the browser supports web workers, a Worker property will be available on the global window object. The Worker property will be undefined if the browser does not support it. The following example code checks for web worker support on a browser Web workers are currently supported by Chrome, Opera, Edge, Internet Explorer (version 10), Mozilla Firefox, and Safari. Mobile Safari for iOS has supported web workers since iOS 5. The Android browser first supported web workers in Android 2.1, but support was removed in Android versions 2.2–4.3 before being restored in Android 4.4.

Groundswell (book)

Groundswell is a book by Forrester Research executives Charlene Li and Josh Bernoff that focuses on how companies can take advantage of emerging social technologies. It was published in 2008 by Harvard Business Press. A revised edition was published in 2011. The book attempts to explain a shift in the relationship between customers and companies, in which companies are no longer able to control customers' attitudes through market research, customer service, and advertising. Instead, customers are controlling the conversation by using new media to communicate about products and companies. == Synopsis == The groundswell is characterized by several tactics that guide companies into using social technologies strategically and effectively. Listening: Businesses should listen to their customers to understand what the market is looking for in their products. In order to do this, a company needs to find out if their customers are using social technologies and how they are using them. Talking: Instead of advertising to customers, marketing departments should find creative ways to connect with users about their experience with a product and their feelings about the brand. One common method is participation in social networks. Energizing: Enthusiastic customers are part of the groundswell, and companies can recognize and appreciate these customers by creating online communities and social platforms where they can connect with the brand and provide reviews. Supporting: Businesses can harness the support of their own employees by creating internal social applications for them to connect with the brand, also known as enterprise social software. == Groundswell in action == === Examples === Some companies distinguish their product through the use of social technologies. Tom Dickson successfully marketed his Blendtec line of blenders through the viral marketing campaign Will It Blend? The groundswell spread marketing messages through Digg and YouTube with a small budget and little marketing experience. Other companies have been able to listen to and talk with the groundswell by building their own online communities. Procter & Gamble created beinggirl.com Archived 2016-04-10 at the Wayback Machine to introduce girls to P&G feminine care products. The community approach worked because the company could reach girls with information that might seem embarrassing or sensitive in a traditional marketing campaign. === Risks === Features of particular industries or companies can make direct customer engagement more difficult. For instance, some companies must work within industry regulations, national or multinational corporations must balance corporate and local engagement, and other companies must find ways to engage with customers on time-sensitive issues. == Reception == Kevin Allison of the Financial Times praised the book for its focus on Web analytics: "[Groundswell] is not so much a manifesto or a dissection of online culture as it is a how-to manual for executives and mid-level managers trying to navigate this fast-changing and often confusing environment." The book won the American Marketing Association Foundation’s Berry-AMA Book Prize for best marketing book of 2009. It was also listed by: Amazon, as one of the Top 10 Business & Investing Books of 2008 CIO Insight, as one of the Top 10 Business-Tech Books of 2008 and one of 10 Insightful Web 2.0 Books Fortune as Magazine as one of the 3 best Web books of 2008 Advertising Age as number 3 of 10 Books You Should Have Read BusinessWeek as one of the Best Innovation & Design Books of 2008 "strategy+business" as one of the Best Business Books 2008 and “Top Shelf” in Marketing

Representation collapse

Representation collapse is a phenomenon in machine learning and representation learning where a model maps different inputs to the same or very similar embeddings, which means it loses important information about how the data is spread out. It is frequently encountered in self-supervised learning, especially within contrastive and non-contrastive frameworks, when training objectives or model architectures do not maintain variance across representations. Collapse results in degenerate solutions characterized by uninformative learned features, significantly impairing downstream task performance. Various techniques have been proposed to mitigate representation collapse, including the use of negative samples, architectural asymmetry, stop-gradient operations, variance regularization, and redundancy reduction objectives, as seen in methods such as SimCLR, BYOL, and VICReg. Comprehending and averting representation collapse is regarded as a fundamental challenge in the advancement of stable and efficient self-supervised learning systems.

Data communication

Data communication is the transfer of data over a point-to-point or point-to-multipoint communication channel. Data communication comprises data transmission and data reception and can be classified as analog transmission and digital communications. Analog data communication conveys voice, data, image, signal or video information using a continuous signal, which varies in amplitude, phase, or some other property. In baseband analog transmission, messages are represented by a sequence of pulses by means of a line code; in passband analog transmission, they are communicated by a limited set of continuously varying waveforms, using a digital modulation method. Passband modulation and demodulation are carried out by modem equipment. Digital transmission and digital reception are the transfer of either a digitized analog signal or a born-digital bitstream. Baseband digital transmission is regarded as comprising part of a digital signal, whereas passband transmission of digital data may also or alternatively be considered a form of digital-to-analog conversion. Data communication channels include copper wires, optical fibers, wireless communication using radio spectrum, storage media and computer buses. The data are represented as an electromagnetic signal, such as an electrical voltage, radiowave, microwave, or infrared signal. == Distinction between related subjects == Digital transmission or data transmission traditionally belongs to telecommunications and electrical engineering. Basic principles of data transmission may also be covered within the computer science or computer engineering topic of data communications, which also includes computer networking applications and communication protocols, for example, routing, switching and inter-process communication. Although the Transmission Control Protocol (TCP) involves transmission, TCP and other transport layer protocols are covered in computer networking but not discussed in a textbook or course about data transmission. In most textbooks, the term analog transmission only refers to the transmission of an analog message signal (without digitization) by means of an analog signal, either as a non-modulated baseband signal or as a passband signal using an analog modulation method such as AM or FM. It may also include analog-over-analog pulse modulated baseband signals such as pulse-width modulation. In a few books within the computer networking tradition, analog transmission also refers to passband transmission of bit-streams using digital modulation methods such as FSK, PSK and ASK. The theoretical aspects of data transmission are covered by information theory and coding theory. == Protocol layers and sub-topics == Courses and textbooks in the field of data transmission typically deal with the following OSI model protocol layers and topics: Layer 1, the physical layer: Channel coding including Digital modulation schemes Line coding schemes Forward error correction (FEC) codes Bit synchronization Multiplexing Equalization Channel models Layer 2, the data link layer: Channel access schemes, media access control (MAC) Packet mode communication and Frame synchronization Error detection and automatic repeat request (ARQ) Flow control Layer 6, the presentation layer: Source coding (digitization and data compression), and information theory. Cryptography (may occur at any layer) It is also common to deal with the cross-layer design of those three layers. == Applications and history == Data (mainly but not exclusively informational) has been sent via non-electronic (e.g. optical, acoustic, mechanical) means since the advent of communication. Analog signal data has been sent electronically since the advent of the telephone. However, the first data electromagnetic transmission applications in modern time were electrical telegraphy (1809) and teletypewriters (1906), which are both digital signals. The fundamental theoretical work in data transmission and information theory by Harry Nyquist, Ralph Hartley, Claude Shannon and others during the early 20th century, was done with these applications in mind. In the early 1960s, Paul Baran invented distributed adaptive message block switching for digital communication of voice messages using switches that were low-cost electronics. Donald Davies invented and implemented modern data communication during 1965–7, including packet switching, high-speed routers, communication protocols, hierarchical computer networks and the essence of the end-to-end principle. Baran's work did not include routers with software switches and communication protocols, nor the idea that users, rather than the network itself, would provide the reliability. Both were seminal contributions that influenced the development of computer networks. Data transmission is utilized in computers in computer buses and for communication with peripheral equipment via parallel ports and serial ports such as RS-232 (1969), FireWire (1995) and USB (1996). The principles of data transmission are also utilized in storage media for error detection and correction since 1951. The first practical method to overcome the problem of receiving data accurately by the receiver using digital code was the Barker code invented by Ronald Hugh Barker in 1952 and published in 1953. Data transmission is utilized in computer networking equipment such as modems (1940), local area network (LAN) adapters (1964), repeaters, repeater hubs, microwave links, wireless network access points (1997), etc. In telephone networks, digital communication is utilized for transferring many phone calls over the same copper cable or fiber cable by means of pulse-code modulation (PCM) in combination with time-division multiplexing (TDM) (1962). Telephone exchanges have become digital and software controlled, facilitating many value-added services. For example, the first AXE telephone exchange was presented in 1976. Digital communication to the end user using Integrated Services Digital Network (ISDN) services became available in the late 1980s. Since the end of the 1990s, broadband access techniques such as ADSL, Cable modems, fiber-to-the-building (FTTB) and fiber-to-the-home (FTTH) have become widespread to small offices and homes. The current tendency is to replace traditional telecommunication services with packet mode communication such as IP telephony and IPTV. Transmitting analog signals digitally allows for greater signal processing capability. The ability to process a communications signal means that errors caused by random processes can be detected and corrected. Digital signals can also be sampled instead of continuously monitored. The multiplexing of multiple digital signals is much simpler compared to the multiplexing of analog signals. Because of all these advantages, because of the vast demand to transmit computer data and the ability of digital communications to do so and because recent advances in wideband communication channels and solid-state electronics have allowed engineers to realize these advantages fully, digital communications have grown quickly. The digital revolution has also resulted in many digital telecommunication applications where the principles of data transmission are applied. Examples include second-generation (1991) and later cellular telephony, video conferencing, digital TV (1998), digital radio (1999), and telemetry. Data transmission, digital transmission or digital communications is the transfer of data over a point-to-point or point-to-multipoint communication channel. Examples of such channels include copper wires, optical fibers, wireless communication channels, storage media and computer buses. The data are represented as an electromagnetic signal, such as an electrical voltage, radio wave, microwave, or infrared light. While analog transmission is the transfer of a continuously varying analog signal over an analog channel, digital communication is the transfer of discrete messages over a digital or an analog channel. The messages are either represented by a sequence of pulses by means of a line code (baseband transmission) or by a limited set of continuously varying waveforms (passband transmission), using a digital modulation method. The passband modulation and corresponding demodulation (also known as detection) are carried out by modem equipment. According to the most common definition of a digital signal, both baseband and passband signals representing bit-streams are considered as digital transmission, while an alternative definition only considers the baseband signal as digital, and passband transmission of digital data as a form of digital-to-analog conversion. Data transmitted may be digital messages originating from a data source, for example, a computer or a keyboard. It may also be an analog signal, such as a phone call or a video signal, digitized into a bit-stream, for example,e using pulse-code modulation (PCM) or more advanced source coding (analog-to-digital conversion and