AI Detector Check

AI Detector Check — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Semantic analysis (machine learning)

    Semantic analysis (machine learning)

    In machine learning, semantic analysis of a text corpus is the task of building structures that approximate concepts from a large set of documents. It generally does not involve prior semantic understanding of the documents. Semantic analysis strategies include: Metalanguages based on first-order logic, which can analyze the speech of humans. Understanding the semantics of a text is symbol grounding: if language is grounded, it is equal to recognizing a machine-readable meaning. For the restricted domain of spatial analysis, a computer-based language understanding system was demonstrated. Latent semantic analysis (LSA), a class of techniques where documents are represented as vectors in a term space. A prominent example is probabilistic latent semantic analysis (PLSA). Latent Dirichlet allocation, which involves attributing document terms to topics. n-grams and hidden Markov models, which work by representing the term stream as a Markov chain, in which each term is derived from preceding terms. == Stochastic semantic analysis ==

    Read more →
  • Data lake

    Data lake

    A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video). A data lake can be established on premises (within an organization's data centers) or in the cloud (using cloud services). == Background == James Dixon, then chief technology officer at Pentaho, coined the term by 2011 to contrast it with data mart, which is a smaller repository of interesting attributes derived from raw data. In promoting data lakes, he argued that data marts have several inherent problems, such as information siloing. PricewaterhouseCoopers (PwC) said that data lakes could "put an end to data silos". In their study on data lakes, they noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository." == Examples == Many companies use cloud storage services such as Google Cloud Storage and Amazon S3 or a distributed file system such as Apache Hadoop distributed file system (HDFS). There is a gradual academic interest in the concept of data lakes. For example, Personal DataLake at Cardiff University is a new type of data lake which aims at managing big data of individual users by providing a single point of collecting, organizing, and sharing personal data. Early data lakes, such as Hadoop 1.0, had limited capabilities because it only supported batch-oriented processing (Map Reduce). Interacting with it required expertise in Java, map reduce and higher-level tools like Apache Pig, Apache Spark and Apache Hive (which were also originally batch-oriented). == Criticism == Poorly managed data lakes have been facetiously called data swamps. In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data". PwC was also careful to note in their research that not all data lake initiatives are successful. They quote Sean Martin, CTO of Cambridge Semantics: We see customers creating big data graveyards, dumping everything into Hadoop distributed file system (HDFS) and hoping to do something with it down the road. But then they just lose track of what’s there. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents. They describe companies that build successful data lakes as gradually maturing their lake as they figure out which data and metadata are important to the organization. Another criticism is that the term data lake is used with many different meanings. It may be used to refer to, for example: any tools or data management practices that are not data warehouses; a particular technology for implementation; a raw data reservoir; a hub for ETL offload; or a central hub for self-service analytics. While critiques of data lakes are warranted, in many cases they apply to other data projects as well. For example, the definition of data warehouse is also changeable, and not all data warehouse efforts have been successful. In response to various critiques, McKinsey noted that the data lake should be viewed as a service model for delivering business value within the enterprise, not a technology outcome. == Data lakehouses == Data lakehouses are a hybrid approach that can ingest a variety of raw data formats like a data lake, while also providing ACID transactions and enforced data quality like a data warehouse.

    Read more →
  • Cloud Data Management Interface

    Cloud Data Management Interface

    ISO/IEC 17826 Information technology — Cloud Data Management Interface (CDMI) Version 2.0.0 is an international standard that specifies a protocol for self-provisioning, administering and managing access to data stored in cloud storage, object storage, storage area network and network attached storage systems. The CDMI standard is developed and maintained by the Storage Networking Industry Association, who makes a publicly accessible version of the specification available. CDMI defines new resource representations to enable standardized management of any URI-accessible data, and defines RESTful HTTP operations using these representations to discover the capabilities of the storage system, discover stored data, access and update management metadata, specify data storage protocols (such as iSCSI and NFS) through which the stored data is accessed, and provide cross-system and cross-cloud import and export in order to enable data portability. Management functions enabled by CDMI include managing data ownership, identity mapping, access controls, user-specified metadata, and to declaratively specify desired data protection, data retention, constraints on geographic placement, desired quality of service, data versioning and security requirements. CDMI also defines utility services to facilitate data management, such the ability to query data matching specific criteria, and includes extensions to perform bulk updates using CDMI Jobs. == Capabilities == Compliant implementations must provide access to a set of configuration parameters known as capabilities. These are either boolean values that represent whether or not a system supports things such as queues, export via other protocols, path-based storage and so on, or numeric values expressing system limits, such as how much metadata may be placed on an object. As a minimal compliant implementation can be quite small, with few features, clients need to check the cloud storage system for a capability before attempting to use the functionality it represents. Resource allocation assignments limited to the data management interface protocols must possess access bypass capabilities which extend beyond the layered framework. This integral function is vital to the prevention of transport layer session hijacking by unauthorized entities which may circumvent standard interfacing security parameters. == Containers == A CDMI client may access objects, including containers, by either name or object id (OID), assuming the CDMI server supports both methods. When storing objects by name, it is natural to use nested named containers; the resulting structure corresponds exactly to a traditional filesystem directory structure. == Objects == Objects are similar to files in a traditional file system, but are enhanced with an increased amount and capacity for metadata. As with containers, they may be accessed by either name or OID. When accessed by name, clients use URLs that contain the full pathname of objects to create, read, update and delete them. When accessed by OID, the URL specifies an OID string in the cdmi-objectid container; this container presents a flat name space conformant with standard object storage system semantics. Subject to system limits, objects may be of any size or type and have arbitrary user-supplied metadata attached to them. Systems that support query allow arbitrary queries to be run against the metadata. == Domains, Users and Groups == CDMI supports the concept of a domain, similar in concept to a domain in the Windows Active Directory model. Users and groups created in a domain share a common administrative database and are known to each other on a "first name" basis, i.e. without reference to any other domain or system. Domains also function as containers for usage and billing summary data. == Access Control == CDMI exactly follows the ACL and ACE model used for file authorization operations by NFSv4. This makes it also compatible with Microsoft Windows systems. == Metadata == CDMI draws much of its metadata model from the XAM specification. Objects and containers have "storage system metadata", "data system metadata" and arbitrary user specified metadata, in addition to the metadata maintained by an ordinary filesystem (atime etc.). == Queries == CDMI specifies a way for systems to support arbitrary queries against CDMI containers, with a rich set of comparison operators, including support for regular expressions. == Queues == CDMI supports the concept of persistent FIFO (first-in, first-out) queues. These are useful for job scheduling, order processing and other tasks in which lists of things must be processed in order. == Compliance == Both retention intervals and retention holds are supported by CDMI. A retention interval consists of a start time and a retention period. During this time interval, objects are preserved as immutable and may not be deleted. A retention hold is usually placed on an object because of judicial action and has the same effect: objects may not be changed nor deleted until all holds placed on them are removed. == Billing == Summary information suitable for billing clients for on-demand services can be obtained by authorized users from systems that support it. == Serialization == Serialization of objects and containers allows export of all data and metadata on a system and importation of that data into another cloud system. == Foreign protocols == CDMI supports export of containers as NFS or CIFS shares. Clients that mount these shares see the container hierarchy as an ordinary filesystem directory hierarchy, and the objects in the containers as normal files. Metadata outside of ordinary filesystem metadata may or may not be exposed. Provisioning of iSCSI LUNs is also supported. == Client SDKs == CDMI Reference Implementation Droplet libcdmi-java libcdmi-python .NET SDK

    Read more →
  • Squeaky Dolphin

    Squeaky Dolphin

    Squeaky Dolphin is a program developed by the Government Communications Headquarters (GCHQ), a British intelligence and security organization, to collect and analyze data from social media networks. The program was first revealed to the general public on NBC on 27 January 2014 based on documents previously leaked by Edward Snowden. == Scope of surveillance == According to a document of the GCHQ dated August 2012, the program enables broad, real-time surveillance of the following items: YouTube video views The Like button on Facebook. Facebook has since then encrypted the data. Blogspot/Blogger visits Twitter, which has however encrypted its communications since this presentation was made The program can be supplemented with commercially available analytic software to determine which videos are popular among residents of specific cities. The dashboard software chosen was made by Splunk. The presentation, which was originally shown to an NSA audience and was made public by the NBC, contains a note saying the program was "Not interested in individuals just broad trends!". However, "according to other Snowden documents" obtained by NBC, in 2010, "GCHQ exploited unencrypted data from Twitter to identify specific users around the world and target them with propaganda."

    Read more →
  • Digital supply chain security

    Digital supply chain security

    Digital supply chain security refers to efforts to enhance cyber security within the supply chain. It is a subset of supply chain security and is focused on the management of cyber security requirements for information technology systems, software and networks, which are driven by threats such as cyber-terrorism, malware, data theft and the advanced persistent threat (APT). Typical supply chain cyber security activities for minimizing risks include buying only from trusted vendors, disconnecting critical machines from outside networks, and educating users on the threats and protective measures they can take. The acting deputy undersecretary for the National Protection and Programs Directorate for the United States Department of Homeland Security, Greg Schaffer, stated at a hearing that he is aware that there are instances where malware has been found on imported electronic and computer devices sold within the United States. == Examples of supply chain cyber security threats == Network or computer hardware that is delivered with malware installed on it already. Malware that is inserted into software or hardware (by various means) Vulnerabilities in software applications and networks within the supply chain that are discovered by malicious hackers Counterfeit computer hardware == Related U.S. government efforts == Comprehensive National Cyber Initiative Defense Procurement Regulations: Noted in section 806 of the National Defense Authorization Act International Strategy for Cyberspace: White House lays out for the first time the U.S.’s vision for a secure and open Internet. The strategy outlines three main themes: diplomacy, development and defense. Diplomacy: The strategy sets out to “promote an open, interoperable, secure and reliable information and communication infrastructure” by establishing norms of acceptable state behavior built through consensus among nations. Development: Through this strategy the government seeks to “facilitate cybersecurity capacity-building abroad, bilaterally and through multilateral organizations.” The objective is to protect the global IT infrastructure and to build closer international partnerships to sustain open and secure networks. Defense: The strategy calls out that the government “will ensure that the risks associated with attacking or exploiting our networks vastly outweigh the potential benefits” and calls for all nations to investigate, apprehend and prosecute criminals and non-state actors who intrude and disrupt network systems. == Related government efforts around the world == Common Criteria offers with Evaluation Assurance Level(EAL) 4 an opportunity to evaluate all relevant aspects of the digital supply chain security like the product, the development environment, IT systems security, the processes in human resource, physical security and with the module ALC_FLR.3 (Systematic Flaw Remediation) also security update processes and methods even by physical site visits. EAL 4 is mutually recognized in countries that signed the SOGIS-MRA and up to ELA 2 in countries the signed the CCRA but including ALC_FRL.3. Russia: Russia has had non-disclosed functionality certification requirements for several years and has recently initiated the National Software Platform effort based on open-source software. This reflects the apparent desire for national autonomy, reducing dependence on foreign suppliers. India: Recognition of supply chain risk in its draft National Cybersecurity Strategy. Rather than targeting specific products for exclusion, it is considering Indigenous Innovation policies, giving preferences to domestic ITC suppliers in order to create a robust, globally competitive national presence in the sector. China: Deriving from goals in the 11th Five Year Plan (2006–2010), China introduced and pursued a mix of security-focused and aggressive Indigenous Innovation policies. China is requiring an indigenous innovation product catalog be used for its government procurement and implementing a Multi-level Protection Scheme (MLPS) which requires (among other things) product developers and manufacturers to be Chinese citizens or legal persons, and product core technology and key components must have independent Chinese or indigenous intellectual property rights. == Private sector efforts == SLSA (Supply-chain Levels for Software Artifacts) is an end-to-end framework for ensuring the integrity of software artifacts throughout the software supply chain. The requirements are inspired by Google’s internal "Binary Authorization for Borg" that has been in use for the past 8+ years and that is mandatory for all of Google's production workloads. The goal of SLSA is to improve the state of the industry, particularly open source, to defend against the most pressing integrity threats. With SLSA, consumers can make informed choices about the security posture of the software they consume. == Other references == Financial Sector Information Sharing and Analysis Center International Strategy for Cyberspace (from the White House) NSTIC SafeCode Whitepaper Archived 2013-10-21 at the Wayback Machine Trusted Technology Forum and the Open Trusted Technology Provider Standard (O-TTPS) Archived 2012-01-03 at the Wayback Machine Cyber Supply Chain Security Solution Malware Implants in Firmware Supply Chain in the Software Era INFORMATION AND COMMUNICATIONS TECHNOLOGY SUPPLY CHAIN RISK MANAGEMENT TASK FORCE: INTERIM REPORT

    Read more →
  • Data validation and reconciliation

    Data validation and reconciliation

    Industrial process data validation and reconciliation, or more briefly, process data reconciliation (PDR), is a technology that uses process information and mathematical methods in order to automatically ensure data validation and reconciliation by correcting measurements in industrial processes. The use of PDR allows for extracting accurate and reliable information about the state of industry processes from raw measurement data and produces a single consistent set of data representing the most likely process operation. == Models, data and measurement errors == Industrial processes, for example chemical or thermodynamic processes in chemical plants, refineries, oil or gas production sites, or power plants, are often represented by two fundamental means: Models that express the general structure of the processes, Data that reflects the state of the processes at a given point in time. Models can have different levels of detail, for example one can incorporate simple mass or compound conservation balances, or more advanced thermodynamic models including energy conservation laws. Mathematically the model can be expressed by a nonlinear system of equations F ( y ) = 0 {\displaystyle F(y)=0\,} in the variables y = ( y 1 , … , y n ) {\displaystyle y=(y_{1},\ldots ,y_{n})} , which incorporates all the above-mentioned system constraints (for example the mass or heat balances around a unit). A variable could be the temperature or the pressure at a certain place in the plant. === Error types === Data originates typically from measurements taken at different places throughout the industrial site, for example temperature, pressure, volumetric flow rate measurements etc. To understand the basic principles of PDR, it is important to first recognize that plant measurements are never 100% correct, i.e. raw measurement y {\displaystyle y\,} is not a solution of the nonlinear system F ( y ) = 0 {\displaystyle F(y)=0\,\!} . When using measurements without correction to generate plant balances, it is common to have incoherencies. Measurement errors can be categorized into two basic types: random errors due to intrinsic sensor accuracy and systematic errors (or gross errors) due to sensor calibration or faulty data transmission. Random errors means that the measurement y {\displaystyle y\,\!} is a random variable with mean y ∗ {\displaystyle y^{}\,\!} , where y ∗ {\displaystyle y^{}\,\!} is the true value that is typically not known. A systematic error on the other hand is characterized by a measurement y {\displaystyle y\,\!} which is a random variable with mean y ¯ {\displaystyle {\bar {y}}\,\!} , which is not equal to the true value y ∗ {\displaystyle y^{}\,} . For ease in deriving and implementing an optimal estimation solution, and based on arguments that errors are the sum of many factors (so that the Central limit theorem has some effect), data reconciliation assumes these errors are normally distributed. Other sources of errors when calculating plant balances include process faults such as leaks, unmodeled heat losses, incorrect physical properties or other physical parameters used in equations, and incorrect structure such as unmodeled bypass lines. Other errors include unmodeled plant dynamics such as holdup changes, and other instabilities in plant operations that violate steady state (algebraic) models. Additional dynamic errors arise when measurements and samples are not taken at the same time, especially lab analyses. The normal practice of using time averages for the data input partly reduces the dynamic problems. However, that does not completely resolve timing inconsistencies for infrequently-sampled data like lab analyses. This use of average values, like a moving average, acts as a low-pass filter, so high frequency noise is mostly eliminated. The result is that, in practice, data reconciliation is mainly making adjustments to correct systematic errors like biases. === Necessity of removing measurement errors === ISA-95 is the international standard for the integration of enterprise and control systems It asserts that: Data reconciliation is a serious issue for enterprise-control integration. The data have to be valid to be useful for the enterprise system. The data must often be determined from physical measurements that have associated error factors. This must usually be converted into exact values for the enterprise system. This conversion may require manual, or intelligent reconciliation of the converted values [...]. Systems must be set up to ensure that accurate data are sent to production and from production. Inadvertent operator or clerical errors may result in too much production, too little production, the wrong production, incorrect inventory, or missing inventory. == History == PDR has become more and more important due to industrial processes that are becoming more and more complex. PDR started in the early 1960s with applications aiming at closing material balances in production processes where raw measurements were available for all variables. At the same time the problem of gross error identification and elimination has been presented. In the late 1960s and 1970s unmeasured variables were taken into account in the data reconciliation process., PDR also became more mature by considering general nonlinear equation systems coming from thermodynamic models., , Quasi steady state dynamics for filtering and simultaneous parameter estimation over time were introduced in 1977 by Stanley and Mah. Dynamic PDR was formulated as a nonlinear optimization problem by Liebman et al. in 1992. == Data reconciliation == Data reconciliation is a technique that targets at correcting measurement errors that are due to measurement noise, i.e. random errors. From a statistical point of view the main assumption is that no systematic errors exist in the set of measurements, since they may bias the reconciliation results and reduce the robustness of the reconciliation. Given n {\displaystyle n} measurements y i {\displaystyle y_{i}} , data reconciliation can mathematically be expressed as an optimization problem of the following form: min x , y ∗ ∑ i = 1 n ( y i ∗ − y i σ i ) 2 subject to F ( x , y ∗ ) = 0 y min ≤ y ∗ ≤ y max x min ≤ x ≤ x max , {\displaystyle {\begin{aligned}\min _{x,y^{}}&\sum _{i=1}^{n}\left({\frac {y_{i}^{}-y_{i}}{\sigma _{i}}}\right)^{2}\\{\text{subject to }}&F(x,y^{})=0\\&y_{\min }\leq y^{}\leq y_{\max }\\&x_{\min }\leq x\leq x_{\max },\end{aligned}}\,\!} where y i ∗ {\displaystyle y_{i}^{}\,\!} is the reconciled value of the i {\displaystyle i} -th measurement ( i = 1 , … , n {\displaystyle i=1,\ldots ,n\,\!} ), y i {\displaystyle y_{i}\,\!} is the measured value of the i {\displaystyle i} -th measurement ( i = 1 , … , n {\displaystyle i=1,\ldots ,n\,\!} ), x j {\displaystyle x_{j}\,\!} is the j {\displaystyle j} -th unmeasured variable ( j = 1 , … , m {\displaystyle j=1,\ldots ,m\,\!} ), and σ i {\displaystyle \sigma _{i}\,\!} is the standard deviation of the i {\displaystyle i} -th measurement ( i = 1 , … , n {\displaystyle i=1,\ldots ,n\,\!} ), F ( x , y ∗ ) = 0 {\displaystyle F(x,y^{})=0\,\!} are the p {\displaystyle p\,\!} process equality constraints and x min , x max , y min , y max {\displaystyle x_{\min },x_{\max },y_{\min },y_{\max }\,\!} are the bounds on the measured and unmeasured variables. The term ( y i ∗ − y i σ i ) 2 {\displaystyle \left({\frac {y_{i}^{}-y_{i}}{\sigma _{i}}}\right)^{2}\,\!} is called the penalty of measurement i. The objective function is the sum of the penalties, which will be denoted in the following by f ( y ∗ ) = ∑ i = 1 n ( y i ∗ − y i σ i ) 2 {\displaystyle f(y^{})=\sum _{i=1}^{n}\left({\frac {y_{i}^{}-y_{i}}{\sigma _{i}}}\right)^{2}} . In other words, one wants to minimize the overall correction (measured in the least squares term) that is needed in order to satisfy the system constraints. Additionally, each least squares term is weighted by the standard deviation of the corresponding measurement. The standard deviation is related to the accuracy of the measurement. For example, at a 95% confidence level, the standard deviation is about half the accuracy. === Redundancy === Data reconciliation relies strongly on the concept of redundancy to correct the measurements as little as possible in order to satisfy the process constraints. Here, redundancy is defined differently from redundancy in information theory. Instead, redundancy arises from combining sensor data with the model (algebraic constraints), sometimes more specifically called "spatial redundancy", "analytical redundancy", or "topological redundancy". Redundancy can be due to sensor redundancy, where sensors are duplicated in order to have more than one measurement of the same quantity. Redundancy also arises when a single variable can be estimated in several independent ways from separate sets of measurements at a given time or time averaging period, using the algebraic constraints. Redundancy is linked to the concept

    Read more →
  • Computer network

    Computer network

    In computer science, computer engineering, and telecommunications, a network is a group of communicating computers and peripherals known as hosts, which communicate data to other hosts via communication protocols, as facilitated by networking hardware. Within a computer network, hosts are identified by network addresses, which allow networking hardware to locate and identify hosts. Hosts may also have hostnames, memorable labels for the host nodes, which can be mapped to a network address using a hosts file or a name server such as Domain Name Service. The physical medium that supports information exchange includes wired media like copper cables, optical fibers, and wireless radio-frequency media. The arrangement of hosts and hardware within a network architecture is known as the network topology. The first computer network was created in 1940 when George Stibitz connected a terminal at Dartmouth to his Complex Number Calculator at Bell Labs in New York. Today, almost all computers are connected to a computer network, such as the global Internet or embedded networks such as those found in many modern electronic devices. Many applications have only limited functionality unless they are connected to a network. Networks support applications and services, such as access to the World Wide Web, digital video and audio, application and storage servers, printers, and email and instant messaging applications. == History == === Early origins (1940 – 1960s) === In 1940, George Stibitz of Bell Labs connected a teletype at Dartmouth to a Bell Labs computer running his Complex Number Calculator to demonstrate the use of computers at long distance. This was the first real-time, remote use of a computing machine. In the late 1950s, a network of computers was built for the U.S. military Semi-Automatic Ground Environment (SAGE) radar system using the Bell 101 modem. It was the first commercial modem for computers, released by AT&T Corporation in 1958. The modem allowed digital data to be transmitted over regular unconditioned telephone lines at a speed of 110 bits per second (bit/s). In 1959, Christopher Strachey filed a patent application for time-sharing in the United Kingdom and John McCarthy initiated the first project to implement time-sharing of user programs at MIT. Strachey passed the concept on to J. C. R. Licklider at the inaugural UNESCO Information Processing Conference in Paris that year. McCarthy was instrumental in the creation of three of the earliest time-sharing systems (the Compatible Time-Sharing System in 1961, the BBN Time-Sharing System in 1962, and the Dartmouth Time-Sharing System in 1963). In 1959, Anatoly Kitov proposed to the Central Committee of the Communist Party of the Soviet Union a detailed plan for the re-organization of the control of the Soviet armed forces and of the Soviet economy on the basis of a network of computing centers. Kitov's proposal was rejected, as later was the 1962 OGAS economy management network project. During the 1960s, Paul Baran and Donald Davies independently invented the concept of packet switching for data communication between computers over a network. Baran's work addressed adaptive routing of message blocks across a distributed network, but did not include routers with software switches, nor the idea that users, rather than the network itself, would provide the reliability. Davies' hierarchical network design included high-speed routers, communication protocols and the essence of the end-to-end principle. The NPL network, a local area network at the National Physical Laboratory (United Kingdom), pioneered the implementation of the concept in 1968-69 using 768 kbit/s links. Both Baran's and Davies' inventions were seminal contributions that influenced the development of computer networks. === ARPANET (1969 – 1974) === In 1962 and 1963, J. C. R. Licklider sent a series of memos to office colleagues discussing the concept of the "Intergalactic Computer Network", a computer network intended to allow general communications among computer users. This ultimately became the basis for the ARPANET, which began in 1969. That year, the first four nodes of the ARPANET were connected using 50 kbit/s circuits between the University of California at Los Angeles, the Stanford Research Institute, the University of California, Santa Barbara, and the University of Utah. Designed principally by Bob Kahn, the network's routing, flow control, software design and network control were developed by the IMP team working for Bolt Beranek & Newman. In the early 1970s, Leonard Kleinrock carried out mathematical work to model the performance of packet-switched networks, which underpinned the development of the ARPANET. His theoretical work on hierarchical routing in the late 1970s with student Farouk Kamoun remains critical to the operation of the Internet today. In 1973, Peter Kirstein put internetworking into practice at University College London (UCL), connecting the ARPANET to British academic networks, the first international heterogeneous computer network. That same year, Robert Metcalfe wrote a formal memo at Xerox PARC describing Ethernet, a local area networking system he created with David Boggs. It was inspired by the packet radio ALOHAnet, started by Norman Abramson and Franklin Kuo at the University of Hawaii in the late 1960s. Metcalfe and Boggs, with John Shoch and Edward Taft, also developed the PARC Universal Packet for internetworking. That year, the French CYCLADES network, directed by Louis Pouzin was the first to make the hosts responsible for the reliable delivery of data, rather than this being a centralized service of the network itself. === The internet (1974 – present) === In 1974, Vint Cerf and Bob Kahn published their seminal 1974 paper on internetworking, A Protocol for Packet Network Intercommunication. Later that year, Cerf, Yogen Dalal, and Carl Sunshine wrote the first Transmission Control Protocol (TCP) specification, RFC 675, coining the term Internet as a shorthand for internetworking. In July 1976, Metcalfe and Boggs published their paper "Ethernet: Distributed Packet Switching for Local Computer Networks" and in December 1977, together with Butler Lampson and Charles P. Thacker, they received U.S. patent 4063220A for their invention. In 1976, John Murphy of Datapoint Corporation created ARCNET, a token-passing network first used to share storage devices. In 1979, Robert Metcalfe pursued making Ethernet an open standard. In 1980, Ethernet was upgraded from the original 2.94 Mbit/s protocol to the 10 Mbit/s protocol, which was developed by Ron Crane, Bob Garner, Roy Ogus, Hal Murray, Dave Redell and Yogen Dalal. In 1986, the National Science Foundation (NSF) launched the National Science Foundation Network (NSFNET) as a general-purpose research network connecting various NSF-funded sites to each other and to regional research and education networks. In 1995, the transmission speed capacity for Ethernet increased from 10 Mbit/s to 100 Mbit/s. By 1998, Ethernet supported transmission speeds of 1 Gbit/s. Subsequently, higher speeds of up to 800 Gbit/s were added (as of 2025). The scaling of Ethernet has been a contributing factor to its continued use. In the 1980s and 1990s, as embedded systems were becoming increasingly important in factories, cars, and airplanes, network protocols were developed to allow the embedded computers to communicate. In the late 1990s and 2000s, ubiquitous computing and an Internet of Things became popular. === Commercial usage === In 1960, the commercial airline reservation system semi-automatic business research environment (SABRE) went online with two connected mainframes. In 1965, Western Electric introduced the first widely used telephone switch that implemented computer control in the switching fabric. In 1972, commercial services were first deployed on experimental public data networks in Europe. Public data networks in Europe, North America and Japan began using X.25 in the late 1970s and interconnected with X.75. This underlying infrastructure was used for expanding TCP/IP networks in the 1980s. In 1977, the first long-distance fiber network was deployed by GTE in Long Beach, California. == Hardware == === Network links === The transmission media used to link devices to form a computer network include electrical cable, optical fiber, and free space. In the OSI model, the software to handle the media is defined at layers 1 and 2 — the physical layer and the data link layer. Common examples of networking technologies include: Ethernet is a widely adopted family of networking technologies that use copper and fiber media in local area networks (LAN). The media and protocol standards that enable communication between networked devices over Ethernet are defined by IEEE 802.3. Wireless LAN standards, which use radio waves. Some standards use infrared signals as a transmission medium. Power line communication uses a building's power cabling to transmit

    Read more →
  • Out-of-band control

    Out-of-band control

    Out-of-band control is a method used by network protocols for sending control information (commands, logins, or session signals) separately from the main data, improving reliability and preventing interference. File Transfer Protocol (FTP) employs an out-of-band approach, using one connection for control commands, like logging in or requesting files, and a separate connection for transferring the files themselves.

    Read more →
  • Algorithmic probability

    Algorithmic probability

    In algorithmic information theory, algorithmic probability, also known as Solomonoff probability, is a mathematical method of assigning a prior probability to a given observation. It was invented by Ray Solomonoff in the 1960s. It is used in inductive inference theory and analyses of algorithms. In his general theory of inductive inference, Solomonoff uses the method together with Bayes' rule to obtain probabilities of prediction for an algorithm's future outputs. In the mathematical formalism used, the observations have the form of finite binary strings viewed as outputs of Turing machines, and the universal prior is a probability distribution over the set of finite binary strings calculated from a probability distribution over programs (that is, inputs to a universal Turing machine). The prior is universal in the Turing-computability sense, i.e. no string has zero probability. It is not computable, but it can be approximated. Formally, the probability P {\displaystyle P} is not a probability and it is not computable. It is only "lower semi-computable" and a "semi-measure". By "semi-measure", it means that 0 ≤ ∑ x P ( x ) < 1 {\displaystyle 0\leq \sum _{x}P(x)<1} . That is, the "probability" does not actually sum up to one, unlike actual probabilities. This is because some inputs to the Turing machine causes it to never halt, which means the probability mass allocated to those inputs is lost. By "lower semi-computable", it means there is a Turing machine that, given an input string x {\displaystyle x} , can print out a sequence y 1 < y 2 < ⋯ {\displaystyle y_{1} Read more →

  • Social television

    Social television

    Social television is the union of television and social media. Millions of people now share their TV experience with other viewers on social media such as Twitter and Facebook using smartphones and tablets. TV networks and rights holders are increasingly sharing video clips on social platforms to monetise engagement and drive tune-in. The social TV market covers the technologies that support communication and social interaction around TV as well as companies that study television-related social behavior and measure social media activities tied to specific TV broadcasts – many of which have attracted significant investment from established media and technology companies. The market is also seeing numerous tie-ups between broadcasters and social networking players such as Twitter and Facebook. The market is expected to be worth $256bn by 2017. Social TV was named one of the 10 most important emerging technologies by the MIT Technology Review on Social TV in 2010. And in 2011, David Rowan, the editor of Wired magazine, named Social TV at number three of six in his peek into 2011 and what tech trends to expect to get traction. Ynon Kreiz, CEO of the Endemol Group told the audience at the Digital Life Design (DLD) conference in January 2011: "Everyone says that social television will be big. I think it's not going to be big—it's going to be huge". Much of the investment in the earlier years of social TV went into standalone social TV apps. The industry believed these apps would provide an appealing and complimentary consumer experience which could then be monetized with ads. These apps featured TV listings, check-ins, stickers and synchronised second-screen content but struggled to attract users away from Twitter and Facebook. Most of these companies have since gone out of business or been acquired amid a wave of consolidation and the market has instead focused on the activities of the social media channels themselves – such as Twitter Amplify, Facebook Suggested Videos and Snapchat Discover – and the technologies that support them. == Twitter == Twitter and Facebook are both helping users connect around media, which can provoke strong debate and engagement. Both social platforms want to be the 'digital watercooler' and host conversation around TV because the engagement and data about what media people consume can then be used to generate advertising revenue. As an open platform, conversation on Twitter is closely aligned with real-time events. In May 2013, it launched Twitter Amplify – an advertising product for media and consumer brands. With Amplify, Twitter runs video highlights from major live broadcasts, with advertisers' names and messages playing before the clip. By February 2014, all four major U.S. TV networks had signed up to the Amplify program, bringing a variety of premium TV content onto the social platform in the form of in-tweet real-time video clips. In June 2014, Twitter acquired its Twitter Amplify partner in the U.S. SnappyTV, a company that was helping broadcasters and rights holders to share video content both organically across social and via Twitter's Amplify program. Twitter continues to rely on Grabyo, which has also struck numerous deals with some of the largest broadcasters and rights holders in Europe and North America to share video content across Facebook and Twitter. == Facebook == Facebook made significant changes to its platform in 2014 including updates to its algorithm to enhance how it serves video in users' feeds. It also launched video autoplay to get users to watch the videos in their feeds. It rapidly surpassed Twitter and by the end of 2014 it was enjoying three billion video views a day on its platform and had announced a partnership with the NFL, one of Twitter's most active Twitter Amplify partners. In April 2015, at its F8 Developer Conference, it revealed it was working with Grabyo among other technology partners to bring video onto its platform. Then in July it announced it would be launching Facebook Suggested Videos, bringing related videos and ads to anyone that clicks on a video – a move that not only competed with Twitter's commercial video offering but also put it in direct competition with YouTube. == TV Time == TV Time is a television dedicated social network that allows users to keep track of the television series they watch, as well as films. It also allows them to express their reaction to the media they have seen with episode specific voting for favorite characters and emotional reaction to episodes, as well as commenting in episode restrictive pages. This way users are able to avoid spoilers while also finding a precise audience and community for each of their interactions, as opposed to bigger, non-television dedicated social medias such as Facebook and Twitter where the likelihood of unintentionally reading spoilers is much higher. TV Time offers an analytics service called "TVLytics" where the votes and reactions collected from users can be studied for research and television production purposes. == Advertising == According to Businessinsider.com, there are variety of applications for social TV, including support for TV ad sales, optimizing TV ad buys, making ad buys more efficient, as a complement to audience measurement, and eventually, audience forecasting and real-time optimization. Social TV data can ease access to focus groups and may create a positive feedback loop for generating ultra-sticky TV programming and multi-screen ad campaigns. == In numbers == Viewers share their TV experience on social media in real-time as events unfold: between 88-100m Facebook users login to the platform during the primetime hours of 8pm – 11pm in the US. The volume of social media engagement in TV is also rising – according to Nielsen SocialGuide, there was a 38% increase in tweets about TV in 2013 to 263m. For the 2014 Super Bowl, Twitter reported that a record 24.9 million tweets about the game were sent during the telecast, peaking at 381,605 tweets per minute. Facebook reported that 50 million people discussed the Super Bowl, generating 185 million interactions. The 2014 Oscars generated 5m tweets, viewed by an audience of 37m unique Twitter users and delivering 3.3bn impressions globally as conversation and key moments were shared virally across the platform. In 2014 the All England Lawn Tennis Club (AELTC), hosts of Wimbledon, used Grabyo to share video content across social. The videos were viewed 3.5 million times across Facebook and Twitter. In partnered with Grabyo again in 2015 and the videos generated over 48 million views across Facebook and Twitter. == Television shows with social integration == Here are some examples of how TV executives are integrating social elements with TV shows: C-SPAN streamed tweets from US Senators and Representatives during the quorum call The Voice had the judges of the program tweet during the show and the posts scrolls on the bottom of the screen. The use of Twitter also led to an increase in viewers. "Glee" Entertainment Weekly created a second screen viewing platform for the Glee season 3 premiere. == Related publications == Erika Jonietz. "Making TV Social, Virtually" MIT Technology Review. (January 11, 2010) AmigoTV (Alcatel-Lucent; Coppens et al.) – 2004 www.ist-ipmedianet.org/Alcatel_EuroiTV2004_AmigoTV_short_paper_S4-2.pdf Nextream (MIT Media Lab, Martin et al.) – 2010 Social Interactive Television: Immersive Shared Experiences and Perspectives (P. Cesar, D. Geerts, and K. Chorianopoulos (eds.)) – 2009 Social TV and the Emergence of Interactive TV – Multimedia Research Group – November 2010 Interactive Social TV on Service Oriented Environments: Challenges and Enablers (May 2011) == Systems == Boxee – acquired by Samsung GetGlue – acquired by i.TV Grabyo KIT digital Miso TV Tank Top TV WiO Xbox Live

    Read more →
  • SFINKS

    SFINKS

    Sfinks (Polish for "Sphynx") was also the initial name of the Janusz A. Zajdel Award In cryptography, SFINKS is a stream cypher algorithm developed by An Braeken, Joseph Lano, Nele Mentens, Bart Preneel, and Ingrid Verbauwhede. It includes a message authentication code. It has been submitted to the eSTREAM Project of the eCRYPT network. In 2005, Nicolas T. Courtois noted that, while the cipher is elegant and secure against some simple algebraic attacks, it is vulnerable to more elaborate known attacks.

    Read more →
  • Unknown key-share attack

    Unknown key-share attack

    As defined by Blake-Wilson & Menezes (1999), an unknown key-share (UKS) attack on an authenticated key agreement (AK) or authenticated key agreement with key confirmation (AKC) protocol is an attack whereby an entity A {\displaystyle A} ends up believing she shares a key with B {\displaystyle B} , and although this is in fact the case, B {\displaystyle B} mistakenly believes the key is instead shared with an entity E ≠ A {\displaystyle E\neq A} . In other words, in a UKS, an opponent, say Eve, coerces honest parties Alice and Bob into establishing a secret key where at least one of Alice and Bob does not know that the secret key is shared with the other. For example, Eve may coerce Bob into believing he shares the key with Eve, while he actually shares the key with Alice. The “key share” with Alice is thus unknown to Bob.

    Read more →
  • Information space analysis

    Information space analysis

    Within the field of information science, information space analysis is a deterministic method, enhanced by machine intelligence, for locating and assessing resources for team-centric efforts. Organizations need to be able to quickly assemble teams backed by the support services, information, and material to do the job. To do so, these teams need to find and assess sources of services that are potential participants in the team effort. To support this initial team and resource development, information needs to be developed via analysis tools that help make sense of sets of data sources in an Intranet or Internet. Part of the process is to characterize them, partition them, and sort and filter them. These tools focus on three key issues in forming a collaborative team: Help individuals responsible for forming the team understand what is available. Assist team members in identifying the structure and categorize the information available to them in a manner specifically suited to the task at hand. Aid team members to understand the mappings of their information between their organization and that used by others who might participate. Information space analysis tools combine multiple methods to assist in this task. This causes the tools to be particularly well-suited to integrating additional technologies in order to create specialized systems.

    Read more →
  • Signals intelligence

    Signals intelligence

    Signals intelligence (SIGINT) is the act and field of intelligence-gathering by interception of signals, whether communications between people (communications intelligence—abbreviated to COMINT) or from electronic signals not directly used in communication (electronic intelligence—abbreviated to ELINT). As classified and sensitive information is usually encrypted, signals intelligence may necessarily involve cryptanalysis (to decipher the messages). Traffic analysis—the study of who is signaling to whom and in what quantity—is also used to integrate information, and it may complement cryptanalysis. == History == === Origins === Electronic interceptions appeared as early as 1900, during the Boer War of 1899–1902. The British Royal Navy had installed wireless sets produced by Marconi on board their ships in the late 1890s, and the British Army used some limited wireless signalling. The Boers captured some wireless sets and used them to make vital transmissions. Since the British were the only people transmitting at the time, the British did not need special interpretation of the signals that they were. The birth of signals intelligence in a modern sense dates from the Russo-Japanese War of 1904–1905. As the Russian fleet prepared for conflict with Japan in 1904, the British ship HMS Diana stationed in the Suez Canal intercepted Russian naval wireless signals being sent out for the mobilization of the fleet, for the first time in history. === Development in World War I === Over the course of the First World War, a new method of signals intelligence reached maturity. Russia's failure to properly protect its communications fatally compromised the Russian Army's advance early in World War I and led to their disastrous defeat by the Germans under Ludendorff and Hindenburg at the Battle of Tannenberg. In 1918, French intercept personnel captured a message written in the new ADFGVX cipher, which was cryptanalyzed by Georges Painvin. This gave the Allies advance warning of the German 1918 Spring Offensive. The British in particular, built up great expertise in the newly emerging field of signals intelligence and codebreaking (synonymous with cryptanalysis). On the declaration of war, Britain cut all German undersea cables. This forced the Germans to communicate exclusively via either (A) a telegraph line that connected through the British network and thus could be tapped; or (B) through radio which the British could then intercept. Rear Admiral Henry Oliver appointed Sir Alfred Ewing to establish an interception and decryption service at the Admiralty; Room 40. An interception service known as 'Y' service, together with the post office and Marconi stations, grew rapidly to the point where the British could intercept almost all official German messages. The German fleet was in the habit each day of wirelessing the exact position of each ship and giving regular position reports when at sea. It was possible to build up a precise picture of the normal operation of the High Seas Fleet, to infer from the routes they chose where defensive minefields had been placed and where it was safe for ships to operate. Whenever a change to the normal pattern was seen, it immediately signalled that some operation was about to take place, and a warning could be given. Detailed information about submarine movements was also available. The use of radio-receiving equipment to pinpoint the location of any single transmitter was also developed during the war. Captain H.J. Round, working for Marconi, began carrying out experiments with direction-finding radio equipment for the army in France in 1915. By May 1915, the Admiralty was able to track German submarines crossing the North Sea. Some of these stations also acted as 'Y' stations to collect German messages, but a new section was created within Room 40 to plot the positions of ships from the directional reports. Room 40 played an important role in several naval engagements during the war, notably in detecting major German sorties into the North Sea. The battle of Dogger Bank was won in no small part due to the intercepts that allowed the Navy to position its ships in the right place. It played a vital role in subsequent naval clashes, including at the Battle of Jutland as the British fleet was sent out to intercept them. The direction-finding capability allowed for the tracking and location of German ships, submarines, and Zeppelins. The system was so successful that by the end of the war, over 80 million words, comprising the totality of German wireless transmission over the course of the war, had been intercepted by the operators of the Y-stations and decrypted. However, its most astonishing success was in decrypting the Zimmermann Telegram, a telegram from the German Foreign Office sent via Washington to its ambassador Heinrich von Eckardt in Mexico. === Postwar consolidation === With the importance of interception and decryption firmly established by the wartime experience, countries established permanent agencies dedicated to this task in the interwar period. In 1919, the British Cabinet's Secret Service Committee, chaired by Lord Curzon, recommended that a peace-time codebreaking agency should be created. The Government Code and Cypher School (GC&CS) was the first peace-time codebreaking agency, with a public function "to advise as to the security of codes and cyphers used by all Government departments and to assist in their provision", but also with a secret directive to "study the methods of cypher communications used by foreign powers". GC&CS officially formed on 1 November 1919, and produced its first decrypt on 19 October. By 1940, GC&CS was working on the diplomatic codes and ciphers of 26 countries, tackling over 150 diplomatic cryptosystems. The US Cipher Bureau was established in 1919 and achieved some success at the Washington Naval Conference in 1921, through cryptanalysis by Herbert Yardley. Secretary of War Henry L. Stimson closed the US Cipher Bureau in 1929 with the words "Gentlemen do not read each other's mail." === World War II === The use of SIGINT had even greater implications during World War II. The combined effort of intercepts and cryptanalysis for the whole of the British forces in World War II came under the code name "Ultra", managed from Government Code and Cypher School at Bletchley Park. Properly used, the German Enigma and Lorenz ciphers should have been virtually unbreakable, but flaws in German cryptographic procedures, and poor discipline among the personnel carrying them out, created vulnerabilities which made Bletchley's attacks feasible. Bletchley's work was essential to defeating the U-boats in the Battle of the Atlantic, and to the British naval victories in the Battle of Cape Matapan and the Battle of North Cape. In 1941, Ultra exerted a powerful effect on the North African desert campaign against German forces under General Erwin Rommel. General Sir Claude Auchinleck wrote that were it not for Ultra, "Rommel would have certainly got through to Cairo". Ultra decrypts featured prominently in the story of Operation SALAM, László Almásy's mission across the desert behind Allied lines in 1942. Prior to the Normandy landings on D-Day in June 1944, the Allies knew the locations of all but two of Germany's fifty-eight Western Front divisions. Winston Churchill was reported to have told King George VI: "It is thanks to the secret weapon of General Menzies, put into use on all the fronts, that we won the war!" Supreme Allied Commander, Dwight D. Eisenhower, at the end of the war, described Ultra as having been "decisive" to Allied victory. Official historian of British Intelligence in World War II Sir Harry Hinsley argued that Ultra shortened the war "by not less than two years and probably by four years"; and that, in the absence of Ultra, it is uncertain how the war would have ended. At a lower level, German cryptanalysis, direction finding, and traffic analysis were vital to Rommel's early successes in the Western Desert Campaign until British forces tightened their communications discipline and Australian raiders destroyed his principal SIGINT Company. == Technical definitions == The United States Department of Defense has defined the term "signals intelligence" as: A category of intelligence comprising either individually or in combination all communications intelligence (COMINT), electronic intelligence (ELINT), and foreign instrumentation signals intelligence (FISINT), however transmitted. Intelligence derived from communications, electronic, and foreign instrumentation signals. Being a broad field, SIGINT has many sub-disciplines. The two main ones are communications intelligence (COMINT) and electronic intelligence (ELINT). == Disciplines shared across the branches == === Targeting === A collection system has to know to look for a particular signal. "System", in this context, has several nuances. Targeting is the process of developing collection requirements: "1. A

    Read more →
  • Data lineage

    Data lineage

    Data lineage refers to the process of tracking how data is generated, transformed, transmitted and used across systems over time. It documents data's origins, transformations and movements, providing detailed visibility into its life cycle. This process simplifies the identification of errors in data analytics workflows, by enabling users to trace issues back to their root causes. Data lineage facilitates the ability to replay specific segments or inputs of the dataflow. This can be used in debugging or regenerating lost outputs. In database systems, this concept is closely related to data provenance, which involves maintaining records of inputs, entities, systems and processes that influence data. Data provenance provides a historical record of data origins and transformations. It supports forensic activities such as data-dependency analysis, error/compromise detection, recovery, auditing and compliance analysis: "Lineage is a simple type of why provenance." Data governance plays a critical role in managing metadata by establishing guidelines, strategies and policies. Enhancing data lineage with data quality measures and master data management adds business value. Although data lineage is typically represented through a graphical user interface (GUI), the methods for gathering and exposing metadata to this interface can vary. Based on the metadata collection approach, data lineage can be categorized into three types: Those involving software packages for structured data, programming languages and Big data systems. Data lineage information includes technical metadata about data transformations. Enriched data lineage may include additional elements such as data quality test results, reference data, data models, business terminology, data stewardship information, program management details and enterprise systems associated with data points and transformations. Data lineage visualization tools often include masking features that allow users to focus on information relevant to specific use cases. To unify representations across disparate systems, metadata normalization or standardization may be required. == Representation of data lineage == Representation broadly depends on the scope of the metadata management and reference point of interest. Data lineage provides sources of the data and intermediate data flow hops from the reference point with backward data lineage, leading to the final destination's data points and its intermediate data flows with forward data lineage. These views can be combined with end-to-end lineage for a reference point that provides a complete audit trail of that data point of interest from sources to their final destinations. As the data points or hops increase, the complexity of such representation becomes incomprehensible. Thus, the best feature of the data lineage view is the ability to simplify the view by temporarily masking unwanted peripheral data points. Tools with the masking feature enable scalability of the view and enhance analysis with the best user experience for both technical and business users. Data lineage also enables companies to trace sources of specific business data to track errors, implement changes in processes and implementing system migrations to save significant amounts of time and resources. Data lineage can improve efficiency in business intelligence BI processes. Data lineage can be represented visually to discover the data flow and movement from its source to destination via various changes and hops on its way in the enterprise environment. This includes how the data is transformed along the way, how the representation and parameters change and how the data splits or converges after each hop. A simple representation of the Data Lineage can be shown with dots and lines, where dots represent data containers for data points, and lines connecting them represent transformations the data undergoes between the data containers. Data lineage can be visualized at various levels based on the granularity of the view. At a very high-level, data lineage is visualized as systems that the data interacts with before it reaches its destination. At its most granular, visualizations at the data point level can provide the details of the data point and its historical behavior, attribute properties and trends and data quality of the data passed through that specific data point in the data lineage. The scope of the data lineage determines the volume of metadata required to represent its data lineage. Usually, data governance and data management of an organization determine the scope of the data lineage based on their regulations, enterprise data management strategy, data impact, reporting attributes and critical data elements of the organization. == Rationale == Distributed systems like Google Map Reduce, Microsoft Dryad, Apache Hadoop (an open-source project) and Google Pregel provide such platforms for businesses and users. However, even with these systems, Big Data analytics can take several hours, days or weeks to run, simply due to the data volumes involved. For example, a ratings prediction algorithm for the Netflix Prize challenge took nearly 20 hours to execute on 50 cores, and a large-scale image processing task to estimate geographic information took 3 days to complete using 400 cores. "The Large Synoptic Survey Telescope is expected to generate terabytes of data every night and eventually store more than 50 petabytes, while in the bioinformatics sector, the 12 largest genome sequencing houses in the world now store petabytes of data apiece. It is very difficult for a data scientist to trace an unknown or an unanticipated result. === Big data debugging === Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Machine learning, among other algorithms, is used to transform and analyze the data. Due to the large size of the data, there could be unknown features in the data. The massive scale and unstructured nature of data, the complexity of these analytics pipelines, and long runtimes pose significant manageability and debugging challenges. Even a single error in these analytics can be extremely difficult to identify and remove. While one may debug them by re-running the entire analytics through a debugger for stepwise debugging, this can be expensive due to the amount of time and resources needed. Auditing and data validation are other major problems due to the growing ease of access to relevant data sources for use in experiments, the sharing of data between scientific communities and use of third-party data in business enterprises. As such, more cost-efficient ways of analyzing data intensive scale-able computing (DISC) are crucial to their continued effective use. === Challenges in Big Data debugging === ==== Massive scale ==== According to an EMC/IDC study, 2.8 ZB of data were created and replicated in 2012. Furthermore, the same study states that the digital universe will double every two years between now and 2020, and that there will be approximately 5.2 TB of data for every person in 2020. Based on current technology, the storage of this much data will mean greater energy usage by data centers. ==== Unstructured data ==== Unstructured data usually refers to information that doesn't reside in a traditional row-column database. Unstructured data files often include text and multimedia content, such as e-mail messages, word processing documents, videos, photos, audio files, presentations, web pages and many other kinds of business documents. While these types of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly into a database. The amount of unstructured data in enterprises is growing many times faster than structured databases are growing. Big data can include both structured and unstructured data, but IDC estimates that 90 percent of Big Data is unstructured data. The fundamental challenge of unstructured data sources is that they are difficult for non-technical business users and data analysts alike to unbox, understand and prepare for analytic use. Beyond issues of structure, the sheer volume of this type of data contributes to such difficulty. Because of this, current data mining techniques often leave out valuable information and make analyzing unstructured data laborious and expensive. In today's competitive business environment, companies have to find and analyze the relevant data they need quickly. The challenge is going through the volumes of data and accessing the level of detail needed, all at a high speed. The challenge only grows as the degree of granularity increases. One possible solution is hardware. Some vendors are using increased memory and parallel processing to crunch large volumes of data quickly. Another method is putting data in-memory but using a grid

    Read more →