AI Assistant Vs AI Agent

AI Assistant Vs AI Agent — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Learning to rank

    Learning to rank

    Learning to rank (LTR) or machine-learned ranking (MLR) is the application of machine learning, often supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval and recommender systems. Training data may, for example, consist of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment (e.g. "relevant" or "not relevant") for each item. The goal of constructing the ranking model is to rank new, unseen lists in a similar way to rankings in the training data. == Applications == === In information retrieval === Ranking is a central part of many information retrieval problems, such as document retrieval, collaborative filtering, sentiment analysis, and online advertising. A possible architecture of a machine-learned search engine is shown in the accompanying figure. Training data consists of queries and documents matching them together with the relevance degree of each match. It may be prepared manually by human assessors (or raters, as Google calls them), who check results for some queries and determine relevance of each result. It is not feasible to check the relevance of all documents, and so typically a technique called pooling is used — only the top few documents, retrieved by some existing ranking models are checked. This technique may introduce selection bias. Alternatively, training data may be derived automatically by analyzing clickthrough logs (i.e. search results which got clicks from users), query chains, or such search engines' features as Google's (since-replaced) SearchWiki. Clickthrough logs can be biased by the tendency of users to click on the top search results on the assumption that they are already well-ranked. Training data is used by a learning algorithm to produce a ranking model which computes the relevance of documents for actual queries. Typically, users expect a search query to complete in a short time (such as a few hundred milliseconds for web search), which makes it impossible to evaluate a complex ranking model on each document in the corpus, and so a two-phase scheme is used. First, a small number of potentially relevant documents are identified using simpler retrieval models which permit fast query evaluation, such as the vector space model, Boolean model, weighted AND, or BM25. This phase is called top- k {\displaystyle k} document retrieval and many heuristics were proposed in the literature to accelerate it, such as using a document's static quality score and tiered indexes. In the second phase, a more accurate but computationally expensive machine-learned model is used to re-rank these documents. === In other areas === Learning to rank algorithms have been applied in areas other than information retrieval: In machine translation for ranking a set of hypothesized translations; In computational biology for ranking candidate 3-D structures in protein structure prediction problems; In recommender systems for identifying a ranked list of related news articles to recommend to a user after he or she has read a current news article. == Feature vectors == For the convenience of MLR algorithms, query-document pairs are usually represented by numerical vectors, which are called feature vectors. Such an approach is sometimes called bag of features and is analogous to the bag of words model and vector space model used in information retrieval for representation of documents. Components of such vectors are called features, factors or ranking signals. They may be divided into three groups (features from document retrieval are shown as examples): Query-independent or static features — those features, which depend only on the document, but not on the query. For example, PageRank or document's length. Such features can be precomputed in off-line mode during indexing. They may be used to compute document's static quality score (or static rank), which is often used to speed up search query evaluation. Query-dependent or dynamic features — those features, which depend both on the contents of the document and the query, such as TF-IDF score or other non-machine-learned ranking functions. Query-level features or query features, which depend only on the query. For example, the number of words in a query. Some examples of features, which were used in the well-known LETOR dataset: TF, TF-IDF, BM25, and language modeling scores of document's zones (title, body, anchors text, URL) for a given query; Lengths and IDF sums of document's zones; Document's PageRank, HITS ranks and their variants. Selecting and designing good features is an important area in machine learning, which is called feature engineering. == Evaluation measures == There are several measures (metrics) which are commonly used to judge how well an algorithm is doing on training data and to compare the performance of different MLR algorithms. Often a learning-to-rank problem is reformulated as an optimization problem with respect to one of these metrics. Examples of ranking quality measures: Mean average precision (MAP); DCG and NDCG; Precision@n, NDCG@n, where "@n" denotes that the metrics are evaluated only on top n documents; Mean reciprocal rank; Kendall's tau; Spearman's rho. DCG and its normalized variant NDCG are usually preferred in academic research when multiple levels of relevance are used. Other metrics such as MAP, MRR and precision, are defined only for binary judgments. Recently, there have been proposed several new evaluation metrics which claim to model user's satisfaction with search results better than the DCG metric: Expected reciprocal rank (ERR); Yandex's pfound. Both of these metrics are based on the assumption that the user is more likely to stop looking at search results after examining a more relevant document, than after a less relevant document. == Approaches == Learning to Rank approaches are often categorized using one of three approaches: pointwise (where individual documents are ranked), pairwise (where pairs of documents are ranked into a relative order), and listwise (where an entire list of documents are ordered). Tie-Yan Liu of Microsoft Research Asia has analyzed existing algorithms for learning to rank problems in his book Learning to Rank for Information Retrieval. He categorized them into three groups by their input spaces, output spaces, hypothesis spaces (the core function of the model) and loss functions: the pointwise, pairwise, and listwise approach. In practice, listwise approaches often outperform pairwise approaches and pointwise approaches. This statement was further supported by a large scale experiment on the performance of different learning-to-rank methods on a large collection of benchmark data sets. In this section, without further notice, x {\displaystyle x} denotes an object to be evaluated, for example, a document or an image, f ( x ) {\displaystyle f(x)} denotes a single-value hypothesis, h ( ⋅ ) {\displaystyle h(\cdot )} denotes a bi-variate or multi-variate function and L ( ⋅ ) {\displaystyle L(\cdot )} denotes the loss function. === Pointwise approach === In this case, it is assumed that each query-document pair in the training data has a numerical or ordinal score. Then the learning-to-rank problem can be approximated by a regression problem — given a single query-document pair, predict its score. Formally speaking, the pointwise approach aims at learning a function f ( x ) {\displaystyle f(x)} predicting the real-value or ordinal score of a document x {\displaystyle x} using the loss function L ( f ; x j , y j ) {\displaystyle L(f;x_{j},y_{j})} . A number of existing supervised machine learning algorithms can be readily used for this purpose. Ordinal regression and classification algorithms can also be used in pointwise approach when they are used to predict the score of a single query-document pair, and it takes a small, finite number of values. === Pairwise approach === In this case, the learning-to-rank problem is approximated by a classification problem — learning a binary classifier h ( x u , x v ) {\displaystyle h(x_{u},x_{v})} that can tell which document is better in a given pair of documents. The classifier shall take two documents as its input and the goal is to minimize a loss function L ( h ; x u , x v , y u , v ) {\displaystyle L(h;x_{u},x_{v},y_{u,v})} . The loss function typically reflects the number and magnitude of inversions in the induced ranking. In many cases, the binary classifier h ( x u , x v ) {\displaystyle h(x_{u},x_{v})} is implemented with a scoring function f ( x ) {\displaystyle f(x)} . As an example, RankNet adapts a probability model and defines h ( x u , x v ) {\displaystyle h(x_{u},x_{v})} as the estimated probability of the document x u {\displaystyle x_{u}} has higher quality than x v {\displaystyle x_{v}} : P u , v ( f ) = CDF ( f ( x u ) − f ( x v ) ) , {\displaystyle P_{u,v}(f)={\text{CDF}

    Read more →
  • Multistage interconnection networks

    Multistage interconnection networks

    Multistage interconnection networks (MINs) are a class of high-speed computer networks usually composed of processing elements (PEs) on one end of the network and memory elements (MEs) on the other end, connected by switching elements (SEs). The switching elements themselves are usually connected to each other in stages, hence the name. MINs are typically used in high-performance or parallel computing as a low-latency interconnection (as opposed to traditional packet switching networks), though they could be implemented on top of a packet switching network. Though the network is typically used for routing purposes, it could also be used as a co-processor to the actual processors for such uses as sorting; cyclic shifting, as in a perfect shuffle network; and bitonic sorting. == Background == Interconnection network are used to connect nodes, where nodes can be a single processor or group of processors, to other nodes. Interconnection networks can be categorized on the basis of their topology. Topology is the pattern in which one node is connected to other nodes. There are two main types of topology: static and dynamic. Static interconnect networks are hard-wired and cannot change their configurations. A regular static interconnect is mainly used in small networks made up of loosely couple nodes. The regular structure signifies that the nodes are arranged in specific shape and the shape is maintained throughout the networks. Some examples of static regular interconnections are: Completely connected network In a mesh network, multiple nodes are connected with each other. Each node in the network is connected to every other node in the network. This arrangement allows proper communication of the data between the nodes. But, there are a lot of communication overheads due to the increased number of node connections. Shared busThis network topology involves connection of the nodes with each other over a bus. Every node communicates with every other node using the bus. The bus utility ensures that no data is sent to the wrong node. But, the bus traffic is an important parameter which can affect the system. RingThis is one of the simplest ways of connecting nodes with each other. The nodes are connected with each other to form a ring. For a node to communicate with some other node, it has to send the messages to its neighbor. Therefore, the data message passes through a series of other nodes before reaching the destination. This involves increased latency in the system. TreeThis topology involves connection of the nodes to form a tree. The nodes are connected to form clusters and the clusters are in-turn connected to form the tree. This methodology causes increased complexity in the network. Hypercube This topology consists of connections of the nodes to form cubes. The nodes are also connected to the nodes on the other cubes. ButterflyThis is one of the most complex connections of the nodes. As the figure suggests, there are nodes which are connected and arranged in terms of their ranks. They are arranged in the form of a matrix. In dynamic interconnect networks, the nodes are interconnected via an array of simple switching elements. This interconnection can then be changed by use of routing algorithms, such that the path from one node to other nodes can be varied. Dynamic interconnections can be classified as: Single stage Interconnect Network Multistage interconnect Network Crossbar switch connections == Crossbar Switch Connections == In crossbar switch, there is a dedicated path from one processor to other processors. Thus, if there are n inputs and m outputs, we will need nm switches to realize a crossbar. As the number of outputs increases, the number of switches increases by factor of n. For large network this will be a problem. An alternative to this scheme is staged switching. == Single Stage Interconnect Network == In a single stage interconnect network, the input nodes are connected to output via a single stage of switches. The figure shows 88 single stage switch using shuffle exchange. As one can see, from a single shuffle, not all input can reach all output. Multiple shuffles are required for all inputs to be connected to all the outputs. == Multistage Interconnect Network == A multistage interconnect network is formed by cascading multiple single stage switches. The switches can then use their own routing algorithm, or be controlled by a centralized router, to form a completely interconnected network. Multistage Interconnect Network can be classified into three types: Non-blocking: A non-blocking network can connect any idle input to any idle output, regardless of the connections already established across the network. Crossbar is an example of this type of network. Rearrangeable non-blocking: This type of network can establish all possible connections between inputs and outputs by rearranging its existing connections. Blocking: This type of network cannot realize all possible connections between inputs and outputs. This is because a connection between one free input to another free output is blocked by an existing connection in the network. The number of switching elements required to realize a non-blocking network in highest, followed by rearrangeable non-blocking. Blocking network uses least switching elements. == Examples == Multiple types of multistage interconnection networks exist. === Omega network === An Omega network consists of multiple stages of 22 switching elements. Each input has a dedicated connection to an output. An NN omega network has log2(N) stages and N/2 switching elements in each stage for a perfect shuffle between stages. Thus the network has complexity of 0(N log(N)). Each switching element can employ its own switching algorithm. Consider an 88 omega network. There are 8! = 40320 1-to-1 mappings from input to output. There are 12 switching element for a total permutation of 2^12 = 4096. Thus, it is a blocking network. === Clos network === A Clos network uses 3 stages to switch from N inputs to N outputs. In the first stage, there are r= N/n crossbar switches and each switch is of size nm. In the second stage there are m switches of size rr and finally the last stage is a mirror of the first stage with r switches of size mn. A clos network will be completely non-blocking if m >= 2n-1. The number of connections, though more than omega network is much less than that of a crossbar network. === Beneš network === A Beneš network is a rearrangeably non-blocking network derived from the clos network by initializing n = m = 2. There are (2log2(N) - 1) stages, with each stage containing N/2 22 crossbar switches. An 88 Beneš network has 5 stages of switching elements, and each stage has 4 switching elements. The center three stages has two 44 benes network. The 44 Beneš network, can connect any input to any output recursively.

    Read more →
  • Secret London

    Secret London

    Secret London is a Facebook group started by 21-year-old Bristol University graduate, Tiffany Philippou, on 19 January 2010 in response to a Saatchi & Saatchi competition. The group grew rapidly (180,000 members as of 8 February 2010) and is composed mostly of Londoners who use the site to share suggestions and photos of London. After the group's early success, the founder announced her intention to launch a website of the same name by crowdsourcing the design and development. The website was launched on 16 February 2010. == Other secret cities == Following the initial success of Secret London, a number of other secret groups were independently started around the world, some of which already have over 100,000 users. As of 19 February 2010, the list of other groups includes: Secret Frankfurt, Secret Tel Aviv, Secret Paris, Secret New York, Secret Tokyo, Secret Toronto, Secret Los Angeles, Secret Exeter, Secret Boston, Secret Norwich, Secret Singapore, Secret Brighton, Secret Minneapolis, Secret Sydney, Secret Canberra, Secret Brisbane, Secret Wellington, Secret Christchurch, Secret Madeira, Secret Funchal, Secret Bristol and Secret Cardiff. == Controversy == Some commentators have questioned whether it possible to share secrets without compromising them, and whether sharing tips publicly will lead to over-exposure of the businesses who are recommended.

    Read more →
  • Internet

    Internet

    The Internet (or internet) is the global system of interconnected computer networks that uses the Internet protocol suite (TCP/IP) to communicate between networks and devices. It is a network of networks that comprises private, public, academic, business, and government networks of local to global scope, linked by electronic, wireless, and optical networking technologies. The Internet carries a vast range of information services and resources, such as the interlinked hypertext documents and applications of the World Wide Web (WWW), electronic mail, discussion groups, internet telephony, streaming media and file sharing. Most traditional communication media, including telephone, radio, television, paper mail, newspapers, and print publishing, have been transformed by the Internet, giving rise to new media such as email, online music, digital newspapers, news aggregators, and audio and video streaming websites. The Internet has enabled and accelerated new forms of personal interaction through instant messaging, Internet forums, and social networking services. Online shopping has also grown to occupy a significant market across industries, enabling firms to extend brick and mortar presences to serve larger markets. Business-to-business and financial services on the Internet affect supply chains across entire industries. The origins of the Internet date back to research that enabled the time-sharing of computer resources, the development of packet switching, and the design of computer networks for data communication. The set of communication protocols to enable internetworking on the Internet arose from research and development commissioned in the 1970s by the Defense Advanced Research Projects Agency (DARPA) of the United States Department of Defense in collaboration with universities and researchers across the United States, United Kingdom and France. The Internet has no single centralized governance in either technological implementation or policies for access and usage. Each constituent network sets its own policies. The overarching definitions of the two principal name spaces on the Internet, the Internet Protocol address (IP address) space and the Domain Name System (DNS), are directed by a maintainer organization, the Internet Corporation for Assigned Names and Numbers (ICANN). The technical underpinning and standardization of the core protocols is an activity of the non-profit Internet Engineering Task Force (IETF). == Terminology == The word internetted was used as early as 1849, meaning interconnected or interwoven. The word Internet was used in 1945 by the United States War Department in a radio operator's manual, and in 1974 as the shorthand form of Internetwork. Today, the term Internet most commonly refers to the global system of interconnected computer networks, though it may also refer to any group of smaller networks. The word Internet may be capitalized as a proper noun, although this is becoming less common. This reflects the tendency in English to capitalize new terms and move them to lowercase as they become familiar. The word is sometimes still capitalized to distinguish the global internet from smaller networks, though many publications, including the AP Stylebook since 2016, recommend the lowercase form in every case. In 2016, the Oxford English Dictionary found that, based on a study of around 2.5 billion printed and online sources, "Internet" was capitalized in 54% of cases. The terms Internet and World Wide Web are often used interchangeably; it is common to speak of "going on the Internet" when using a web browser to view web pages. However, the World Wide Web, or the Web, is only one of a large number of Internet services. It is the global collection of web pages, documents and other web resources linked by hyperlinks and URLs. == History == === 1960s === In the 1960s, computer scientists began developing systems for time-sharing of computer resources. J. C. R. Licklider proposed the idea of a universal network while working at Bolt Beranek & Newman and, later, leading the Information Processing Techniques Office at the Advanced Research Projects Agency (ARPA) of the United States Department of Defense. Research into packet switching, one of the fundamental Internet technologies, started in the work of Paul Baran at RAND in the early 1960s and, independently, Donald Davies at the United Kingdom's National Physical Laboratory in 1965. After the Symposium on Operating Systems Principles in 1967, packet switching from the proposed NPL network was incorporated into the design of the ARPANET, an experimental resource sharing network proposed by ARPA. ARPANET development began with two network nodes which were interconnected between the University of California, Los Angeles and the Stanford Research Institute on 29 October 1969. The third site was at the University of California, Santa Barbara, followed by the University of Utah. === 1970s === By the end of 1971, 15 sites were connected to the young ARPANET. Thereafter, the ARPANET gradually developed into a decentralized communications network, connecting remote centers and military bases in the United States. Other user networks and research networks, such as the Merit Network and CYCLADES, were developed in the late 1960s and early 1970s. Early international collaborations for the ARPANET were rare. Connections were made in 1973 to Norway (NORSAR and, later, NDRE) and to Peter Kirstein's research group at University College London, which provided a gateway to British academic networks, the first internetwork for resource sharing. ARPA projects, the International Network Working Group and commercial initiatives led to the development of various protocols and standards by which multiple separate networks could become a single network, or a network of networks. In 1974, Vint Cerf at Stanford University and Bob Kahn at DARPA published a proposal for "A Protocol for Packet Network Intercommunication". Cerf and his graduate students used the term internet as a shorthand for internetwork in RFC 675. The Internet Experiment Notes and later RFCs repeated this use. The work of Louis Pouzin and Robert Metcalfe had important influences on the resulting TCP/IP design. National PTTs and commercial providers developed the X.25 standard and deployed it on public data networks. === 1980s === The ARPANET initially served as a backbone for the interconnection of regional academic and military networks in the United States to enable resource sharing. Access to the ARPANET was expanded in 1981 when the National Science Foundation (NSF) funded the Computer Science Network (CSNET). In 1982, the Internet Protocol Suite (TCP/IP) was standardized, which facilitated worldwide proliferation of interconnected networks. TCP/IP network access expanded again in 1986 when the National Science Foundation Network (NSFNet) provided access to supercomputer sites in the United States for researchers, first at speeds of 56 kbit/s and later at 1.5 Mbit/s and 45 Mbit/s. The NSFNet expanded into academic and research organizations in Europe, Australia, New Zealand and Japan in 1988–89. Although other network protocols such as UUCP and PTT public data networks had global reach well before this time, this marked the beginning of the Internet as an intercontinental network. Commercial Internet service providers emerged in 1989 in the United States and Australia. The ARPANET was decommissioned in 1990. === 1990s === The linking of commercial networks and enterprises by the early 1990s, as well as the advent of the World Wide Web, marked the beginning of the transition to the modern Internet. Steady advances in semiconductor technology and optical networking created new economic opportunities for commercial involvement in the expansion of the network in its core and for delivering services to the public. In mid-1989, MCI Mail and Compuserve established connections to the Internet, delivering email and public access products to the half million users of the Internet. Just months later, on 1 January 1990, PSInet launched an alternate Internet backbone for commercial use; one of the networks that added to the core of the commercial Internet of later years. In March 1990, the first high-speed T1 (1.5 Mbit/s) link between the NSFNET and Europe was installed between Cornell University and CERN, allowing much more robust communications than were capable with satellites. Later in 1990, Tim Berners-Lee began writing WorldWideWeb, the first web browser, after two years of lobbying CERN management. By Christmas 1990, Berners-Lee had built all the tools necessary for a working Web: the HyperText Transfer Protocol (HTTP) 0.9, the HyperText Markup Language (HTML), the first Web browser (which was also an HTML editor and could access Usenet newsgroups and FTP files), the first HTTP server software (later known as CERN httpd), the first web server, and the first Web pages that described the project itself. In 1991 the

    Read more →
  • JustWatch

    JustWatch

    JustWatch is a website that provides information on the availability of films and TV shows on various streaming platforms such as Netflix, HBO Max, Disney+, Hulu, Peacock, Fandango at Home, Apple TV, and Amazon Prime Video, among others. It is also available as a mobile application and smart TV application. JustWatch provides a search engine that allows users to discover which digital platforms host a particular movie or TV series. As of November 2023, JustWatch is available to users in 139 countries. == Features == JustWatch functions as a search engine by aggregating information about the online availability of films and TV series from video-on-demand streaming services. It aggregates information from more than 100 video content libraries, as well providing information about video resolution quality, pricing, and purchase or rental options. The website includes various filters for searching, including genre, price, release date, rating, and popularity. Users are also able to create lists of shows and movies and to share these lists with other users. == History == JustWatch GmbH is an international database company that is privately held and headquartered in Berlin, Germany. The company specializes in the online availability of movies and TV series. In addition to its user-facing website, the company also has an advertising-focused arm, JustWatch Media, that works with corporate clients, using data about what people watch that it gleans from user behavior to help entertainment companies tailor their marketing strategies. Its clients include Universal Pictures, Paramount Pictures, and Sony Pictures, among others. Development of the website began in 2014, and it was launched in the U.S. and Germany in February 2015. In 2018, the company received funding to improve databases within the European Union. In December 2019, the company acquired a rival streaming aggregation service, GoWatchIt, from Plexus Entertainment. JustWatch also used the acquisition to open its first New York office. In 2019, JustWatch had over 30 million users across 38 countries. By 2020, the company's streaming aggregation service was available in over 45 countries. By November 2023, it was available in 139 countries, and had over 40 million monthly users. === Founding === JustWatch was co-founded in 2013 by David Croyé, Cristoph Hoyer, Kevin Hiller, Dominik Raute, Ingke Weimert, and Michael Wilken. In a company blog post from February 2017, Croyé described the group of co-founders as all having previously "worked in leading roles at successful international tech-startups in Berlin." Croyé, who currently holds the title of CEO at JustWatch GmbH, had previously worked as the chief marketing officer at kaufDA, a European location-based mobile coupon and promotion service, and the background of other co-founders included time at the adtech company Trademob and the streaming site MyVideo. Startup capital for the website initially came from the founders themselves. Croyé in particular was able to reinvest funds he had obtained from the sale of kaufDA to Axel Springer, a European media company, in March 2011. Since 2015, the company has had at least one additional round of seed funding, with investors including venture capital groups CG Partners and STS Ventures.

    Read more →
  • Social media measurement

    Social media measurement

    Social media measurement, also called social media controlling, is the management practice of evaluating successful social media communications of brands, companies, or other organizations. Key performance indicators may be measured by extracting information from social media channels, such as blogs, wikis, micro-blogs such as Twitter, social networking sites, or video/photo sharing websites, forums from time to time. It is also used by companies to gauge current trends in the industry. The process first gathers data from different websites and then performs analysis based on different metrics like time spent on the page, click through rate, content share, comments, text analytics to identify positive or negative emotions about the brand. Some other social media metrics include share of voice, owned mentions, and earned mentions. The social media measurement process starts with defining a goal that needs to be achieved and defining the expected outcome of the process. The expected outcome varies per the goal and is usually measured by a variety of metrics. This is followed by defining possible social strategies to be used to achieve the goal. Then the next step is designing strategies to be used and setting up configuration tools that ease the process of collecting the data. In the next step, strategies and tools are deployed in real-time. This step involves conducting Quality Assurance tests of the methods deployed to collect the data. And in the final step, data collected from the system is analyzed and if the need arises, it is refined on the run time to enhance the methodologies used. The last step ensures that the result obtained is more aligned with the goal defined in the first step. == Data Acquisition == Acquiring data from social media is in demand of an exploring the user participation and population with the purpose of retrieving and collecting so many kinds of data(ex: comments, downloads etc.). There are several prevalent techniques to acquire data such as Network traffic analysis, Ad-hoc application and Crawling Network Traffic Analysis - Network traffic analysis is the process of capturing network traffic and observing it closely to determine what is happening in the network. It is primarily done to improve the performance, security and other general management of the network. However concerned about the potential tort of privacy on the Internet, network traffic analysis is always restricted by the government. Furthermore, high-speed links are not adaptable to traffic analysis because of the possible overload problem according to the packet sniffing mechanism Ad-hoc Application - Ad-hoc application is a kind of application that provides services and games to social network users by developing the APIs offered by social network companies (Facebook Developer Platform). The infrastructure of Ad-hoc application allows the user to interact with the interface layer instead of the application servers. The API provides a path for application to access information after the user login. Moreover, the size of the data set collected vary with the popularity of the social media platform i.e. social media platforms having high number of users will have more data than platforms having less user base. Scraping is a process in which the APIs collect online data from social media. The data collected from Scraping is in raw format. However, having access to these types of data is a bit difficult because of its commercial value. Crawling - Crawling is a process in which a web crawler creates indexes of all the words in a web-page, stores them, then follows all the hyperlinks and indexes on that page and again stores them. It is the most popular technique for data acquisition and is also well known for its easy operation based on prevalent Object-Orientated Programming Language (Java or Python etc.). And most important, social network companies (YouTube, Flicker, Facebook, Instagram, etc.) are friendly to crawling techniques by providing public APIs == Applications == === For branding === Monitoring social media allows researchers to find insights into a brand's overall visibility on social media, to measure the impact of campaigns, to identify opportunities for engagement, to assess competitor activity and share of voice, and to detect impending crises. It can also provide valuable information about emerging trends and what consumers and clients think about specific topics, brands or products. This is the work of a cross-section of groups that include market researchers, PR staff, marketing teams, social-engagement, and community staff, agencies and sales teams. Several different providers have developed tools to facilitate the monitoring of a variety of social media channels - from blogging to internet video to internet forums. This allows companies to track what consumers say about their brands and actions. Companies can then react to these conversations and interact with consumers through social media platforms. === In government === Apart from commercial applications, social media monitoring has become a pervasive technique applied by public organizations and governments. Monitoring is a tradition within the public sector, and social-media monitoring provides a real-time approach to detecting and responding to social developments. Governments have come to realize the need for strategies to cope with surprises from the rapid expansion of public issues. Sobkowicz introduced a framework with three blocks of social-media opinion tracking, simulating and forecasting. It includes: real-time detection of emotions, topics and opinions information-flow modelling and agent-based simulation modeling of opinion networks Bekkers introduced the application of social media monitoring in the Netherlands. Public organizations in the Netherlands (such as the Tax Agency and the Education Ministry) have started to use social media monitoring to obtain better insights into the sentiments of target groups. On the one hand, the public sector will be enabled to provide timely and efficient answers to the public by using social media monitoring techniques, but on the other hand, they also have to deal with concerns about ethical issues such as transparency and privacy. == Quantifying social media == Social media management software (SMMS) is an application program or software that facilitates an organization's ability to successfully engage in social media across different communication channels. SMMS is used to monitor inbound and outbound conversations, support customer interaction, audit or document social marketing initiatives and evaluate the usefulness of a social media presence. It can be difficult to measure all social media conversations. Due to privacy settings and other issues, not all social media conversations can be found and reported by monitoring tools. However, whilst social media monitoring cannot give absolute figures, it can be extremely useful for identifying trends and for benchmarking, in addition to the uses mentioned above. These findings can, in turn, influence and shape future business decisions. In order to access social media data (posts, Tweets, and meta-data) and to analyze and monitor social media, many companies use software technologies built for business. These range from in-platform analytics dashboards to dedicated third-party platforms, which offer more advanced capabilities including cross-platform audience intelligence, sentiment analysis, and trend detection at scale. == Location-based == Most social media networks allow users to add a location to their posts (reference all of our feeds). The location can be classified as either 'at-the-location' or 'about-the-location'. "'At-the-location' services can be defined as services where location-based content is created at the geographic location. 'About-the-location' services can be defined as services which are referring to a particular location but the content is not necessarily created in this particular physical place." The added information available from geotagged (link to Geotagging article) posts means that they can be displayed on a map. This means that a location can be used as the start of a social media search rather than a keyword or hashtag. This has major implications for disaster relief, event monitoring, safety and security professionals since a large portion of their job is related to tracking and monitoring specific locations. == Technologies used == Various monitoring platforms use different technologies for social media monitoring and measurement. These technology providers may connect to the API provided by social platforms that are created for 3rd party developers to develop their own applications and services that access data. Facebook's Graph API is one such API that social media monitoring solution products would connect to pull data from. Some social media monitoring and analytics companies use calls to data providers each time an end-user d

    Read more →
  • Influencer

    Influencer

    An influencer is an individual who has the capacity to shape the attitudes, behavior, or decisions of others through authority, knowledge, position, or the nature of the relationship with the audience. The term is used in various fields such as media, business, politics, religion, and communication, referring to influencers such as social media influencers, podcasters, public speakers, religious influencers, writers, and newsletter writers etc who have dedicated followings in various areas. One writer defines influencers as "a range of third parties who exercise influence over the organization and its potential customers." Another writer defines an influencer as a "third party who significantly shapes the customer's purchasing decision but may never be accountable for it." According to another writer, influencers are "well-connected, create an impact, have active minds, and are trendsetters". Just because a person has many followers does not necessarily mean they have much influence over those people. In contemporary usage, the term frequently refers to a social media influencer, (also known as an online influencer or simply influencer) a person who builds a grassroots online presence through engaging content such as photos, videos, and updates. This is done by using direct audience interaction to establish authenticity, expertise, and appeal, and by standing apart from traditional celebrities by growing their platform through social media rather than pre-existing fame. The modern referent of the term is commonly a paid role in which a business entity pays for the social media influence-for-hire activity to promote its products and services, known as influencer marketing. A 1% increase in spending on influencer marketing can lead to a 0.5% increase in audience engagement. As such, an influencer effectively acts as a modern salesperson or a marketer. Types of influencers include fashion influencer, travel influencer, and virtual influencer, and they involve content creators and streamers. Some influencers are associated primarily with specific social media apps such as TikTok, Instagram, or Pinterest; many influencers are also considered internet celebrities. As of 2023, Instagram is the social media platform businesses spend the most advertising money towards marketing with influencers. However, influencers can have an impact on any social media network. == History == === Origins === The word influencer in its general sense of a person or thing that exerts influence, is attested in historical sources at least since the 17th century. The Oxford English Dictionary (OED) gives 1664 as the earliest example of usage and cites a sentence from Henry More's A Modest Enquiry into the Mystery of Iniquity: "The head and influencer of the whole Church". The origins of online influencing can be traced back to the emergence of digital blogs and platforms in the early 2000s. Nevertheless, recent studies demonstrate that Instagram, an application with more than one billion users, harbors the majority of the influencer demographic. These individuals are sometimes referred to as "Instagrammers" or "Instafamous". A crucial aspect of influencing is their association with sponsors. The 2015 debut of Vamp, a company that links influencers with sponsorships, transformed the landscape of influencing. There is much debate about whether social media influencers can be considered celebrities, as their path to fame is often less traditional and arguably easier. Melody Nouri addressed the differences between the two types in her article "The Power of Influence: Traditional Celebrities vs Social Media Influencer". Nouri asserts that social media platforms have a greater negative impact on young, impressionable audiences in comparison with traditional media such as magazines, billboards, advertisements, and tabloids featuring celebrities. Online, it is thought to be simpler to manipulate an image and lifestyle in such a way that viewers are more susceptible to believing it. One theory considers the former American First Lady Eleanor Roosevelt (1884–1962) to be the "original media influencer." While she achieved celebrity in her role as First Lady, she built a global personal brand as a wise, informative, trustworthy American woman. Her voice was her own, unrestricted by political advisors and powerful men, and with it, Roosevelt exerted unprecedented social and cultural influence in radio, print, public speaking, film, and television until she died. In one notable example, it may have been Roosevelt's television support of John F. Kennedy which nudged his "hairline victory" during the 1960 Presidential campaign. In another example, David Ogilvy paid Roosevelt more than a quarter of a million dollars in today's currency to make a TV commercial for Good Luck margarine (1959), in which Roosevelt also managed to mention world hunger. As a content creator, she wrote My Day, a popular daily newspaper column that ran nationwide for twenty-six years. Like a social media post, My Day covered all aspects of her life, and in it Roosevelt often recommended movies, books, and products that she admired. Roosevelt also had a hand in designing all three of her public affairs television shows. Unlike contemporary influencers, she was less motivated by a pay-to-play situation than by a desire to educate and inspire; but she did use her influence to benefit the entertainment industry careers of her children, and she welcomed the revenue that her influence bought, most of which was donated to charity. === 2000s === The early 2000s showed corporate endeavors to leverage the internet for influence, with some companies participating in forums for promotions or providing bloggers with complimentary products in return for favorable reviews. A few of these practices were viewed as unethical for taking advantage of the labor of young individuals without providing remuneration. In 2004, The Blogstar Network was established by Ted Murphy of MindComet. Bloggers were encouraged to join an email list and receive remunerated offers from corporations in exchange for creating specific posts. For instance, bloggers were compensated for writing reviews of fast-food meals on their blogs. Blogstar is widely regarded as the first influencer marketing network. Murphy succeeded Blogstar with PayPerPost, which was introduced in 2006. This platform compensated significant posters on prominent forums and social media platforms for every post made about a corporate product. Payment rates were determined by the influencer's status. Though very popular, PayPerPost, received a great deal of criticism as these influencers were not required to disclose their involvement with PayPerPost as traditional journalism would have. With the success of PayPerPost, the public became aware that there was a drive for corporate interests to influence what some people were posting to these sites. The platform also incentivized other firms to establish comparable programs. Despite concerns, marketing networks with influencers continued to grow throughout the 2000s and into the 2010s. The influencer marketing industry was worth as much as $8 billion in 2019, according to estimates from Business Insider Intelligence, which are based on Mediakix data. Evan Asano, the Former CEO and founder of the agency Mediakix, previously spoke with Business Insider and said he believed influencer marketing on Instagram would continue to grow despite likes being hidden. === 2010s === By the 2010s, the term "influencer" described digital content creators with a large following, distinctive brand persona, and a patterned relationship with commercial sponsors. By this period, influencer marketing had become a widely researched field globally, with systematic reviews drawing on hundreds of studies that documented the growing role of authenticity, audience engagement, and parasocial relationships in shaping how consumers responded to influencer content across different markets. During this period, influencer culture also developed through distinct channels outside Western markets. In South Korea, the global spread of Korean pop culture, also called K-Pop, through platforms such as YouTube, Facebook, and Twitter gave rise to what scholars have called 'Hallyu 2.0' or the 'New Korean Wave', where fans throughout Southeast Asia, North America, Latin America, and Europe shared, subtitled, and redistributed Korean music and film content on a large scale. This helped Korean entertainers to build substantial followings internationally. Consumers often mistakenly view celebrities as reliable, leading to trust and confidence in the products being promoted. A 2001 study from Rutgers University discovered that individuals were using "internet forums as influential sources of consumer information." The study proposes that consumers preferred internet forums and social media when making purchasing decisions over conventional advertising and print sources. An in

    Read more →
  • Radical trust

    Radical trust

    Radical trust is the confidence that any structured organization, such as a government, library, business, religion, or museum, has in collaboration and empowerment within online communities. Specifically, it pertains to the use of blogs, wiki and online social networking platforms by organizations to cultivate relationships with an online community that then can provide feedback and direction for the organization's interest. The organization 'trusts' and uses that input in its management. One of the first appearances of the notion of radical trust appears in an info graphic outlining the base principles of web 2.0 in Tim O'Reilly's weblog post "What is Web 2.0". Radical Trust is listed as the guiding example of trusting the validity of consumer generated media. This concept is considered to be an underlying assumption of Library 2.0. The adoption of radical trust by a library would require its management let go of some of its control over the library and building an organization without an end result in mind. The direction a library would take would be based on input provided by people through online communities. These changes in the organization may merely be anecdotal in nature, making this method of organization management dramatically distinct from data-based or evidence based management. In marketing, Collin Douma further describes the notion of radical trust as a key mindset required for marketers and advertisers to enter the social media marketing space. Conventional marketing dictates and maintains control of messages to cause the greatest persuasion in consumer decisions, but Douma argued that in the social media space, brands would need to cede that control in order to build brand loyalty.

    Read more →
  • Edge inference

    Edge inference

    Edge inference is the process of running machine learning or deep learning models on local devices (edge devices) such as smartphones, IoT devices, embedded systems, and edge servers instead of centralized cloud computing infrastructure. A key feature of edge computing is edge inference, which allows for real-time data processing, low latency, and improved privacy by reducing the amount of data sent to remote servers.

    Read more →
  • Hybrid cryptosystem

    Hybrid cryptosystem

    In cryptography, a hybrid cryptosystem is one which combines the convenience of a public-key cryptosystem with the efficiency of a symmetric-key cryptosystem. Public-key cryptosystems are convenient in that they do not require the sender and receiver to share a common secret in order to communicate securely. However, they often rely on complicated mathematical computations and are thus generally much more inefficient than comparable symmetric-key cryptosystems. In many applications, the high cost of encrypting long messages in a public-key cryptosystem can be prohibitive. This is addressed by hybrid systems by using a combination of both. A hybrid cryptosystem can be constructed using any two separate cryptosystems: a key encapsulation mechanism, which is a public-key cryptosystem a data encapsulation scheme, which is a symmetric-key cryptosystem The hybrid cryptosystem is itself a public-key system, whose public and private keys are the same as in the key encapsulation scheme. Note that for very long messages the bulk of the work in encryption/decryption is done by the more efficient symmetric-key scheme, while the inefficient public-key scheme is used only to encrypt/decrypt a short key value. == Implementations and standards == All practical implementations of public key cryptography today employ a hybrid system. Examples include the TLS protocol and the SSH protocol, that use a public-key mechanism for key exchange (such as Diffie-Hellman) and a symmetric-key mechanism for data encapsulation (such as AES). The OpenPGP file format and the PKCS#7 file format are other examples. Hybrid Public Key Encryption (HPKE, published as RFC 9180) is a modern standard for generic hybrid encryption. HPKE is used within multiple IETF protocols, including Messaging Layer Security (MLS), Oblivious DNS over HTTPS, Oblivious HTTP, Privacy Preserving Measurement, and TLS Encrypted Client Hello. Envelope encryption is an example of a usage of hybrid cryptosystems in cloud computing. In a cloud context, hybrid cryptosystems also enable centralized key management. == Example == To encrypt a message addressed to Alice in a hybrid cryptosystem, Bob does the following: Obtains Alice's public key. Generates a fresh symmetric key for the data encapsulation scheme. Encrypts the message under the data encapsulation scheme, using the symmetric key just generated. Encrypts the symmetric key under the key encapsulation scheme, using Alice's public key. Sends both of these ciphertexts to Alice. To decrypt this hybrid ciphertext, Alice does the following: Uses her private key to decrypt the symmetric key contained in the key encapsulation segment. Uses this symmetric key to decrypt the message contained in the data encapsulation segment. == Security == If both the key encapsulation and data encapsulation schemes in a hybrid cryptosystem are secure against adaptive chosen ciphertext attacks, then the hybrid scheme inherits that property as well. However, it is possible to construct a hybrid scheme secure against adaptive chosen ciphertext attacks even if the key encapsulation has a slightly weakened security definition (though the security of the data encapsulation must be slightly stronger). == Envelope encryption == Envelope encryption is term used for encrypting with a hybrid cryptosystem used by all major cloud service providers, often as part of a centralized key management system in cloud computing. Envelope encryption gives names to the keys used in hybrid encryption: Data Encryption Keys (abbreviated DEK, and used to encrypt data) and Key Encryption Keys (abbreviated KEK, and used to encrypt the DEKs). In a cloud environment, encryption with envelope encryption involves generating a DEK locally, encrypting one's data using the DEK, and then issuing a request to wrap (encrypt) the DEK with a KEK stored in a potentially more secure service. Then, this wrapped DEK and encrypted message constitute a ciphertext for the scheme. To decrypt a ciphertext, the wrapped DEK is unwrapped (decrypted) via a call to a service, and then the unwrapped DEK is used to decrypt the encrypted message. In addition to the normal advantages of a hybrid cryptosystem, using asymmetric encryption for the KEK in a cloud context provides easier key management and separation of roles, but can be slower. In cloud systems, such as Google Cloud Platform and Amazon Web Services, a key management system (KMS) can be available as a service. In some cases, the key management system will store keys in hardware security modules, which are hardware systems that protect keys with hardware features like intrusion resistance. This means that KEKs can also be more secure because they are stored on secure specialized hardware. Envelope encryption makes centralized key management easier because a centralized key management system only needs to store KEKs, which occupy less space, and requests to the KMS only involve sending wrapped and unwrapped DEKs, which use less bandwidth than transmitting entire messages. Since one KEK can be used to encrypt many DEKs, this also allows for less storage space to be used in the KMS. This also allows for centralized auditing and access control at one point of access.

    Read more →
  • Social media newsroom

    Social media newsroom

    A social media newsroom is a company resource, set up to increase the functionality and usability of the traditional online newsroom. Social media newsrooms (SMNs) are intended to encourage dialogue and information sharing. Unlike online newsrooms, content is accessible to more than just journalists, but to all those with whom the company engages such as bloggers, their prospects, customers, business partners and investors. It gives these stakeholders access to news, public relations announcements, images, audio, video and other multimedia files. In addition to posting press releases and corporate news, companies can integrate other social content from sites such as YouTube, Flickr and Slideshow as well as streams from corporate Twitter accounts. Traditional tools for journalists such as corporate fast facts, leadership information, a multimedia library, financial information, awards and other recent media coverage are also included in an SMN. Examples of companies effectively using social media newsrooms include Opel Group, Pressat, First Direct, MyNewsdesk, Scania and Newport Beach.

    Read more →
  • Opinion Space

    Opinion Space

    Developed at UC Berkeley, "Opinion Space" (also known as The Collective Discovery Engine) is a social media technology designed to help communities generate and exchange ideas about important issues and policies. Version 1.0 was launched on April 4, 2009, at UC Berkeley, and explored the question "Do you think legalizing marijuana is a good idea?" It has since undergone 4 different iterations, and been used in partnership with various organizations including The Occupy movement (Version 4.0, 5/24/2013) and the African Robots Network (Version 4.0, 5/25/2013). Opinion Space has also been used in collaboration with the United States State Department and the University of California's Berkeley Center for New Media (Version 2.0, 12/1/2009 and Version 3.0, 2/25/2012) to gain public perspective on foreign policy issues. Then U.S. Secretary of State Hillary Rodham Clinton explained, "Opinion Space will harness the power of connection technologies to provide a unique forum for international dialogue. This is...an opportunity to extend our engagement beyond the halls of government directly to the people of the world" (2010). The website uses data visualization and statistical analysis to present and develop public opinion and ideas. Opinion Space is a self-organizing system that uses an intuitive graphical "map" that displays patterns, trends, and insights as they emerge and employs the wisdom of crowds to identify and highlight the most insightful ideas. The system uses a game model that incorporates techniques from deliberative polling, collaborative filtering, and multidimensional visualization.

    Read more →
  • NeoPaint

    NeoPaint

    NeoPaint is a raster graphics editor for Windows and MS-DOS. It supports several file formats including JPEG, GIF, BMP, PNG, and TIFF. The developer, NeoSoft, advertises NeoPaint as "being simple enough for use by children while remaining powerful enough for the purposes of advanced image editing". The first version, NeoPaint 1.0, was released in 1992 on floppy disks. It supported video modes ranging from 640x350 to 1024x768 and multiple fonts. NeoPaint 2.2 came out for MS-DOS 3.1 in 1993, with support of for 2, 16, or 256 color images in Hercules, EGA, VGA, and Super VGA modes. NeoPaint 3.1 was released in 1995 supporting 24-bit images and formats like PCX, TIFF and BMP. NeoPaint 3.2 was released in 1996. An updated version, NeoPaint 3.2a, supported the GIF file format. NeoPaint 3.2d was released in 1998. A Windows 95 version named NeoPaint for Windows v4.0 was released in 1999 supporting the PNG file format. On September 1, 2018 the program was rebranded as PixelNEO, becoming one of the VisualNEO software products. Formats such as JPEG 2000, ICO, CUR, PSD and RAW are supported.

    Read more →
  • Unknown key-share attack

    Unknown key-share attack

    As defined by Blake-Wilson & Menezes (1999), an unknown key-share (UKS) attack on an authenticated key agreement (AK) or authenticated key agreement with key confirmation (AKC) protocol is an attack whereby an entity A {\displaystyle A} ends up believing she shares a key with B {\displaystyle B} , and although this is in fact the case, B {\displaystyle B} mistakenly believes the key is instead shared with an entity E ≠ A {\displaystyle E\neq A} . In other words, in a UKS, an opponent, say Eve, coerces honest parties Alice and Bob into establishing a secret key where at least one of Alice and Bob does not know that the secret key is shared with the other. For example, Eve may coerce Bob into believing he shares the key with Eve, while he actually shares the key with Alice. The “key share” with Alice is thus unknown to Bob.

    Read more →
  • Data deduplication

    Data deduplication

    In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization, which may in turn lower capital expenditure by reducing the overall amount of storage media required to meet storage capacity needs. It can also be applied to network data transfers to reduce the number of bytes that must be sent. The deduplication process requires comparison of data 'chunks' (also known as 'byte patterns') which are unique, contiguous blocks of data. These chunks are identified and stored during a process of analysis, and compared to other chunks within existing data. Whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced. A related technique is single-instance (data) storage, which replaces multiple copies of content at the whole-file level with a single shared copy. While possible to combine this with other forms of data compression and deduplication, it is distinct from newer approaches to data deduplication (which can operate at the segment or sub-block level). Deduplication is different from data compression algorithms, such as LZ77 and LZ78. Whereas compression algorithms identify redundant data inside individual files and encodes this redundant data more efficiently, the intent of deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, and replace them with a shared copy. == Functioning principle == For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy for deduplication ratio of roughly 100 to 1. Deduplication is often paired with data compression for additional storage saving: Deduplication is first used to eliminate large chunks of repetitive data, and compression is then used to efficiently encode each of the stored chunks. In computer code, deduplication is done by, for example, storing information in variables so that they don't have to be written out individually but can be changed all at once at a central referenced location. Examples are CSS classes and named references in MediaWiki. == Benefits == Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on a single disk. In the case of data backups, which routinely are performed to protect against data loss, most data in a given backup remain unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard linking) files that haven't changed or storing differences between files. Neither approach captures all redundancies, however. Hard-linking does not help with large files that have only changed in small ways, such as an email database; differences only find redundancies in adjacent versions of a single file (consider a section that was deleted and later added in again, or a logo image included in many documents). In-line network data deduplication is used to reduce the number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information. Virtual servers and virtual desktops benefit from deduplication because it allows nominally separate system files for each virtual machine to be coalesced into a single storage space. At the same time, if a given virtual machine customizes a file, deduplication will not change the files on the other virtual machines—something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved. == Classification == === Post-process versus in-line deduplication === Deduplication may occur "in-line", as data is flowing, or "post-process" after it has been written. With post-process deduplication, new data is first stored on the storage device and then a process at a later time will analyze the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, thereby ensuring that store performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on "active" files, or to process files based on type and location. One potential drawback is that duplicate data may be unnecessarily stored for a short time, which can be problematic if the system is nearing full capacity. Alternatively, deduplication hash calculations can be done in-line: synchronized as data enters the target device. If the storage system identifies a block which it has already stored, only a reference to the existing block is stored, rather than the whole new block. The advantage of in-line deduplication over post-process deduplication is that it requires less storage and network traffic, since duplicate data is never stored or transferred. On the negative side, hash calculations may be computationally expensive, thereby reducing the storage throughput. However, certain vendors with in-line deduplication have demonstrated equipment which performs in-line deduplication at high rates. Post-process and in-line deduplication methods are often heavily debated. === Data formats === The SNIA Dictionary identifies two methods: Content-agnostic data deduplication – a data deduplication method that does not require awareness of specific application data formats. Content-aware data deduplication – a data deduplication method that leverages knowledge of specific application data formats. === Source versus target deduplication === Another way to classify data deduplication methods is according to where they occur. Deduplication occurring close to where data is created, is referred to as "source deduplication". When it occurs near where the data is stored, it is called "target deduplication". Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file system. The file system will periodically scan new files creating hashes and compare them to hashes of existing files. When files with same hashes are found then the file copy is removed and the new file points to the old file. Unlike hard links however, duplicated files are considered to be separate entities and if one of the duplicated files is later modified, then using a system called copy-on-write a copy of that changed file or block is created. The deduplication process is transparent to the users and backup applications. Backing up a deduplicated file system will often cause duplication to occur resulting in the backups being bigger than the source data. Source deduplication can be declared explicitly for copying operations, as no calculation is needed to know that the copied data is in need of deduplication. This leads to a new form of link on file systems, called a reference-counted link, or reflink, in some systems (e.g. Linux), or a cloned file on macOS, where one or more inodes (file information entries) are made to share some or all of their data. It is named analogously to hard links, which work at the inode level, and symbolic links, which work at the filename level.The individual entries have a copy-on-write behavior that is non-aliasing, i.e. changing one copy afterwards will not affect other copies. Microsoft's ReFS also supports this operation. Target deduplication is the process of removing duplicates when the data was not generated at that location. Example of this would be a server connected to a SAN/NAS, The SAN/NAS would be a target for the server (target deduplication). The server is not aware of any deduplication, the server is also the point of data generation. A second example would be backup. Generally this will be a backup store such as a data repository or a virtual tape library. === Deduplication methods === One of the most common forms of data deduplication implementations works by comparing chunks of data to detect duplicates. For that to happen, each chunk of data is assigned an identification, calculated by the software, typically using cryptographic hash functions. In many implementations, the assumption is made that if the identification is identical, the data is identical, even though this cannot be true in all cases due to the pigeonhole principle; other implementations do not as

    Read more →