Determining the number of clusters in a data set

Determining the number of clusters in a data set

Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem. For a certain class of clustering algorithms (in particular k-means, k-medoids and expectation–maximization algorithm), there is a parameter commonly referred to as k that specifies the number of clusters to detect. Other algorithms such as DBSCAN and OPTICS algorithm do not require the specification of this parameter; hierarchical clustering avoids the problem altogether. The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when k equals the number of data points, n). Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision. == Elbow method == The elbow method looks at the percentage of explained variance as a function of the number of clusters: One should choose a number of clusters so that adding another cluster does not give much better modeling of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion". In most datasets, this "elbow" is ambiguous, making this method subjective and unreliable. Because the scale of the axes is arbitrary, the concept of an angle is not well-defined, and even on uniform random data, the curve produces an "elbow", making the method rather unreliable. Percentage of variance explained is the ratio of the between-group variance to the total variance, also known as an F-test. A slight variation of this method plots the curvature of the within group variance. The method can be traced to speculation by Robert L. Thorndike in 1953. While the idea of the elbow method sounds simple and straightforward, other methods (as detailed below) give better results. == X-means clustering == In statistics and data mining, X-means clustering is a variation of k-means clustering that refines cluster assignments by repeatedly attempting subdivision, and keeping the best resulting splits, until a criterion such as the Akaike information criterion (AIC) or Bayesian information criterion (BIC) is reached. == Information criterion approach == Another set of methods for determining the number of clusters are information criteria, such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), or the deviance information criterion (DIC) — if it is possible to make a likelihood function for the clustering model. For example: The k-means model is "almost" a Gaussian mixture model and one can construct a likelihood for the Gaussian mixture model and thus also determine information criterion values. == Information–theoretic approach == Rate distortion theory has been applied to choosing k called the "jump" method, which determines the number of clusters that maximizes efficiency while minimizing error by information-theoretic standards. The strategy of the algorithm is to generate a distortion curve for the input data by running a standard clustering algorithm such as k-means for all values of k between 1 and n, and computing the distortion (described below) of the resulting clustering. The distortion curve is then transformed by a negative power chosen based on the dimensionality of the data. Jumps in the resulting values then signify reasonable choices for k, with the largest jump representing the best choice. The distortion of a clustering of some input data is formally defined as follows: Let the data set be modeled as a p-dimensional random variable, X, consisting of a mixture distribution of G components with common covariance, Γ. If we let c 1 … c K {\displaystyle c_{1}\ldots c_{K}} be a set of K cluster centers, with c X {\displaystyle c_{X}} the closest center to a given sample of X, then the minimum average distortion per dimension when fitting the K centers to the data is: d K = 1 p min c 1 … c K E [ ( X − c X ) T Γ − 1 ( X − c X ) ] {\displaystyle d_{K}={\frac {1}{p}}\min _{c_{1}\ldots c_{K}}{E[(X-c_{X})^{T}\Gamma ^{-1}(X-c_{X})]}} This is also the average Mahalanobis distance per dimension between X and the closest cluster center c X {\displaystyle c_{X}} . Because the minimization over all possible sets of cluster centers is prohibitively complex, the distortion is computed in practice by generating a set of cluster centers using a standard clustering algorithm and computing the distortion using the result. The pseudo-code for the jump method with an input set of p-dimensional data points X is: JumpMethod(X): Let Y = (p/2) Init a list D, of size n+1 Let D[0] = 0 For k = 1 ... n: Cluster X with k clusters (e.g., with k-means) Let d = Distortion of the resulting clustering D[k] = d^(-Y) Define J(i) = D[i] - D[i-1] Return the k between 1 and n that maximizes J(k) The choice of the transform power Y = ( p / 2 ) {\displaystyle Y=(p/2)} is motivated by asymptotic reasoning using results from rate distortion theory. Let the data X have a single, arbitrarily p-dimensional Gaussian distribution, and let fixed K = ⌊ α p ⌋ {\displaystyle K=\lfloor \alpha ^{p}\rfloor } , for some α greater than zero. Then the distortion of a clustering of K clusters in the limit as p goes to infinity is α − 2 {\displaystyle \alpha ^{-2}} . It can be seen that asymptotically, the distortion of a clustering to the power ( − p / 2 ) {\displaystyle (-p/2)} is proportional to α p {\displaystyle \alpha ^{p}} , which by definition is approximately the number of clusters K. In other words, for a single Gaussian distribution, increasing K beyond the true number of clusters, which should be one, causes a linear growth in distortion. This behavior is important in the general case of a mixture of multiple distribution components. Let X be a mixture of G p-dimensional Gaussian distributions with common covariance. Then for any fixed K less than G, the distortion of a clustering as p goes to infinity is infinite. Intuitively, this means that a clustering of less than the correct number of clusters is unable to describe asymptotically high-dimensional data, causing the distortion to increase without limit. If, as described above, K is made an increasing function of p, namely, K = ⌊ α p ⌋ {\displaystyle K=\lfloor \alpha ^{p}\rfloor } , the same result as above is achieved, with the value of the distortion in the limit as p goes to infinity being equal to α − 2 {\displaystyle \alpha ^{-2}} . Correspondingly, there is the same proportional relationship between the transformed distortion and the number of clusters, K. Putting the results above together, it can be seen that for sufficiently high values of p, the transformed distortion d K − p / 2 {\displaystyle d_{K}^{-p/2}} is approximately zero for K < G, then jumps suddenly and begins increasing linearly for K ≥ G. The jump algorithm for choosing K makes use of these behaviors to identify the most likely value for the true number of clusters. Although the mathematical support for the method is given in terms of asymptotic results, the algorithm has been empirically verified to work well in a variety of data sets with reasonable dimensionality. In addition to the localized jump method described above, there exists a second algorithm for choosing K using the same transformed distortion values known as the broken line method. The broken line method identifies the jump point in the graph of the transformed distortion by doing a simple least squares error line fit of two line segments, which in theory will fall along the x-axis for K < G, and along the linearly increasing phase of the transformed distortion plot for K ≥ G. The broken line method is more robust than the jump method in that its decision is global rather than local, but it also relies on the assumption of Gaussian mixture components, whereas the jump method is fully non-parametric and has been shown to be viable for general mixture distributions. == Silhouette method == The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a data instance is a measure of how closely it is match

2018 Google data breach

The 2018 Google data breach was a major data privacy scandal in which the Google+ API exposed the private data of over five hundred thousand users. Google+ managers first noticed harvesting of personal data in March 2018, during a review following the Facebook–Cambridge Analytica data scandal. The bug, despite having been fixed immediately, exposed the private data of approximately 500,000 Google+ users to the public. Google did not reveal the leak to the network's users. In November 2018, another data breach occurred following an update to the Google+ API. Although Google found no evidence of failure, approximately 52.5 million personal profiles were potentially exposed. In August 2019, Google declared a shutdown of Google+ due to low use and technological challenges. == Overview of Google+ == Google+ was launched in June 2011 as an invite-only social network, but was opened for public access later in the year. It was managed by Vic Gundotra. Similar to Facebook, Google+ also included key features Circles, Hangouts and Sparks. Circles let users personalize their social groups by sorting friends into different categories. Once allowed into a Circle, users could regulate information in their individual spaces. Hangouts included video chatting and instant messaging between users. Sparks allowed Google to track users' past searches to find news and content related to their interests. Google+ was linked to other Google services, such as YouTube, Google Drive and Gmail, giving it access to roughly 2 billion user accounts. However, less than 400 million consumers actively used Google+, with 90% of those users using it for less than five seconds. == The breaches == In March 2018, Google developers found a data breach within the Google+ People API in which external apps acquired access to Profile fields that were not marked as public. According to The Wall Street Journal, Google didn’t disclose the breach when it was first discovered in March to avoid regulatory scrutiny and reputational damage. 500,000 Google+ accounts were included in the breach, which allowed 438 external apps unauthorized access to private users' names, emails, addresses, occupations, genders and ages. This information was available between 2015 and 2018. Google found no evidence of any user's personal information being misused, nor that any third-party app developers were aware of the leak. In November 2018, a software update created another data breach within the Google+ API. The bug impacted 52.5 million users, where, similarly to the March breach, unauthorized apps were able to access Google+ profiles, including users' names, email addresses, occupations and ages. Apps could not access financial information, national identification, numbers, or passwords. Blog posts, messages and phone numbers also remained inaccessible if marked as private. Unlike the previous breach, access was only available for six days before Google+ learned of the breach. Once more, Google+ found no evidence of data being misused by third-party developers. == Responses == In October 2018, the Wall Street Journal published an article outlining the initial breach and Google's decision to not disclose it to users. At the time, there was no federal law that required Google to inform their consumers of data breaches. Google+ originally did not disclose the breach out of fears of being compared to Facebook's recent data leak and subsequent loss of consumer confidence. In response to the Wall Street Journal article, Google announced the shutdown of Google+ in August 2019. After the second data leak, the date was moved to April 2019. In response to the data breach, enterprise consumers were notified of the bug's impact and given instructions on how to save, download and delete their data prior to the Google+ shut down. Google's Privacy and Data Protection Office found no misuse of user data. Prior to the Google+ shutdown, Google set a 10-month period in which users could download and migrate their data. After the 10-month period, user content was deleted. On 4 February 2019, consumers were no longer able to create new Google+ profiles. Google shut down Google+ APIs on 7 March 2019 to ensure that developers did not continue to rely on the APIs prior to the Google+ shutdown. Google is the principal entity of its parent company, Alphabet Inc. After the data breach, Alphabet Inc. share prices fell by 1% to $1,157.06 on 9 October 2018 after an earlier drop of $1,135.40 that morning, the lowest price since 5 July 2018. After the publication of The Wall Street Journal article, share prices dropped as low as 2.1% in two days on 10 October 2018. Share prices steadily increased from this point and met the 8 October 2018 share price on 5 February 2019. Google planned to rebuild Google+ as a corporate enterprise network. Google Play will now assess which apps can ask for permission to access the user's SMS data. Only the default app for telephone distribution is able to make requests. Prior to the data breaches, apps were able to request access to all of a consumer's data simultaneously. Now, each app must request permission for each aspect of a consumer's profile.

Comparison of user features of operating systems

Comparison of user features of operating systems refers to a comparison of the general user features of major operating systems in a narrative format. It does not encompass a full exhaustive comparison or description of all technical details of all operating systems. It is a comparison of basic roles and the most prominent features. It also includes the most important features of the operating system's origins, historical development, and role. == Overview == An operating system (OS) is system software that manages computer hardware, software resources, and provides common services for computer programs. Time-sharing operating systems schedule tasks for efficient use of the system and may also include accounting software for cost allocation of processor time, mass storage, printing, and other resources. For hardware functions such as input and output and memory allocation, the operating system acts as an intermediary between programs and the computer hardware, although the application code is usually executed directly by the hardware and frequently makes system calls to an OS function or is interrupted by it. Operating systems are found on many devices that contain a computer – from cellular phones and video game consoles to web servers and supercomputers. As of June 2024, the dominant general-purpose desktop operating system is Microsoft Windows with a market share of around 72.91%. macOS by Apple Inc. is in second place (14.93%), and the varieties of Linux are collectively in third place (4.04%). In the mobile sector, including both smartphones and tablets, Android is dominant with a market share of 71%, followed by Apple's iOS with 28%; for smartphones alone, Android has 72% and iOS has 28%. Linux distributions are dominant in the server and supercomputing sectors. Other specialized classes of operating systems (special-purpose operating systems)), such as embedded and real-time systems, exist for many applications. Security-focused operating systems also exist. Some operating systems have low system requirements (i.e. light-weight Linux distribution). Others may have higher system requirements. Some operating systems require installation or may come pre-installed with purchased computers (OEM-installation), whereas others may run directly from media (i.e. live cd) or flash memory (i.e. USB stick). == MS-DOS == === Overview === MS-DOS (acronym for Microsoft Disk Operating System) is an operating system for x86-based personal computers mostly developed by Microsoft. Collectively, MS-DOS, its rebranding as IBM PC DOS, and some operating systems attempting to be compatible with MS-DOS, are sometimes referred to as "DOS" (which is also the generic acronym for disk operating system). MS-DOS was the main operating system for IBM PC compatible personal computers during the 1980s, from which point it was gradually superseded by operating systems offering a graphical user interface (GUI), in various generations of the graphical Microsoft Windows operating system. IBM licensed and re-released it in 1981 as PC DOS 1.0 for use in its PCs. Although MS-DOS and PC DOS were initially developed in parallel by Microsoft and IBM, the two products diverged after twelve years, in 1993, with recognizable differences in compatibility, syntax, and capabilities. During its lifetime, several competing products were released for the x86 platform, and MS-DOS went through eight versions, until development ceased in 2000. Initially, MS-DOS was targeted at Intel 8086 processors running on computer hardware using floppy disks to store and access not only the operating system, but application software and user data as well. Progressive version releases delivered support for other mass storage media in ever greater sizes and formats, along with added feature support for newer processors and rapidly evolving computer architectures. Ultimately, it was the key product in Microsoft's development from a programming language company to a diverse software development firm, providing the company with essential revenue and marketing resources. It was also the underlying basic operating system on which early versions of Windows ran as a GUI. == Microsoft Windows == === Overview === Microsoft Windows, commonly referred to as Windows, is a group of several proprietary graphical operating system families, all of which are developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. Active Microsoft Windows families include Windows NT and Windows IoT; these may encompass subfamilies, (e.g. Windows Server or Windows Embedded Compact) (Windows CE). Defunct Microsoft Windows families include Windows 9x, Windows Mobile, and Windows Phone. Microsoft announced an operating environment named Windows on 10 November 1983, as a graphical operating system shell for MS-DOS in response to the growing interest in graphical user interfaces (GUIs); Windows 1.0 first shipped on 20 November 1985. Microsoft Windows came to dominate the world's personal computer (PC) market with over 90% market share, overtaking Mac OS, which had been introduced in 1984, while Microsoft has in 2020 lost its dominance of the consumer operating system market, with Windows down to 30%, lower than Apple's 31% mobile-only share (65% for desktop operating systems only, i.e. "PCs" vs. Apple's 28% desktop share) in its home market, the US, and 32% globally (77% for desktops), where Google's Android leads. Apple came to see Windows as an unfair encroachment on their innovation in GUI development as implemented on products such as the Lisa and Macintosh (eventually settled in court in Microsoft's favor in 1993). As of January 2023, on PCs, Windows is still the most popular operating system in all countries. However, in 2014, Microsoft admitted losing the majority of the overall operating system market to Android, because of the massive growth in sales of Android smartphones. In 2014, the number of Windows devices sold was less than 25% that of Android devices sold. This comparison, however, may not be fully relevant, as the two operating systems traditionally target different platforms. Still, numbers for server use of Windows (that are comparable to competitors) show one third market share, similar to that for end user use. As of October 2020, the most recent version of Windows for PCs, tablets and embedded devices is Windows 10, version 20H2. The most recent version for server computers is Windows Server, version 20H2. A specialized version of Windows also runs on the Xbox One video game console. === Windows 95 === Windows 95 introduced a redesigned shell based around a desktop metaphor; File shortcuts (also known as shell links) were introduced and the desktop was re-purposed to hold shortcuts to applications, files and folders, reminiscent of Mac OS. In Windows 3.1 the desktop was used to display icons of running applications. In Windows 95, the currently running applications were displayed as buttons on a taskbar across the bottom of the screen. The taskbar also contained a notification area used to display icons for background applications, a volume control and the current time. The Start menu, invoked by clicking the "Start" button on the taskbar or by pressing the Windows key, was introduced as an additional means of launching applications or opening documents. While maintaining the program groups used by its predecessor Program Manager, it also displayed applications within cascading sub-menus. The previous File Manager program was replaced by Windows Explorer and the Explorer-based Control Panel and several other special folders were added such as My Computer, Dial Up Networking, Recycle Bin, Network Neighborhood, My Documents, Recent documents, Fonts, Printers, and My Briefcase among others. AutoRun was introduced for CD drives. The user interface looked dramatically different from prior versions of Windows, but its design language did not have a special name like Metro, Aqua or Material Design. Internally it was called "the new shell" and later simply "the shell". The subproject within Microsoft to develop the new shell was internally known as "Stimpy". In 1994, Microsoft designers Mark Malamud and Erik Gavriluk approached Brian Eno to compose music for the Windows 95 project. The result was the six-second start-up music-sound of the Windows 95 operating system, The Microsoft Sound and it was first released as a startup sound in May 1995 on Windows 95 May Test Release build 468. When released for Windows 95 and Windows NT 4.0, Internet Explorer 4 came with an optional Windows Desktop Update, which modified the shell to provide several additional updates to Windows Explorer, including a Quick Launch toolbar, and new features integrated with Internet Explorer, such as Active Desktop (which allowed Internet content to be displayed directly on the desktop). Some of the user interface elements introduced in Windows 95, such as the desktop, taskbar, Start menu and Windows

T.38

T.38 is an ITU recommendation for allowing transmission of fax over IP networks (FoIP) in real time. == History == The T.38 fax relay standard was devised in 1998 as a way to transport faxes across IP networks between existing Group 3 (G3) fax terminals. T.4 and related fax standards were published by the ITU in 1980, before the rise of the Internet. In the late 1990s, VoIP, or voice over IP, began to gain ground as an alternative to the conventional public switched telephone network (PSTN). However, because most VoIP systems are optimized (through their use of aggressive lossy bandwidth-saving compression) for voice rather than data calls, conventional fax machines worked poorly or not at all on them due to the network impairments such as delay, jitter, packet loss, and so on. Thus, some way of transmitting fax over IP was needed. == Overview == In practical scenarios, a T.38 fax call has at least part of the call being carried over PSTN, although this is not required by the T.38 definition, and two T.38 devices can send faxes to each other. This particular type of device is called Internet-Aware Fax device, or IAF, and it is capable of initiating or completing a fax call towards the IP network. The typical scenario where T.38 is used is – T.38 fax relay – where a T.30 fax device sends a fax over PSTN to a T.38 fax gateway which converts or encapsulates the T.30 protocol into a T.38 data stream. This is then sent either to a T.38-enabled end point such as fax machine or fax server or another T.38 gateway that converts it back to a PSTN PCM or analog signal and terminates the fax on a T.30 device. The T.38 recommendation defines the use of both TCP and UDP to transport T.38 packets. Implementations tend to use UDP, due to TCP's requirement for acknowledgement packets and resulting retransmission during packet loss, which introduces delays. When using UDP, T.38 copes with packet loss by using redundant data packets. T.38 is not a call setup protocol, thus the T.38 devices need to use standard call setup protocols to negotiate the T.38 call, e.g. H.323, SIP & MGCP. == Operation == There are two primary ways that fax transactions are conveyed across packet networks. The T.37 standard specifies how a fax image is encapsulated in e-mail and transported, ultimately, to the recipient using a store-and-forward process through intermediary entities. T.38, however, defines a protocol that supports the use of the T.30 protocol in both the sender and recipient terminals. (See diagram above.) T.38 lets one transmit a fax across an IP network in real time, just as the original G3 fax standards did for the traditional (time-division multiplexed (TDM)) network, also called the public switched telephone network or PSTN. A special protocol is needed for real-time fax over IP (Internet Protocol) since existing fax terminals only supported PSTN connections, where the information flow was generally smooth and uninterrupted, as opposed to the jittery arrival of IP packets. The trick was to come up with a protocol that makes the IP network “invisible” to the endpoint fax terminals, which would mean the user of a legacy fax terminal need not know that the fax call was traversing an IP network. The network interconnections supported by T.38 are shown above. The two fax terminals on either side of the figure communicate using the T.30 fax protocol published by the ITU in 1980. Interconnection of the PSTN with the IP packet network requires a “gateway” between the PSTN and IP networks. PSTN-IP Gateways support TDM voice on the PSTN side and VoIP and FoIP on the packet side. For voice sessions, the gateway will take in voice packets on the IP side, accumulate a few packets to ensure a smooth flow of TDM data upon their release, and then meter them out over TDM where they eventually are heard by a human or stored on a computer for later playback. The gateway employs packet-management techniques to enhance the quality of the speech in the presence of network errors by taking advantage of the natural ability of a listener to not really hear the occasional missing or repeated packet. But facsimile data are transmitted by modems, which aren't as forgiving as the human ear is for speech. Missing packets will often cause a fax session to fail at worst or create one or more image lines in error at best. So the job of T.38 is to “fool” the terminal into “thinking” that it's communicating directly with another T.30 terminal. It will also correct for network delays with so-called spoofing techniques, and missing or delayed packets with fax-aware buffer-management techniques. Spoofing refers to the logic implemented in the protocol engine of a T.38 relay that modifies the protocol commands and responses on the TDM side to keep network delays on the IP side from causing the transaction to fail. This is done, for example, by padding image lines or deliberately causing a message to be re-transmitted to render network delays transparent to the sending/receiving fax terminals. Networks that do not have packet loss or excessive delay can exhibit acceptable fax performance without T.38, provided the PCM clocks in all gateways are of very high accuracy (explained below). T.38 not only removes the effect of PCM clocks not being synchronized, but also reduces the required network bandwidth by a factor of 10, while it corrects for packet loss and delay. === Bandwidth reduction === As shown in the diagram below, a T.38 gateway is composed of two primary elements: the fax modems and the T.38 subsystem. The fax modems modulate and demodulate the PCM samples of the analog data, turning the sampled-data representation of the fax terminal's analog signal to its binary translation, and vice versa. The PSTN network samples the analog signal of a voice or modem signal (it doesn't know the difference) 8,000 times per second (SPS), and encodes them as 8-bit data bytes. This means 8000 samples-per-second times 8-bits per sample, or 64,000 bits per second (bit/s) to represent the modem (or voice) data in one direction. For both directions the modem transaction consumes 128,000 bits of network bandwidth. However, the typical modem in a fax terminal transmits the image data at 33,600 bit/s, so if the analog data are first converted to the digital content they represent, only 33,600 bits (plus network overhead of a few bytes) are needed. And since T.30 fax is a half-duplex protocol, the network is only needed for one direction at a time. Refer to RFC 3261 === PCM clock synchronization === In the diagram above, there is a sample-rate clock in the fax terminal and one in the gateway's modems that is used to trigger the sampling of the analog line 8,000 times per second. These clocks are usually quite accurate, but in some low-cost terminal adapters (a one or two-line gateway) the PCM clock can be surprisingly inaccurate. If the terminal is sending data to the gateway, and the gateway's clock is too slow, the buffers (jitter buffers) in the gateway will eventually overflow, causing the transaction to fail. Since the difference is often quite small, this problem occurs on long, detailed fax images giving the clocks more time to cause the jitter buffer in gateway to either underflow or overflow, which is just the same as missing or duplicated packets. === Packet loss === T.38 provides facilities to eliminate the effects of packet loss through data redundancy. When a packet is sent, either zero, one, two, three, or even more of the previously sent packets are repeated. (The specification does not impose a limit.) This increases the network bandwidth required (it's still much less than not using T.38) but it allows the receiving gateway to reconstruct the complete packet sequence, even with a fairly high level of packet loss. == Related standards == T.4 is the umbrella specification for fax. It specifies the standard image sizes, two forms of image-data compression (encoding), the image-data format, and references, T.30 and the various modem standards. T.6 specifies a compression scheme that reduces the time required to transmit an image by roughly 50-percent. T.30 specifies the procedures that a sending and receiving terminal use to set up a fax call, determine the image size, encoding, and transfer speed, the demarcation between pages, and the termination of the call. T.30 also references the various modem standards. V.21, V.27ter, V.29, V.17, V.34: ITU modem standards used in facsimile. The first three were ratified prior to 1980, and were specified in the original T.4 and T.30 standards. V.34 was published for fax in 1994. T.37 The ITU standard for sending a fax-image file via e-mail to the intended recipient of a fax. G.711 pass through - this is where the T.30 fax call is carried in a VoIP call encoded as audio. This is sensitive to network packet loss, jitter and clock synchronization. When using voice high-compression encoding techniques such as, but not limited to, G.729, some fax tonal signa

Algorithmic amplification

Algorithmic amplification is the process by which automated ranking and recommendation systems on digital platforms increase the visibility of certain content beyond its initial audience. Major platforms including Facebook, YouTube, TikTok, and X (formerly Twitter) use such systems to determine what appears in users' feeds and search results. The term is used in research on social media and digital media regulation to describe how platform design choices influence the distribution of online information. Unlike chronological feeds, algorithmic systems evaluate content using signals such as engagement rates, viewing duration, and predicted relevance to individual users. Content that performs strongly on these metrics may be promoted to progressively larger audiences through feeds, search rankings, or autoplay systems. The process is distinct from content moderation, which involves removing, labelling, or restricting content under platform rules, although the two can interact in practice. The concept is closely connected to the attention economy. Research has linked algorithmic amplification to the spread of misinformation and the circulation of political content, as well as to effects on young users' mental health. The scale and direction of those effects remain debated, in part because independent researchers have limited access to the internal workings of platform recommendation systems. Governments in the European Union, United Kingdom, United States, and China have pursued differing regulatory approaches to recommendation algorithms. The EU's Digital Services Act and the UK's Online Safety Act 2023 impose obligations on large platforms related to recommendation system transparency and risk, while China became the first country to enact binding legislation specifically targeting such systems. Internal documents and whistleblower testimony reported by the BBC in 2026 described how competitive pressure between Meta and TikTok led to trade-offs between engagement and user safety in the design of their recommendation systems. == Terminology == The term algorithmic amplification is used in media studies, platform governance scholarship and regulatory literature to describe how automated systems influence the distribution of content beyond what organic user sharing alone would produce. It is distinct from viral spread, which refers primarily to user-driven sharing behaviour, and from algorithmic bias, which describes systematic errors or unfairness in algorithmic outputs. The related term algorithmic curation is used for the broader process of selecting and ordering content, of which amplification is one possible outcome. The phrase also appears in regulatory and legislative discussion of recommendation systems. The European Union's Digital Services Act (DSA) identifies recommendation systems as a potential source of systemic risk, and the term appears frequently in academic and policy commentary on the regulation. In the United States, proposals including the Filter Bubble Transparency Act and the Kids Online Safety Act (KOSA) have used it to frame requirements around recommendation system transparency. In the United Kingdom, the House of Commons Science, Innovation and Technology Committee used the term in a 2025 report on how recommendation algorithms contributed to the spread of misinformation during the 2024 Southport riots. A Joint Declaration on AI and Freedom of Expression adopted in October 2025 by four international freedom of expression mandate holders, including the UN Special Rapporteur on Freedom of Opinion and Expression and the OSCE Representative on Freedom of the Media, stated that recommender systems and other AI-powered curation tools exert "a large hidden influence and gatekeeper role" over what information people access and consume. == Background == Early internet platforms typically displayed content in reverse-chronological order or through keyword-based search systems. Although the term is most often applied to social media, the underlying logic predates social media itself. A 2021 overview traced the origins of modern recommendation systems to the early 1990s, when they were first used experimentally for personal email and information filtering. The 1992 Tapestry mail system and the 1994 GroupLens news filtering system were early milestones before recommendation systems spread into e-commerce and other online services. As user bases and content volumes grew during the 2000s, major platforms including Google, YouTube, and Facebook developed machine-learning systems to personalise content delivery and prioritise material predicted to generate engagement. Facebook introduced its News Feed in 2006, which gradually shifted from chronological presentation towards algorithmically ranked content. YouTube altered its recommendation system in 2012 to prioritise watch time rather than clicks, a change the platform said was prompted by concerns that click-based metrics encouraged misleading thumbnails and low-quality videos. TikTok, launched internationally in 2018, adopted a model in which its primary content surface, the For You feed, is driven almost entirely by algorithmic recommendation rather than by a user's social graph. An internal document obtained by The New York Times in 2021 showed that the platform's algorithm optimised for retention and time spent, using signals such as watch duration, replays, likes, and comments to score and rank videos. Algorithmic recommendation also became central to platforms outside social media. Spotify's personalised features, including Discover Weekly, Release Radar, and Home recommendations, use behavioural signals and inferred "taste profiles" to surface tracks and artists beyond a listener's existing library. An ethnographic study of music curators at streaming platforms described this blend of algorithmic and human editorial selection as an "algo-torial" model of gatekeeping. Amazon adopted item-based collaborative filtering for product recommendations in 1998, and its recommendation engine has been described as one of the earliest large-scale deployments of recommendation technology in e-commerce. The same dynamics operate on adult content platforms. Law professor Amy Adler has argued that from 2007 onwards the pornography industry migrated to algorithm-driven streaming platforms, most of which are controlled by a single near-monopoly company, Aylo (formerly MindGeek). These platforms use algorithmic search engines, suggestions, rigid categorisation of content, and AI-driven search term optimisation in ways that produce the same distorting effects found on mainstream speech platforms, including filter bubbles, feedback loops, and the tendency of algorithmic recommendations to alter individual preferences. == Mechanisms == Recommendation systems commonly combine collaborative filtering, which predicts a user's preferences from the behaviour of similar users, with machine-learning models that predict which content a user is likely to engage with from their prior activity. In a common two-stage design, a platform first generates a set of candidate items from a large content pool and then ranks them using a scoring model with objectives such as predicted engagement or user satisfaction. Small changes in ranking criteria can shift exposure at scale, particularly when applied repeatedly across multiple browsing sessions. These systems typically rely on signals including engagement rates, viewing duration, click-through rates, and network relationships between users. Modern recommendation pipelines continuously update predictions as new behavioural data arrives, allowing platforms to adjust rankings in near real time. Users' revealed preferences, expressed through behaviour such as clicks and viewing time, do not always align with their stated preferences, expressed through explicit feedback such as surveys or content controls. Popularity signals can create feedback dynamics in which early engagement increases the likelihood that content will be shown to additional users. Experimental research on online cultural markets has demonstrated how such feedback processes can produce unequal visibility outcomes even when initial differences in content quality are small. == Beneficial and public-interest uses == Recommendation systems can help users navigate large volumes of content by surfacing material predicted to match their interests or needs, which can improve discoverability on platforms with large content libraries. In public health communication, platforms can help health authorities distribute timely information at scale, though the same recommendation systems also risk amplifying misinformation alongside official guidance. Sociologist Zeynep Tufekci has argued that the shift from independent blogs to large centralised platforms transferred gatekeeping power from traditional media to corporate algorithms. In the case of the Egyptian uprising of 2011, she noted that ordinary users

Textual entailment

In natural language processing, textual entailment (TE), also known as natural language inference (NLI), is a directional relation between text fragments. The relation holds whenever the truth of one text fragment follows from another text. == Definition == In the TE framework, the entailing and entailed texts are termed text (t) and hypothesis (h), respectively. Textual entailment is not the same as pure logical entailment – it has a more relaxed definition: "t entails h" (t ⇒ h) if, typically, a human reading t would infer that h is most likely true. (Alternatively: t ⇒ h if and only if, typically, a human reading t would be justified in inferring the proposition expressed by h from the proposition expressed by t.) The relation is directional because even if "t entails h", the reverse "h entails t" is much less certain. Determining whether this relationship holds is an informal task, one which sometimes overlaps with the formal tasks of formal semantics (satisfying a strict condition will usually imply satisfaction of a less strict conditioned); additionally, textual entailment partially subsumes word entailment. == Examples == Textual entailment can be illustrated with examples of three different relations: An example of a positive TE (text entails hypothesis) is: text: If you help the needy, God will reward you. hypothesis: Giving money to a poor man has good consequences. An example of a negative TE (text contradicts hypothesis) is: text: If you help the needy, God will reward you. hypothesis: Giving money to a poor man has no consequences. An example of a non-TE (text does not entail nor contradict) is: text: If you help the needy, God will reward you. hypothesis: Giving money to a poor man will make you a better person. == Ambiguity of natural language == A characteristic of natural language is that there are many different ways to state what one wants to say: several meanings can be contained in a single text and the same meaning can be expressed by different texts. This variability of semantic expression can be seen as the dual problem of language ambiguity. Together, they result in a many-to-many mapping between language expressions and meanings. The task of paraphrasing involves recognizing when two texts have the same meaning and creating a similar or shorter text that conveys almost the same information. Textual entailment is similar but weakens the relationship to be unidirectional. Mathematical solutions to establish textual entailment can be based on the directional property of this relation, by making a comparison between some directional similarities of the texts involved. == Approaches == Textual entailment measures natural language understanding as it asks for a semantic interpretation of the text, and due to its generality remains an active area of research. Many approaches and refinements of approaches have been considered, such as word embedding, logical models, graphical models, rule systems, contextual focusing, and machine learning. Practical or large-scale solutions avoid these complex methods and instead use only surface syntax or lexical relationships, but are correspondingly less accurate. As of 2005, state-of-the-art systems are far from human performance; a study found humans to agree on the dataset 95.25% of the time. Algorithms from 2016 had not yet achieved 90%. == Applications == Many natural language processing applications, like question answering, information extraction, summarization, multi-document summarization, and evaluation of machine translation systems, need to recognize that a particular target meaning can be inferred from different text variants. Typically entailment is used as part of a larger system, for example in a prediction system to filter out trivial or obvious predictions. Textual entailment also has applications in adversarial stylometry, which has the objective of removing textual style without changing the overall meaning of communication. == Datasets == Some of available English NLI datasets include: SNLI MultiNLI SciTail SICK MedNLI QA-NLI In addition, there are several non-English NLI datasets, as follows: XNLI DACCORD, RTE3-FR, SICK-FR for French FarsTail for Farsi OCNLI for Chinese SICK-NL for Dutch IndoNLI for Indonesian

Hardware backdoor

A hardware backdoor is a backdoor implemented within the physical components of a computer system, also known as its hardware. They can be created by introducing malicious code to a component's firmware, or even during the manufacturing process of an integrated circuit. Often, they are used to undermine security in smartcards and cryptoprocessors, unless investment is made in anti-backdoor design methods. They have also been considered for car hacking. Backdoors differ from hardware Trojans as backdoors are introduced intentionally by the original designer or during the design process, whereas hardware Trojans are inserted later by an external party. == Background == The existence of hardware backdoors poses significant security risks for several reasons. They are difficult to detect and are impossible to remove using conventional methods like antivirus software. They can also bypass other security measures, such as disk encryption. Hardware trojans can be introduced during manufacturing where the end-user lacks control over the production chain. == History == In 2008, the FBI reported the discovery of approximately 3,500 counterfeit Cisco network components in the United States, some of which were introduced in military and government infrastructure. In the same year, the possibility of a backdoor SPARC CPU was demonstrated with an FPGA running Linux that supported various hidden malicious services. A few years later, in 2011, Jonathan Brossard presented "Rakshasa", a proof-of-concept hardware backdoor. This backdoor could be installed by an individual with physical access to the hardware. It utilized coreboot to re-flash the BIOS with a SeaBIOS and iPXE-based bootkit composed of legitimate, open-source tools, allowing malware to be fetched from the internet during the boot process. The following year, in 2012, Sergei Skorobogatov and Christopher Woods from the University of Cambridge Computer Laboratory reported the discovery of a backdoor in a military-grade FPGA device, which could be exploited to access and modify sensitive information. It has been said that this was proven to be a software problem and not a deliberate attempt at sabotage. This still brought to attention that equipment manufacturers should ensure that microchips operate as intended. Later that year, two mobile phones developed by the Chinese company ZTE were found to carry a root access backdoor. According to security researcher Dmitri Alperovitch, the exploit used a hard-coded password in its software. Starting in 2012, the United States stated that Huawei might have backdoors present in their products. In 2013, researchers at the University of Massachusetts devised a method of breaking a CPU's internal cryptographic mechanisms by introducing specific impurities into the crystalline structure of transistors to change Intel's random-number generator. Documents revealed from 2013 onwards during the surveillance disclosures initiated by Edward Snowden showed that the Tailored Access Operations (TAO) unit and other NSA employees intercepted servers, routers, and other network gear being shipped to organizations targeted for surveillance to install covert implant firmware onto them before delivery. These tools include custom BIOS exploits that survive the reinstallation of operating systems and USB cables with spy hardware and radio transceiver packed inside. In June 2016 it was reported that University of Michigan Department of Electrical Engineering and Computer Science had built a hardware backdoor that leveraged "analog circuits to create a hardware attack" so that after the capacitors store up enough electricity to be fully charged, it would be switched on, to give an attacker complete access to whatever system or device − such as a PC − that contains the backdoored chip. In the study that won the "best paper" award at the IEEE Symposium on Privacy and Security they also note that microscopic hardware backdoor wouldn't be caught by practically any modern method of hardware security analysis, and could be planted by a single employee of a chip factory. In October 2018 Bloomberg reported that an attack by Chinese spies reached almost 30 U.S. companies, including Amazon and Apple, by compromising America's technology supply chain. == Countermeasures == Skorobogatov has developed a technique capable of detecting malicious insertions into chips. New York University Tandon School of Engineering researchers have developed a way to corroborate a chip's operation using verifiable computing whereby "manufactured for sale" chips contain an embedded verification module that proves the chip's calculations are correct and an associated external module validates the embedded verification module. Another technique developed by researchers at University College London (UCL) relies on distributing trust between multiple identical chips from disjoint supply chains. Assuming that at least one of those chips remains honest the security of the device is preserved. Researchers at the University of Southern California Ming Hsieh Department of Electrical and Computer Engineering and the Photonic Science Division at the Paul Scherrer Institute have developed a new technique called Ptychographic X-ray laminography. This technique is the only current method that allows for verification of the chips blueprint and design without destroying or cutting the chip. It also does so in significantly less time than other current methods. Anthony F. J. Levi Professor of electrical and computer engineering at University of Southern California explains “It’s the only approach to non-destructive reverse engineering of electronic chips—[and] not just reverse engineering but assurance that chips are manufactured according to design. You can identify the foundry, aspects of the design, who did the design. It’s like a fingerprint.” This method currently is able to scan chips in 3D and zoom in on sections and can accommodate chips up to 12 millimeters by 12 millimeters easily accommodating an Apple A12 chip but not yet able to scan a full Nvidia Volta GPU. "Future versions of the laminography technique could reach a resolution of just 2 nanometers or reduce the time for a low-resolution inspection of that 300-by-300-micrometer segment to less than an hour, the researchers say."