Locality-sensitive hashing

Locality-sensitive hashing

In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. The number of buckets is much smaller than the universe of possible input items. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items. Hashing-based approximate nearest-neighbor search algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as locality-preserving hashing (LPH). Locality-preserving hashing was initially devised as a way to facilitate data pipelining in implementations of massively parallel algorithms that use randomized routing and universal hashing to reduce memory contention and network congestion. == Definitions == A finite family F {\displaystyle {\mathcal {F}}} of functions h : M → S {\displaystyle h\colon M\to S} is defined to be an LSH family for a metric space M = ( M , d ) {\displaystyle {\mathcal {M}}=(M,d)} , a threshold r > 0 {\displaystyle r>0} , an approximation factor c > 1 {\displaystyle c>1} , and probabilities p 1 > p 2 {\displaystyle p_{1}>p_{2}} if it satisfies the following condition. For any two points a , b ∈ M {\displaystyle a,b\in M} and a hash function h {\displaystyle h} chosen uniformly at random from F {\displaystyle {\mathcal {F}}} : If d ( a , b ) ≤ r {\displaystyle d(a,b)\leq r} , then h ( a ) = h ( b ) {\displaystyle h(a)=h(b)} (i.e., a and b collide) with probability at least p 1 {\displaystyle p_{1}} , If d ( a , b ) ≥ c r {\displaystyle d(a,b)\geq cr} , then h ( a ) = h ( b ) {\displaystyle h(a)=h(b)} with probability at most p 2 {\displaystyle p_{2}} . Such a family F {\displaystyle {\mathcal {F}}} is called ( r , c r , p 1 , p 2 ) {\displaystyle (r,cr,p_{1},p_{2})} -sensitive. === LSH with respect to a similarity measure === Alternatively it is possible to define an LSH family on a universe of items U endowed with a similarity function ϕ : U × U → [ 0 , 1 ] {\displaystyle \phi \colon U\times U\to [0,1]} . In this setting, a LSH scheme is a family of hash functions H coupled with a probability distribution D over H such that a function h ∈ H {\displaystyle h\in H} chosen according to D satisfies P r [ h ( a ) = h ( b ) ] = ϕ ( a , b ) {\displaystyle Pr[h(a)=h(b)]=\phi (a,b)} for each a , b ∈ U {\displaystyle a,b\in U} . === Amplification === Given a ( d 1 , d 2 , p 1 , p 2 ) {\displaystyle (d_{1},d_{2},p_{1},p_{2})} -sensitive family F {\displaystyle {\mathcal {F}}} , we can construct new families G {\displaystyle {\mathcal {G}}} by either the AND-construction or OR-construction of F {\displaystyle {\mathcal {F}}} . To create an AND-construction, we define a new family G {\displaystyle {\mathcal {G}}} of hash functions g, where each function g is constructed from k random functions h 1 , … , h k {\displaystyle h_{1},\ldots ,h_{k}} from F {\displaystyle {\mathcal {F}}} . We then say that for a hash function g ∈ G {\displaystyle g\in {\mathcal {G}}} , g ( x ) = g ( y ) {\displaystyle g(x)=g(y)} if and only if all h i ( x ) = h i ( y ) {\displaystyle h_{i}(x)=h_{i}(y)} for i = 1 , 2 , … , k {\displaystyle i=1,2,\ldots ,k} . Since the members of F {\displaystyle {\mathcal {F}}} are independently chosen for any g ∈ G {\displaystyle g\in {\mathcal {G}}} , G {\displaystyle {\mathcal {G}}} is a ( d 1 , d 2 , p 1 k , p 2 k ) {\displaystyle (d_{1},d_{2},p_{1}^{k},p_{2}^{k})} -sensitive family. To create an OR-construction, we define a new family G {\displaystyle {\mathcal {G}}} of hash functions g, where each function g is constructed from k random functions h 1 , … , h k {\displaystyle h_{1},\ldots ,h_{k}} from F {\displaystyle {\mathcal {F}}} . We then say that for a hash function g ∈ G {\displaystyle g\in {\mathcal {G}}} , g ( x ) = g ( y ) {\displaystyle g(x)=g(y)} if and only if h i ( x ) = h i ( y ) {\displaystyle h_{i}(x)=h_{i}(y)} for one or more values of i. Since the members of F {\displaystyle {\mathcal {F}}} are independently chosen for any g ∈ G {\displaystyle g\in {\mathcal {G}}} , G {\displaystyle {\mathcal {G}}} is a ( d 1 , d 2 , 1 − ( 1 − p 1 ) k , 1 − ( 1 − p 2 ) k ) {\displaystyle (d_{1},d_{2},1-(1-p_{1})^{k},1-(1-p_{2})^{k})} -sensitive family. == Applications == LSH has been applied to several problem domains, including: Near-duplicate detection Hierarchical clustering Genome-wide association study Image similarity identification VisualRank Gene expression similarity identification Audio similarity identification Nearest neighbor search Audio fingerprint Digital video fingerprinting Shared memory organization in parallel computing Physical data organization in database management systems Training fully connected neural networks Computer security Machine learning == Methods == === Bit sampling for Hamming distance === One of the easiest ways to construct an LSH family is by bit sampling. This approach works for the Hamming distance over d-dimensional vectors { 0 , 1 } d {\displaystyle \{0,1\}^{d}} . Here, the family F {\displaystyle {\mathcal {F}}} of hash functions is simply the family of all the projections of points on one of the d {\displaystyle d} coordinates, i.e., F = { h : { 0 , 1 } d → { 0 , 1 } ∣ h ( x ) = x i for some i ∈ { 1 , … , d } } {\displaystyle {\mathcal {F}}=\{h\colon \{0,1\}^{d}\to \{0,1\}\mid h(x)=x_{i}{\text{ for some }}i\in \{1,\ldots ,d\}\}} , where x i {\displaystyle x_{i}} is the i {\displaystyle i} th coordinate of x {\displaystyle x} . A random function h {\displaystyle h} from F {\displaystyle {\mathcal {F}}} simply selects a random bit from the input point. This family has the following parameters: P 1 = 1 − R / d {\displaystyle P_{1}=1-R/d} , P 2 = 1 − c R / d {\displaystyle P_{2}=1-cR/d} . That is, any two vectors x , y {\displaystyle x,y} with Hamming distance at most R {\displaystyle R} collide under a random h {\displaystyle h} with probability at least P 1 {\displaystyle P_{1}} . Any x , y {\displaystyle x,y} with Hamming distance at least c R {\displaystyle cR} collide with probability at most P 2 {\displaystyle P_{2}} . === Min-wise independent permutations === Suppose U is composed of subsets of some ground set of enumerable items S and the similarity function of interest is the Jaccard index J. If π is a permutation on the indices of S, for A ⊆ S {\displaystyle A\subseteq S} let h ( A ) = min a ∈ A { π ( a ) } {\displaystyle h(A)=\min _{a\in A}\{\pi (a)\}} . Each possible choice of π defines a single hash function h mapping input sets to elements of S. Define the function family H to be the set of all such functions and let D be the uniform distribution. Given two sets A , B ⊆ S {\displaystyle A,B\subseteq S} the event that h ( A ) = h ( B ) {\displaystyle h(A)=h(B)} corresponds exactly to the event that the minimizer of π over A ∪ B {\displaystyle A\cup B} lies inside A ∩ B {\displaystyle A\cap B} . As h was chosen uniformly at random, P r [ h ( A ) = h ( B ) ] = J ( A , B ) {\displaystyle Pr[h(A)=h(B)]=J(A,B)\,} and ( H , D ) {\displaystyle (H,D)\,} define an LSH scheme for the Jaccard index. Because the symmetric group on n elements has size n!, choosing a truly random permutation from the full symmetric group is infeasible for even moderately sized n. Because of this fact, there has been significant work on finding a family of permutations that is "min-wise independent" — a permutation family for which each element of the domain has equal probability of being the minimum under a randomly chosen π. It has been established that a min-wise independent family of permutations is at least of size lcm ⁡ { 1 , 2 , … , n } ≥ e n − o ( n ) {\displaystyle \operatorname {lcm} \{\,1,2,\ldots ,n\,\}\geq e^{n-o(n)}} , and that this bound is tight. Because min-wise independent families are too big for practical applications, two variant notions of min-wise independence are introduced: restricted min-wise independent permutations families, and approximate min-wise independent families. Restricted min-wise independence is the min-wise independence property restricted to certain sets of cardinality at most k. Approximate min-wise independence differs from the property by at most a fixed ε. === Open source methods === ==== Nilsimsa Hash ==== Nilsimsa is a locality-sensitive hashing algorithm used in anti-spam efforts. The goal of Nilsimsa is to generate a hash digest of an email message such that the digests of two similar messages are similar to each other. The paper suggests that the Nilsimsa satisfies three requirements: The digest identifying each message should not

Poop Map

Poop Map is a social app where users can track on a map where and when they defecate. In addition to logging location and time of each bowel movement, users can also add a photo, "like" other users' logs, and rate each account. The social elements of the app allow for groups of users to create a competitive league. Certain behaviors unlock achievements in-app. == Development == The app was created by app developer Nino Uzelac. It was launched in July 2013. == Popularity == The app charted at number one on the Apple App Store charts in 2021 after going viral on TikTok. As of September 2024, the app has a 4.8 rating on the App Store and more than 58,000 ratings. It also has more than one million downloads on the Google Play Store. Poop Map is notably popular among hikers, and has been written about in the outdoors magazine Outside.

Hardware-based encryption

Hardware-based encryption is the use of computer hardware to assist software, or sometimes replace software, in the process of data encryption. Typically, this is implemented as part of the processor's instruction set. For example, the AES encryption algorithm (a modern cipher) can be implemented using the AES instruction set on the ubiquitous x86 architecture. Such instructions also exist on the ARM architecture. However, more unusual systems exist where the cryptography module is separate from the central processor, instead being implemented as a coprocessor, in particular a secure cryptoprocessor or cryptographic accelerator, of which an example is the IBM 4758, or its successor, the IBM 4764. Hardware implementations can be faster and less prone to exploitation than traditional software implementations, and furthermore can be protected against tampering. == History == Prior to the use of computer hardware, cryptography could be performed through various mechanical or electro-mechanical means. An early example is the Scytale used by the Spartans. The Enigma machine was an electro-mechanical system cipher machine notably used by the Germans in World War II. After World War II, purely electronic systems were developed. In 1987 the ABYSS (A Basic Yorktown Security System) project was initiated. The aim of this project was to protect against software piracy. However, the application of computers to cryptography in general dates back to the 1940s and Bletchley Park, where the Colossus computer was used to break the encryption used by German High Command during World War II. The use of computers to encrypt, however, came later. In particular, until the development of the integrated circuit, of which the first was produced in 1960, computers were impractical for encryption, since, in comparison to the portable form factor of the Enigma machine, computers of the era took the space of an entire building. It was only with the development of the microcomputer that computer encryption became feasible, outside of niche applications. The development of the World Wide Web lead to the need for consumers to have access to encryption, as online shopping became prevalent. The key concerns for consumers were security and speed. This led to the eventual inclusion of the key algorithms into processors as a way of both increasing speed and security. == Implementations == === In the instruction set === ==== x86 ==== The X86 architecture, as a CISC (Complex Instruction Set Computer) Architecture, typically implements complex algorithms in hardware. Cryptographic algorithms are no exception. The x86 architecture implements significant components of the AES (Advanced Encryption Standard) algorithm, which can be used by the NSA for Top Secret information. The architecture also includes support for the SHA Hashing Algorithms through the Intel SHA extensions. Whereas AES is a cipher, which is useful for encrypting documents, hashing is used for verification, such as of passwords (see PBKDF2). ==== ARM ==== ARM processors can optionally support Security Extensions. Although ARM is a RISC (Reduced Instruction Set Computer) architecture, there are several optional extensions specified by ARM Holdings. === As a coprocessor === IBM 4758 – The predecessor to the IBM 4764. This includes its own specialised processor, memory and a Random Number Generator. IBM 4764 and IBM 4765, identical except for the connection used. The former uses PCI-X, while the latter uses PCI-e. Both are peripheral devices that plug into the motherboard. === Proliferation === Advanced Micro Devices (AMD) processors are also x86 devices, and have supported the AES instructions since the 2011 Bulldozer processor iteration. Due to the existence of encryption instructions on modern processors provided by both Intel and AMD, the instructions are present on most modern computers. They also exist on many tablets and smartphones due to their implementation in ARM processors. == Advantages == Implementing cryptography in hardware means that part of the processor is dedicated to the task. This can lead to a large increase in speed. In particular, modern processor architectures that support pipelining can often perform other instructions concurrently with the execution of the encryption instruction. Furthermore, hardware can have methods of protecting data from software. Consequently, even if the operating system is compromised, the data may still be secure (see Software Guard Extensions). == Disadvantages == If, however, the hardware implementation is compromised, major issues arise. Malicious software can retrieve the data from the (supposedly) secure hardware – a large class of method used is the timing attack. This is far more problematic to solve than a software bug, even within the operating system. Microsoft regularly deals with security issues through Windows Update. Similarly, regular security updates are released for Mac OS X and Linux, as well as mobile operating systems like iOS, Android, and Windows Phone. However, hardware is a different issue. Sometimes, the issue will be fixable through updates to the processor's microcode (a low level type of software). However, other issues may only be resolvable through replacing the hardware, or a workaround in the operating system which mitigates the performance benefit of the hardware implementation, such as in the Spectre exploit.

Data communication

Data communication is the transfer of data over a point-to-point or point-to-multipoint communication channel. Data communication comprises data transmission and data reception and can be classified as analog transmission and digital communications. Analog data communication conveys voice, data, image, signal or video information using a continuous signal, which varies in amplitude, phase, or some other property. In baseband analog transmission, messages are represented by a sequence of pulses by means of a line code; in passband analog transmission, they are communicated by a limited set of continuously varying waveforms, using a digital modulation method. Passband modulation and demodulation are carried out by modem equipment. Digital transmission and digital reception are the transfer of either a digitized analog signal or a born-digital bitstream. Baseband digital transmission is regarded as comprising part of a digital signal, whereas passband transmission of digital data may also or alternatively be considered a form of digital-to-analog conversion. Data communication channels include copper wires, optical fibers, wireless communication using radio spectrum, storage media and computer buses. The data are represented as an electromagnetic signal, such as an electrical voltage, radiowave, microwave, or infrared signal. == Distinction between related subjects == Digital transmission or data transmission traditionally belongs to telecommunications and electrical engineering. Basic principles of data transmission may also be covered within the computer science or computer engineering topic of data communications, which also includes computer networking applications and communication protocols, for example, routing, switching and inter-process communication. Although the Transmission Control Protocol (TCP) involves transmission, TCP and other transport layer protocols are covered in computer networking but not discussed in a textbook or course about data transmission. In most textbooks, the term analog transmission only refers to the transmission of an analog message signal (without digitization) by means of an analog signal, either as a non-modulated baseband signal or as a passband signal using an analog modulation method such as AM or FM. It may also include analog-over-analog pulse modulated baseband signals such as pulse-width modulation. In a few books within the computer networking tradition, analog transmission also refers to passband transmission of bit-streams using digital modulation methods such as FSK, PSK and ASK. The theoretical aspects of data transmission are covered by information theory and coding theory. == Protocol layers and sub-topics == Courses and textbooks in the field of data transmission typically deal with the following OSI model protocol layers and topics: Layer 1, the physical layer: Channel coding including Digital modulation schemes Line coding schemes Forward error correction (FEC) codes Bit synchronization Multiplexing Equalization Channel models Layer 2, the data link layer: Channel access schemes, media access control (MAC) Packet mode communication and Frame synchronization Error detection and automatic repeat request (ARQ) Flow control Layer 6, the presentation layer: Source coding (digitization and data compression), and information theory. Cryptography (may occur at any layer) It is also common to deal with the cross-layer design of those three layers. == Applications and history == Data (mainly but not exclusively informational) has been sent via non-electronic (e.g. optical, acoustic, mechanical) means since the advent of communication. Analog signal data has been sent electronically since the advent of the telephone. However, the first data electromagnetic transmission applications in modern time were electrical telegraphy (1809) and teletypewriters (1906), which are both digital signals. The fundamental theoretical work in data transmission and information theory by Harry Nyquist, Ralph Hartley, Claude Shannon and others during the early 20th century, was done with these applications in mind. In the early 1960s, Paul Baran invented distributed adaptive message block switching for digital communication of voice messages using switches that were low-cost electronics. Donald Davies invented and implemented modern data communication during 1965–7, including packet switching, high-speed routers, communication protocols, hierarchical computer networks and the essence of the end-to-end principle. Baran's work did not include routers with software switches and communication protocols, nor the idea that users, rather than the network itself, would provide the reliability. Both were seminal contributions that influenced the development of computer networks. Data transmission is utilized in computers in computer buses and for communication with peripheral equipment via parallel ports and serial ports such as RS-232 (1969), FireWire (1995) and USB (1996). The principles of data transmission are also utilized in storage media for error detection and correction since 1951. The first practical method to overcome the problem of receiving data accurately by the receiver using digital code was the Barker code invented by Ronald Hugh Barker in 1952 and published in 1953. Data transmission is utilized in computer networking equipment such as modems (1940), local area network (LAN) adapters (1964), repeaters, repeater hubs, microwave links, wireless network access points (1997), etc. In telephone networks, digital communication is utilized for transferring many phone calls over the same copper cable or fiber cable by means of pulse-code modulation (PCM) in combination with time-division multiplexing (TDM) (1962). Telephone exchanges have become digital and software controlled, facilitating many value-added services. For example, the first AXE telephone exchange was presented in 1976. Digital communication to the end user using Integrated Services Digital Network (ISDN) services became available in the late 1980s. Since the end of the 1990s, broadband access techniques such as ADSL, Cable modems, fiber-to-the-building (FTTB) and fiber-to-the-home (FTTH) have become widespread to small offices and homes. The current tendency is to replace traditional telecommunication services with packet mode communication such as IP telephony and IPTV. Transmitting analog signals digitally allows for greater signal processing capability. The ability to process a communications signal means that errors caused by random processes can be detected and corrected. Digital signals can also be sampled instead of continuously monitored. The multiplexing of multiple digital signals is much simpler compared to the multiplexing of analog signals. Because of all these advantages, because of the vast demand to transmit computer data and the ability of digital communications to do so and because recent advances in wideband communication channels and solid-state electronics have allowed engineers to realize these advantages fully, digital communications have grown quickly. The digital revolution has also resulted in many digital telecommunication applications where the principles of data transmission are applied. Examples include second-generation (1991) and later cellular telephony, video conferencing, digital TV (1998), digital radio (1999), and telemetry. Data transmission, digital transmission or digital communications is the transfer of data over a point-to-point or point-to-multipoint communication channel. Examples of such channels include copper wires, optical fibers, wireless communication channels, storage media and computer buses. The data are represented as an electromagnetic signal, such as an electrical voltage, radio wave, microwave, or infrared light. While analog transmission is the transfer of a continuously varying analog signal over an analog channel, digital communication is the transfer of discrete messages over a digital or an analog channel. The messages are either represented by a sequence of pulses by means of a line code (baseband transmission) or by a limited set of continuously varying waveforms (passband transmission), using a digital modulation method. The passband modulation and corresponding demodulation (also known as detection) are carried out by modem equipment. According to the most common definition of a digital signal, both baseband and passband signals representing bit-streams are considered as digital transmission, while an alternative definition only considers the baseband signal as digital, and passband transmission of digital data as a form of digital-to-analog conversion. Data transmitted may be digital messages originating from a data source, for example, a computer or a keyboard. It may also be an analog signal, such as a phone call or a video signal, digitized into a bit-stream, for example,e using pulse-code modulation (PCM) or more advanced source coding (analog-to-digital conversion and

X2 transceiver

The X2 transceiver format is a 10 gigabit per second modular fiber optic interface intended for use in routers, switches and optical transport platforms. It is an early generation 10 gigabit interface related to the similar XENPAK and XPAK formats. X2 may be used with 10 Gigabit Ethernet or OC-192/STM-64 speed SDH/SONET equipment. X2 modules are smaller and consume less power than first-generation XENPAK modules, but larger and consume more energy than the newer XFP transceiver standard and SFP+ standards. As of 2016 this format is relatively uncommon and has been replaced by 10 Gbit/s SFP+ in most new equipment.

Dark mode

A dark mode, dark theme, night mode, or light-on-dark color scheme is a color scheme that uses light-colored text, icons, and graphical user interface elements on a dark background. It is often discussed in terms of computer user interface design and web design. Many modern websites and operating systems offer the user an optional light-on-dark display mode. Some users find dark mode displays more visually appealing, and claim that it can reduce eye strain. Displaying white at full brightness uses roughly six times as much power as pure black on a 2016 Google Pixel, which has an OLED display. However, conventional LED displays may not benefit from reduced power consumption; but if a LED display has the partial dimming features, it still benefits from reduced power consumption. Most modern operating systems support an optional light-on-dark color scheme. == History == Microsoft introduced the high contrast themes in Windows 95. Later, Microsoft introduced a dark theme in the Anniversary Update of Windows 10 in 2016. In 2018, Apple followed in macOS Mojave. In September 2019, iOS 13 and Android 10 both introduced dark modes. Some operating systems provide tools to change the dark mode state automatically at sundown or sunrise. A "prefers-color-scheme" option was created for front-end web developers in 2019, being a CSS property that signals a user's choice for their system to use a light or dark color theme. Firefox and Chromium have optional dark theme for all internal screens. It is also possible for third-party developers to implement their own dark themes. There are also a variety of browser add-ons that can re-theme web sites with dark color schemes, also aligning with system theme. Wikipedia's mobile and desktop versions received a dark mode option in 2024. == Implementation == There is a prefers-color-scheme media query in CSS, to detect if the user has requested light or dark color scheme and serve the requested color scheme. It can be indicated from the user's operating system preference or a user agent. CSS example: JavaScript example: == Energy usage == Light on dark color schemes require less energy to display on OLED displays. This positively impacts battery life and reduces energy consumption. While an OLED will consume around 40% of the power of an LCD displaying an image that is primarily black, it can use more than three times as much power to display an image with a white background, such as a document or web site. This can lead to reduced battery life and higher energy usage unless a light-on-dark color scheme is used. The long-term reduced power usage may also prolong battery life or the useful life of the display and battery. The energy savings that can be achieved using a light-on-dark color scheme are because of how OLED screens work: in an OLED screen, each subpixel generates its own light and it only consumes power when generating light. This is in contrast to how an LCD works: in an LCD, subpixels either block or allow light from an always-on (lit) LED backlight to pass through. "AMOLED Black" color schemes (that use pure black instead of dark gray) do not necessarily save more energy than other light-on-dark color schemes that use dark gray instead of black, as the power consumption on an AMOLED screen decreases proportionately to the average brightness of the displayed pixels. Although it is true that AMOLED black does save more energy than dark gray, the additional energy savings are often negligible; AMOLED black will only give an additional energy saving of less than 1%, for instance, over the dark gray that's used in the dark theme for Google's official Android apps. In November 2018, Google confirmed that dark mode on Android saved battery life. == Web issues == Some argue that a color scheme with light text on a dark background is easier to read on the screen, because the lower overall brightness causes less eyestrain, while others argue to the contrary. Some pages on the web are designed for white backgrounds; Image assets (GIF, PNG, SVG, WOFF, etc) can be used improperly causing visual artifacts if dark mode is forced (instead of designed for) with a plugin like Dark Reader.

Digital entertainment

Digital entertainment Industry includes, but is not restricted to, any combination of the following industries (that themselves have a considerable degree of overlap): digital media new media video on demand video games interactive entertainment online gambling mobile entertainment social media streaming services "Digital entertainment", largely a hard to define marketing term, rests upon entertainment technology and ultimately on the enabling basic technologies computers, Internet/World Wide Web, digital rights management, multimedia and streaming media. Apart from pure entertainment, the term rests upon the observation that already in 2011 in the UK, for example, "nearly half of people’s waking hours are spent using media content and communications services" ("screen time"). Digital entertainment is inextricably connected with digital marketing. People who follow influencers on social media for entertainment will receive a fair share of advertising at the same time. Digital merchandise is distributed with every computer game and popup ads or similar are ubiquitous in the online (gaming) world.