AI Data Water

AI Data Water — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Audio-visual speech recognition

    Audio-visual speech recognition

    Audio visual speech recognition (AVSR) is a technique that uses image processing capabilities in lip reading to aid speech recognition systems in recognizing indeterministic phones or giving preponderance among near probability decisions. Each system of lip reading and speech recognition works separately, then their results are mixed at the stage of feature fusion. As the name suggests, it has two parts. First one is the audio part and second one is the visual part. In audio part we use features like log mel spectrogram, mfcc etc. from the raw audio samples and we build a model to get feature vector out of it . For visual part generally we use some variant of convolutional neural network to compress the image to a feature vector after that we concatenate these two vectors (audio and visual ) and try to predict the target object.

    Read more →
  • Baidu Fanyi

    Baidu Fanyi

    Baidu Fanyi is a service for translating text paragraphs and web pages provided by Baidu. In 2015, Baidu Translation won the second prize of China's National Science and Technology Progress Award. == Supported languages == Baidu translate has some languages that are missing from Google Translate, such as Cornish, albeit some of them are poor quality. As of June 2026, translation is available in 201 languages:

    Read more →
  • Ω-automaton

    Ω-automaton

    In automata theory, a branch of theoretical computer science, an ω-automaton (or stream automaton) is a variation of a finite automaton that runs on infinite, rather than finite, strings as input. Since ω-automata do not stop, they have a variety of acceptance conditions rather than simply a set of accepting states. ω-automata are useful for specifying behavior of systems that are not expected to terminate, such as hardware, operating systems and control systems. For such systems, one may want to specify a property such as "for every request, an acknowledge eventually follows", or its negation "there is a request that is not followed by an acknowledge". The former is a property of infinite words: one cannot say of a finite sequence that it satisfies this property. Classes of ω-automata include the Büchi automata, Rabin automata, Streett automata, parity automata and Muller automata, each deterministic or non-deterministic. These classes of ω-automata differ only in terms of acceptance condition. They all recognize precisely the regular ω-languages except for the deterministic Büchi automata, which is strictly weaker than all the others. Although all these types of automata recognize the same set of ω-languages, they nonetheless differ in succinctness of representation for a given ω-language. == Deterministic ω-automata == Formally, a deterministic ω-automaton is a tuple A = ( Q , Σ , δ , q 0 , A a c c ) {\textstyle A=(Q,\Sigma ,\delta ,q_{0},A_{acc})} , that consists of the following components: Q {\textstyle Q} , is a finite set. The elements of Q {\textstyle Q} are called the states of A {\textstyle A} . Σ {\textstyle \Sigma } , is a finite set called the alphabet of A {\textstyle A} . δ : Q × Σ → Q {\textstyle \delta \colon Q\times \Sigma \rightarrow Q} is a function, called the transition function of A {\textstyle A} . Q 0 {\textstyle Q_{0}} is an element of Q {\textstyle Q} , called the initial state. A a c c {\textstyle A_{acc}} is a set of accepting states of A {\textstyle A} , formally a subset of Q ω {\textstyle Q^{\omega }} . An input for A {\textstyle A} is an infinite string over the alphabet Σ {\textstyle \Sigma } , i.e. it is an infinite sequence α = ( a 1 , a 2 , a 3 , … ) {\textstyle \alpha =(a_{1},a_{2},a_{3},\ldots )} . The run of A {\textstyle A} on such an input is an infinite sequence ρ = ( r 0 , r 1 , r 2 , … ) {\textstyle \rho =(r_{0},r_{1},r_{2},\ldots )} of states, defined as follows: r 0 = q 0 {\textstyle r_{0}=q_{0}} . r 1 = δ ( r 0 , a 1 ) {\textstyle r_{1}=\delta (r_{0},a_{1})} . r 2 = δ ( r 1 , a 2 ) {\textstyle r_{2}=\delta (r_{1},a_{2})} . ... that is, for every i {\textstyle i} : r i = δ ( r i − 1 , a i ) {\textstyle r_{i}=\delta (r_{i-1},a_{i})} . The main purpose of an ω-automaton is to define a subset of the set of all inputs: The set of accepted inputs. Whereas in the case of an ordinary finite automaton every run ends with a state r n {\textstyle r_{n}} and the input is accepted if and only if r n {\textstyle r_{n}} is an accepting state, the definition of the set of accepted inputs is more complicated for ω-automata. Here we must look at the entire run ρ {\textstyle \rho } . The input is accepted if the corresponding run is in Acc {\textstyle {\text{Acc}}} . The set of accepted input ω-words is called the recognized ω-language by the automaton, which is denoted as L ( A ) {\textstyle L(A)} . The definition of Acc {\textstyle {\text{Acc}}} as a subset of Q ω {\textstyle Q^{\omega }} is purely formal and not suitable for practice because normally such sets are infinite. The difference between various types of ω-automata (Büchi, Rabin etc.) consists in how they encode certain subsets Acc {\textstyle {\text{Acc}}} of Q ω {\textstyle Q^{\omega }} as finite sets, and therefore in which such subsets they can encode. == Nondeterministic ω-automata == Formally, a nondeterministic ω-automaton is a tuple A = ( Q , Σ , Δ , Q 0 , Acc ) {\textstyle A=(Q,\Sigma ,\Delta ,Q_{0},{\text{Acc}})} that consists of the following components: Q {\textstyle Q} is a finite set. The elements of Q {\textstyle Q} are called the states of A {\textstyle A} . Σ {\textstyle \Sigma } is a finite set called the alphabet of A {\textstyle A} . Δ {\textstyle \Delta } is a subset of Q × Σ × Q {\textstyle Q\times \Sigma \times Q} and is called the transition relation of A {\textstyle A} . Q 0 {\textstyle Q_{0}} is a subset of Q {\textstyle Q} , called the initial set of states. Acc {\textstyle {\text{Acc}}} is the acceptance condition, a subset of Q ω {\textstyle Q^{\omega }} . Unlike a deterministic ω-automaton, which has a transition function δ {\textstyle \delta } , the non-deterministic version has a transition relation Δ {\textstyle \Delta } . Note that Δ {\textstyle \Delta } can be regarded as a function Q × Σ → P ( Q ) {\textstyle Q\times \Sigma \rightarrow {\mathcal {P}}(Q)} from Q × Σ {\textstyle Q\times \Sigma } to the power set P ( Q ) {\textstyle {\mathcal {P}}(Q)} . Thus, given a state q n {\textstyle q_{n}} and a symbol a n {\textstyle a_{n}} , the next state q n + 1 {\textstyle q_{n+1}} is not necessarily determined uniquely, rather there is a set of possible next states. A run of A {\textstyle A} on the input α = ( a 1 , a 2 , a 3 , … ) {\textstyle \alpha =(a_{1},a_{2},a_{3},\ldots )} is any infinite sequence ρ = ( r 0 , r 1 , r 2 , … ) {\textstyle \rho =(r_{0},r_{1},r_{2},\ldots )} of states that satisfies the following conditions: r 0 {\textstyle r_{0}} is an element of Q 0 {\textstyle Q_{0}} . r 1 {\textstyle r_{1}} is an element of Δ ( r 0 , a 1 ) {\textstyle \Delta (r_{0},a_{1})} . r 2 {\textstyle r_{2}} is an element of Δ ( r 1 , a 2 ) {\textstyle \Delta (r_{1},a_{2})} . ... that is, for every i {\textstyle i} : r i {\textstyle r_{i}} is an element of Δ ( r i − 1 , a i ) {\textstyle \Delta (r_{i-1},a_{i})} . A nondeterministic ω-automaton may admit many different runs on any given input, or none at all. The input is accepted if at least one of the possible runs is accepting. Whether a run is accepting depends only on Acc {\textstyle {\text{Acc}}} , as for deterministic ω-automata. Every deterministic ω-automaton can be regarded as a nondeterministic ω-automaton by taking Δ {\textstyle \Delta } to be the graph of δ {\textstyle \delta } . The definitions of runs and acceptance for deterministic ω-automata are then special cases of the nondeterministic cases. == Acceptance conditions == Acceptance conditions may be infinite sets of ω-words. However, people mostly study acceptance conditions that are finitely representable. The following lists a variety of popular acceptance conditions. Before discussing the list, let's make the following observation. In the case of infinitely running systems, one is often interested in whether certain behavior is repeated infinitely often. For example, if a network card receives infinitely many ping requests, then it may fail to respond to some of the requests but should respond to an infinite subset of received ping requests. This motivates the following definition: For any run ρ {\textstyle \rho } , let Inf ( ρ ) {\textstyle {\text{Inf}}(\rho )} be the set of states that occur infinitely often in ρ {\textstyle \rho } . This notion of certain states being visited infinitely often will be helpful in defining the following acceptance conditions. A Büchi automaton is an ω-automaton A {\textstyle A} that uses the following acceptance condition, for some subset F {\textstyle F} of Q {\textstyle Q} : Büchi condition A {\textstyle A} accepts exactly those runs ρ {\textstyle \rho } for which Inf ( ρ ) ∩ F ≠ ∅ {\textstyle {\text{Inf}}(\rho )\cap F\neq \emptyset } , i.e. there is an accepting state that occurs infinitely often in ρ {\textstyle \rho } . A Rabin automaton is an ω-automaton A {\textstyle A} that uses the following acceptance condition, for some set Ω {\textstyle \Omega } of pairs ( B i , G i ) {\textstyle (B_{i},G_{i})} of sets of states: Rabin condition A {\textstyle A} accepts exactly those runs ρ {\textstyle \rho } for which there exists a pair ( B i , G i ) {\textstyle (B_{i},G_{i})} in Ω {\textstyle \Omega } such that B i ∩ Inf ( ρ ) = ∅ {\textstyle B_{i}\cap {\text{Inf}}(\rho )=\emptyset } and G i ∩ Inf ( ρ ) ≠ ∅ {\textstyle G_{i}\cap {\text{Inf}}(\rho )\neq \emptyset } . A Streett automaton is an ω-automaton A {\textstyle A} that uses the following acceptance condition, for some set Ω {\textstyle \Omega } of pairs ( B i , G i ) {\textstyle (B_{i},G_{i})} of sets of states: Streett condition A {\textstyle A} accepts exactly those runs ρ {\textstyle \rho } such that for all pairs ( B i , G i ) {\textstyle (B_{i},G_{i})} in Ω {\textstyle \Omega } , B i ∩ Inf ( ρ ) ≠ ∅ {\textstyle B_{i}\cap {\text{Inf}}(\rho )\neq \emptyset } or G i ∩ Inf ( ρ ) = ∅ {\textstyle G_{i}\cap {\text{Inf}}(\rho )=\emptyset } . A parity automaton is an automaton A {\textstyle A} whose set of states is Q = { 0 , 1 , 2 , … , k } {\textstyle Q=\{0,1,2,\ldots ,k\}} for some natural number k {\textst

    Read more →
  • Salvatore J. Stolfo

    Salvatore J. Stolfo

    Salvatore J. Stolfo is an academic and professor of computer science at Columbia University, specializing in computer security. == Early life == Born in Brooklyn, New York, Stolfo received a Bachelor of Science degree in Computer Science and Mathematics from Brooklyn College in 1974. He received his Ph.D. from NYU Courant Institute in 1979 and has been on the faculty of Columbia ever since, where he's taught courses in Artificial Intelligence, Intrusion and Anomaly Detection Systems, Introduction to Programming, Fundamental Algorithms, Data Structures, and Knowledge-Based Expert Systems. == Academic research == While at Columbia, Stolfo has received close to $50M in funding for research that has broadly focused on Security, Intrusion Detection, Anomaly Detection, Machine Learning and includes early work in parallel computing and artificial intelligence. He has published or co-authored over 250 papers and has over 46,000 citations with an H-index of 102. In 1996 he proposed a project with DARPA that applies machine learning to behavioral patterns to detect fraud or intrusion in networks. DADO, developed by in part by Stolfo, introduced the parallel computing primitive: “Broadcast, Resolve, Report”, a hardwire implemented mechanism that today is called MapReduce. Among his earliest work, Stolfo along with colleague Greg Vesonder of Bell Labs, developed a large-scale expert data analysis system, called ACE (Automated Cable Expertise) for the nation's phone system. AT&T Bell Labs distributed ACE to a number of telephone wire centers to improve the management and scheduling of repairs in the local loop. Stolfo coined the term FOG computing (not to be confused with fog computing) where technology is used “to launch disinformation attacks against malicious insiders, preventing them from distinguishing the real sensitive customer data from fake worthless data.” In 2005 Stolfo received funding from the Army Research Office to conduct a workshop to bring together a group of researchers to help identify a research program to focus on insider threats. He was elevated to IEEE Fellow in 2018 "for his contributions to machine learning based cybersecurity." He was elected as an ACM Fellow in 2019 "for contributions to machine-learning-based cybersecurity and parallel hardware for database inference systems". == Career == Founded in 2011, Red Balloon Security (or RBS) is a cyber security company founded by Dr Sal Stolfo and Dr Ang Cui. A spinout from the IDS lab, RBS developed a symbiote technology called FRAK as a host defense for embedded systems under the sponsorship of DARPA's Cyber Fast Track program. Created based on their IDS lab research for the DARPA Active Authentication and the Anomaly Detection at Multiple Scales program, Dr Sal Stolfo and Dr. Angelos Keromytis founded Allure Security Technologies. Using active behavioral authentication and decoy technology Stolfo pioneered and patented in 1996. Founded in 2009, Allure Security Technology was created based on work done under DARPA sponsorship in Columbia's IDS lab based on DARPA prompts to research how to detect hackers once they are inside an organization's perimeter and how to continuously authenticate a user without a password. Stolfo's company Electronic Digital Documents produced a “DataBlade” technology, which Informix marketed during their strategy of acquisition and development in the mid 80's. Stolfo's patented merge/purge technology called EDD DataCleanser DataBlade was licensed by Informix. Since its acquisition by IBM in 2005, IBM Informix is one of the world's most widely used database servers, with users ranging from the world's largest corporations to startups. System Detection was one of the companies founded by Prof. Stolfo to commercialize the Anomaly Detection technology developed in the IDS lab. The company ultimately reorganized and was rebranded as Trusted Computer Solutions. That company was recently acquired by Raytheon. Recently a jury awarded Columbia University $185 million for patent infringement for one of Prof. Stolfo's inventions, the Application Communities technology. https://news.columbia.edu/news/columbia-university-awarded-185-million-patent-infringement-nortonlifelock-inc. The final order from the judge applied nearly treble damages: https://www.reuters.com/legal/litigation/gen-digital-owes-columbia-481-mln-us-patent-fight-judge-says-2023-10-02/

    Read more →
  • Pixel shift

    Pixel shift

    Pixel shift is a method in digital cameras for producing a super-resolution image. The method works by taking several images, after each such capture moving ("shifting") the sensor to a new position. In digital colour cameras that employ pixel shift, this avoids a major limitation inherent in using Bayer pattern for obtaining colour, and instead produces an image with increased colour resolution and, assuming a static subject or additional computational steps, an image free of colour moiré. Taking this idea further, sub-pixel shifting may increase the resolution of the final image beyond that suggested by the specified resolution of the image sensor. Additionally, assuming that the various individual captures are taken at the same sensitivity, the final combined image will have less image noise than a single capture. This can be thought of as an averaging effect (for instance, in a pixel shift image composed of four individual frames with a classic Bayer pattern, every pixel in the final colour image is based on two measurements of the green channel). == List of cameras implementing pixel shift == All of the following cameras are fabricated with one imaging sensor, thus any kind of pixel shift requires a movement of the whole sensor. === Canon === Canon R5: Contains a 45 Mpixel sensor. The High-Resolution Mode shifts the sensor by one pixel to obtain a sequence of nine images that are merged into a 400 Mpixel image. === Fujifilm === Fujifilm GFX50S II: contains a 51 Mpixel sensor. The Pixel Shift Multi-Shot mode shifts the imaging sensor by 0.5-pixel movements to obtain a sequence of 16 images that are subsequently merged into a 200 Mpixel image. Fujifilm GFX100, Fujifilm GFX100 II: contains a 102 Mpixel sensor. A sequence of 16 pixel shifted images are merged into a 400 Mpixel image. Fujifilm GFX100S, Fujifilm GFX100S II: contains a 102 Mpixel sensor. A sequence of 16 pixel shifted images are merged into a 400 Mpixel image Fujifilm GFX100IR: contains a 102 Mpixel sensor. A sequence of 16 pixel shifted images are merged into a 400 Mpixel image Fujifilm X-H2: contains a 40 Mpixel sensor. A sequence of 20 shifted images are merged into a 160 Mpixel image. Fujifilm X-T5: contains a 40 Mpixel sensor. A sequence of 20 shifted images are merged into a 160 Mpixel image. === Nikon === Nikon Z8: contains a 47.5 Mpixel sensor. The High Res shot mode shifts the imaging sensor by 0.5-pixel movements to obtain a sequence of up to 32 images that can be merged in Nikon's NX studio software. Nikon Zf: contains a 24 Mpixel sensor. The High Res shot mode shifts the imaging sensor by 0.5-pixel movements to obtain a sequence of up to 32 images that can be merged in Nikon's NX studio software. === Olympus === Olympus OM-D E-M1 Mark II: contains a 20.4 Mpixel sensor. The High Res shot mode produces a 50 Mpixel image. Olympus OM-D E-M5 Mark II: contains a 16 Mpixel sensor. The High Res shot mode shifts the imaging sensor by 0.5-pixel movements to obtain a sequence of 8 images that are subsequently merged into a 40 Mpixel image. Olympus OM-D E-M5 Mark III: contains a 20.4 Mpixel sensor. The High Res shot mode shifts the imaging sensor by 0.5-pixel movements to obtain a sequence of 8 images that are subsequently merged into a 50 Mpixel image. Olympus OM-D E-M1X: contains a 20.4 Mpixel sensor. The camera sports two pixel shift mode: (a) the 80Mp Tripod mode produces an 80 Mpixel image, (b) the Handheld High Res shot mode produces a 50 Mpixel image. Olympus PEN-F: contains a 20.4 Mpixel sensor. The High Res Shot mode takes multiple images, continually shifting the position of the sensor in sub-pixel increments. Combining these images results in either a 50MP JPEG or an 80MP Raw file. ==== OM System ==== OM System OM-1: contains a 20MPix sensor. The High Res Shot mode takes multiple images, and it can be used handheld or on a tripod. Handheld it will internally produce 50 Mpix files and 80 Mpix when mounted on a tripod. OM System OM-5: contains a 20MPix sensor. The High Res Shot mode takes multiple images, and it can be used handheld or on a tripod. Handheld it will internally produce 50 Mpix files and 80 Mpix when mounted on a tripod. === Panasonic === Panasonic Lumix DC-G9: contains a 20.3 Mpixel sensor. The High Resolution Mode takes a sequence of 8 shots in quick succession between which the sensor is shifted by 0.5 pixel for each image. These are subsequently merged into an 80 Mpixel image. Panasonic Lumix DC-S1: contains a 24.2 Mpixel sensor. The High Resolution Mode takes a sequence of shots in quick succession between which the sensor is shifted by a small amount. These are subsequently merged into a 96 Mpixel image. Panasonic Lumix DC-S1R: contains a 47.3 Mpixel sensor. The High Resolution Mode shifts the imaging sensor by a small increments to obtain a sequence of 8 images that are subsequently merged into a 187 Mpixel image. Panasonic Lumix DC-S1H Panasonic Lumix DC-S5 === Pentax === Pentax K-70: contains a 24.3 Mpixel sensor. The pixel shift mode takes a sequence of 4 shots between which the sensor is shifted by 1 pixel. These are subsequently merged into an image sporting 'all color data in each pixel to deliver super-high-resolution images'. Pentax KP: contains a 24.3 Mpixel sensor. The pixel shift mode takes a sequence of 4 shots between which the sensor is shifted by 1 pixel. These are subsequently merged into an image sporting 'high-resolution images with more accurate colours and much finer details'. Pentax K-3 II: contains a 24.3 Mpixel sensor. The pixel shift mode takes a sequence of 4 shots between which the sensor is shifted by 1 pixel. These are subsequently merged into an image sporting 'super-high-resolution images with far more truthful color reproduction and much finer details'. Pentax K-3 III: contains a 25.7 Mpixel sensor. The pixel shift mode takes a sequence of 4 shots between which the sensor is shifted by 1 pixel. These are subsequently merged into an image sporting 'a cancelling out of the Bayer pattern and removal of the need for sharpness-sapping demosaicing'. Pentax K-1: contains a 36.4 Mpixel sensor. The pixel shift mode takes a sequence of 4 shots between which the sensor is shifted by 1 pixel. These are subsequently merged into an image sporting 'improved detail and colour resolution'. Pentax K-1 II: contains a 36.4 Mpixel sensor. The camera sports two pixel shift mode: (a) a series of 4 tripod-stabilised images shifted by 1 pixel each are subsequently combined into a 47.3 Mpixel image, (b) a series of images taken in handheld mode are combined into a 47.3 Mpixel image that is, within limits, able to cope even with moving subjects. === Sony === Sony a6600: contains a 24.3 Mpixel sensor. The pixel shift mode takes a sequence of 4 shots between which the sensor is shifted by 1 pixel. These are subsequently merged into an image sporting 'all color data in each pixel to deliver super-high-resolution images'. Sony α7R III: contains a 42.4 Mpixel sensor. The pixel shift mode takes a sequence of 4 shots between which the sensor is shifted by 1 pixel. These are subsequently merged into a 42.4 Mpixel image with improved tonal resolution. Sony α7R IV: contains a 61 Mpixel sensor. The camera has two pixel shift modes, (a) the first takes a sequence of 4 shots between which the sensor is shifted by 1 pixel. These are subsequently merged into a 61 Mpixel image with improved tonal resolution, (b) the other takes a sequence of 16 shots between which the sensor is shifted by 0.5 pixel. These are subsequently merged into a 240 Mpixel image with both enhanced detail and improved tonal resolution. Sony α1: contains a 50 Mpixel sensor. The camera has two pixel shift modes, (a) the first takes a sequence of 4 shots between which the sensor is shifted by 1 pixel. These are subsequently merged into a 50 Mpixel image with improved tonal resolution, (b) the other takes a sequence of 16 shots between which the sensor is shifted by 0.5 pixel. These are subsequently merged into a 200 Mpixel image with both enhanced detail and improved tonal resolution. === Hasselblad === Hasselblad H3DII: the model H3DII-39 sports a 39 Mpixel sensor, the model H3DII-50 a 50 Mpixel sensor. Both enable a pixel shift mode which takes a sequence of 4 shots between which the sensor is shifted by 1 pixel. These are subsequently merged into a single image. Hasselblad H4D series: the model H4D-200MS contains a 50 Mpixel sensor. The sensor sports 3 different pixel shift modes which take (a) a sequence of 6 shots taken at slight offsets, (b) a sequence of 4 shots between which the sensor is shifted by 1 pixel, (c) a sequence of 4 shots between which the sensor is shifted by 0.5 pixels. Images obtained by all three modes are subsequently merged into 200 Mpixel images. Hasselblad H5D series: both models H5D-50c MS and H5D-200c MS contain a 50 Mpixel sensor. This sensor sports 2 different pixel shift modes which take (a) a sequence of 6 shots with full and half pixel moveme

    Read more →
  • OCR-A

    OCR-A

    OCR-A is a font issued in 1966 and first implemented in 1968. A special font was needed in the early days of computer optical character recognition, when there was a need for a font that could be recognized not only by the computers of that day, but also by humans. OCR-A uses simple, thick strokes to form recognizable characters. The font is monospaced (fixed-width), with the printer required to place glyphs 0.254 cm (0.10 inch) apart, and the reader required to accept any spacing between 0.2286 cm (0.09 inch) and 0.4572 cm (0.18 inch). == Standardization == The OCR-A font was standardized by the American National Standards Institute (ANSI) as ANSI X3.17-1981. X3.4 has since become the INCITS and the OCR-A standard is now called ISO 1073-1:1976. == Implementations == In 1968, American Type Founders produced OCR-A, one of the first optical character recognition typefaces to meet the criteria set by the U.S. Bureau of Standards. The design is simple so that it can be easily read by a machine, but it is more difficult for the human eye to read. As metal type gave way to computer-based typesetting, Tor Lillqvist used Metafont to describe the OCR-A font. That definition was subsequently improved by Richard B. Wales. Their work is available from CTAN. To make the free version of the font more accessible to users of Microsoft Windows, John Sauter converted the Metafont definitions to TrueType using potrace and FontForge in 2004. In 2007, Gürkan Sengün created a Debian package from this implementation. In 2008. Luc Devroye corrected the vertical positioning in John Sauter's implementation, and fixed the name of lower case z. Independently, Matthew Skala used mftrace to convert the Metafont definitions to TrueType format in 2006. In 2011 he released a new version created by rewriting the Metafont definitions to work with METATYPE1, generating outlines directly without an intermediate tracing step. On September 27, 2012, he updated his implementation to version 0.2. In addition to these free implementations of OCR-A, there are also implementations sold by several vendors. As a joke, Tobias Frere-Jones in 1995 created Estupido-Espezial, a redesign with swashes and a long s. It was used in a "technology"-themed section of Rolling Stone. Maxitype designed the OCR-X typeface—based on the OCR-A typeface with OpenType features, alien/technology-themed dingbats and available in six weights (Thin, Light, Regular, Medium, Bold, Black). Japanese typeface foundry Visual Design Laboratory (VDL) designed two typefaces based on the OCR-A typeface: one for Simplified Chinese characters named Jieyouti and one for Japanese characters named Yota G (ヨタG) , both available in five weights (Light, Regular, Medium, Semi Bold, Bold). == Use == Although optical character recognition technology has advanced to the point where such simple fonts are no longer necessary, the OCR-A font has remained in use. Its usage remains widespread in the encoding of checks around the world. Some lock box companies still insist that the account number and amount owed on a bill return form be printed in OCR-A. Also, because of its unusual look, it is sometimes used in advertising and display graphics. Notably, it is used for the subtitles in films and television series such as Blacklist and for the main titles in The Pretender. Additionally, OCR-A is used in the titles and subtitles for the films 13 Hours: The Secret Soldiers of Benghazi and Hoppers (film). It was also used for the logo, branding, and marketing material of the children's toy line Hexbug. == Code points == A font is a set of character shapes, or glyphs. For a computer to use a font, each glyph must be assigned a code point in a character set. When OCR-A was being standardized the usual character coding was the American Standard Code for Information Interchange or ASCII. Not all of the glyphs of OCR-A fit into ASCII, and for five of the characters there were alternate glyphs, which might have suggested the need for a second font. However, for convenience and efficiency all of the glyphs were expected to be accessible in a single font using ASCII coding, with the additional characters placed at coding points that would otherwise have been unused. The modern descendant of ASCII is Unicode, also known as ISO 10646. Unicode contains ASCII and has special provisions for OCR characters, so some implementations of OCR-A have looked to Unicode for guidance on character code assignments. === Pre-Unicode standard representation === The ISO standard ISO 2033:1983, and the corresponding Japanese Industrial Standard JIS X 9010:1984 (originally JIS C 6229–1984), define character encodings for OCR-A, OCR-B and E-13B. For OCR-A, they define a modified 7-bit ASCII set (also known by its ISO-IR number ISO-IR-91) including only uppercase letters, digits, a subset of the punctuation and symbols, and some additional symbols. Codes which are redefined relative to ASCII, as opposed to simply omitted, are listed below: Additionally, the long vertical mark () is encoded at 0x7C, corresponding to the ASCII vertical bar (|). === Dedicated OCR-A characters in Unicode === The following characters have been defined for control purposes and are now in the "Optical Character Recognition" Unicode range 2440–245F: === Space, digits, and unaccented letters === All implementations of OCR-A use U+0020 for space, U+0030 through U+0039 for the decimal digits, U+0041 through U+005A for the unaccented upper case letters, and U+0061 through U+007A for the unaccented lower case letters. === Regular characters === In addition to the digits and unaccented letters, many of the characters of OCR-A have obvious code points in ASCII. Of those that do not, most, including all of OCR-A's accented letters, have obvious code points in Unicode. === Remaining characters === Linotype coded the remaining characters of OCR-A as follows: === Additional characters === The fonts that descend from the work of Tor Lillqvist and Richard B. Wales define four characters not in OCR-A to fill out the ASCII character set. These shapes use the same style as the OCR-A character shapes. They are: Linotype also defines additional characters. === Exceptions === Some implementations do not use the above code point assignments for some characters. ==== PrecisionID ==== The PrecisionID implementation of OCR-A has the following non-standard code points: OCR Hook at U+007E OCR Chair at U+00C1 OCR Fork at U+00C2 Euro Sign at U+0080 ==== Barcodesoft ==== The Barcodesoft implementation of OCR-A has the following non-standard code points: OCR Hook at U+0060 OCR Chair at U+007E OCR Fork at U+005F Long Vertical Mark at U+007C (agrees with Linotype) Character Erase at U+0008 ==== Morovia ==== The Morovia implementation of OCR-A has the following non-standard code points: OCR Hook at U+007E (agrees with PrecisionID) OCR Chair at U+00F0 OCR Fork at U+005F (agrees with Barcodesoft) Long Vertical Mark at U+007C (agrees with Linotype) ==== IDAutomation ==== The IDAutomation implementation of OCR-A has the following non-standard code points: OCR Hook at U+007E (agrees with PrecisionID) OCR Chair at U+00C1 (agrees with PrecisionID) OCR Fork at U+00C2 (agrees with PrecisionID) OCR Belt Buckle at U+00C3 == Sellers of font standards == Hardcopy of ISO 1073-1:1976, distributed through ANSI, from Amazon.com ISO 1073-1 is also available from Techstreet, who distributes standards for ANSI and ISO

    Read more →
  • The Best Free AI Humanizer for Beginners

    The Best Free AI Humanizer for Beginners

    Comparing the best AI humanizer? An AI humanizer is software that uses machine learning to help you get more done — it lowers the barrier so anyone can produce professional output. Privacy matters too: check whether your data trains the model and whether a no-log or enterprise tier is available. Whether you are a beginner or a pro, the right AI humanizer slots into your workflow and pays for itself fast. Below we compare features, pricing, and real output so you can choose with confidence.

    Read more →
  • András Kornai

    András Kornai

    András Kornai (born 1957 in Budapest) is a mathematical linguist. == Education == Kornai is the son of economist János Kornai. He earned two PhDs with the first being in mathematics in 1983 from Eötvös Loránd University in Budapest, where his advisor was Miklós Ajtai. His second was in linguistics in 1991 from Stanford University, where his advisor was Paul Kiparsky. == Career == He is a professor in the Department of Algebra at the Budapest Institute of Technology, where he works on an open source Hungarian morphological analyzer. He was Chief Scientist at MetaCarta, where he worked on information extraction before the company was acquired by Nokia. Prior to MetaCarta, he was Chief Scientist at Northern Light. He is on the board of the journal Grammars and YourAmigo PLC. His research interests include all mathematical aspects of natural language processing, speech recognition, and OCR. As area editor he was responsible for the Mathematical Linguistics area of the Oxford International Encyclopedia of Linguistics, and his joint work with Geoffrey Pullum, "The X-bar Theory of Phrase Structure", formally reconstructed that then-popular linguistic theory. == Awards and honors == 2009: ACM Distinguished Member == Monographs == Semantics. Springer Nature, 2020. ISBN 978-3-319-65644-1 Mathematical Linguistics. Springer Verlag, in the series Advanced Information and Knowledge Processing, November 2007. ISBN 978-1-84628-985-9 Hardbound, approximately 300 pages. See description. Formal Phonology. In the series Outstanding Dissertations in Linguistics, Garland Publishing, 1994, ISBN 0-8153-1730-1, hardbound, 240 pages Contents, Preface, Introduction (20 pages) On Hungarian Morphology. In the series Linguistica, Hungarian Academy of Sciences, 1994, ISBN 963-8461-73-X, paperbound, 174 pages Contents, Preface, Introduction (10 pages) == Books edited == Oxford International Encyclopedia of Linguistics (Mathematical Linguistics Area Editor under Editor in Chief William Frawley). 4 volumes, Oxford University Press, 2003, ISBN 978-0-19-513977-8. Proceedings of the HLT-NAACL Workshop on the Analysis of Geographic References. Jointly with Beth Sundheim. Association for Computational Linguistics, 2003, ISBN 1-932432-04-3 (WS9), paperbound, vi+81 pages. See related material. Extended Finite State Models of Language (editor). In the series Studies in Natural Language Processing, Cambridge University Press, 1999, ISBN 0-521-63198-X, hardbound, x+278 pages Contents, Introduction (7 pages). == Selected papers == Digital Language Death. PLoS ONE 8(10): e77056, 2012. [1] Hunmorph: open source word analysis (Jointly with V. Tron, Gy. Gyepesi, P. Halacsy, L. Nemeth, and D. Varga). In Proc. ACL 2005 Software Workshop 77-85 [2] Leveraging the open source ispell codebase for minority language analysis (Jointly with P. Halacsy, L. Nemeth, A. Rung, I. Szakadat, and V. Tron). In J. Carson-Berndsen (ed): Proc. SALTMIL 2004 56-59 [3] Explicit Finitism, International Journal of Theoretical Physics 2003/2 301-307 [4] Mathematical Linguistics (Jointly with G.K. Pullum) In W. Frawley (ed): Oxford International Encyclopedia of Linguistics, Oxford University Press 2003, v3 17-20 [5] Optical Character Recognition, In W. Frawley (ed): Oxford International Encyclopedia of Linguistics, Oxford University Press 2003, v3 33-34 [6] How many words are there? Glottometrics 2002/4 61-86 [7] Zipf's law outside the middle range Proc. Sixth Meeting on Mathematics of Language University of Central Florida, 1999 347-356 [8] A Robust, Language-Independent OCR System. (Jointly with Z. Lu, I. Bazzi, J. Makhoul, P. Natarajan, and R. Schwartz) In: Robert J. Mericsko (ed): Proc. 27th AIPR Workshop: Advances in Computer-Assisted Recognition SPIE Proceedings 3584 1999 [9] Quantitative Comparison of Languages. Grammars 1998/2 155-165 [10] The generative power of feature geometry. Annals of Mathematics and Artificial Intelligence 8 1993 37-46 [11] The X-bar Theory of Phrase Structure. (Jointly with G.K. Pullum) Language 66 1990 24-50 [12]

    Read more →
  • ReRites

    ReRites

    ReRites (also known as RERITES, ReadingRites, Big Data Poetry) is a literary work of "Human + A.I. poetry" by David Jhave Johnston that used neural network models trained to generate poetry which the author then edited. ReRites won the Robert Coover Award for a Work of Electronic Literature in 2022. == About the project == The ReRites project began as a daily rite of writing with a neural network, expanded into a series of performances from which video documentation has been published online, and concluded with a set of 12 books and an accompanying book of essays published by Anteism Books in 2019. In Electronic Literature, Scott Rettberg describes the early phases of the project in 2016, when it bore the preliminary name Big Data Poetry. Jhave (the artist name that David Jhave Johnston goes by) describes the process of writing ReRites as a rite: "Every morning for 2 hours (normally 6:30–8:30am) I get up and edit the poetic output of a neural net. Deleting, weaving, conjugating, lineating, cohering. Re-writing. Re-wiring authorship: hybrid augmented enhanced evolutionary". There is video documentation of the writing process. The human editing of the neural network's output is fundamental to this project, and Jhave gives examples of both unedited text extracts and his edited versions in publications about the project. Kyle Booten describes ReRites as "simultaneously dusty and outrageously verdant, monotonously sublime and speckled with beautiful and rare specimens". === Performances === ReRites was first shared with an audience through a series of performances where audience members and poets would participate in reading the automatically generated texts, which appeared on screen so fast that human readers could barely keep up. This has been described as allowing participants to "re-discover[..] the peculiar pleasures of being embodied", or, in Jhave's own words, as a space where human participants were "playing their wits and voices against an evocative infinite deep-learning muse". The first performance was at Brown University's Interrupt Festival in 2019. It has been performed many times since, including at the Barbican Centre in London and Anteism Books. === Print publications === For a single year Jhave published one book of poetry from the ReRites project each month. These twelve volumes are accompanied by a book of essays, all published by Anteism Books. The accompanying essays provide critical responses to the project from poets and scholars including Allison Parrish, Johanna Drucker, Kyle Booten, Stephanie Strickland, John Cayley, Lai-Tze Fan, Nick Montfort, Mairéad Byrne, and Chris Funkhouser. Allison Parrish notes elsewhere that these paratexts to ReRites serve a legitimising function for a genre of poetry that is not yet institutionally acknowledged. === Technical details === Starting in 2016 under the name Big Data Poetry, Jhave generated poems using, in his own words, "neural network code (..) adapted from three corporate github-hosted machine-learning libraries: TensorFlow (Google), PyTorch (Facebook), and AWD-LSTM (SalesForce)". He explains that the "models were trained on a customised corpus of 600,000 lines of poetry ranging from the romantic epoch to the 20th century avant garde". Jhave maintains a GitHub repository with some of the code supporting ReRites. == Reception == ReRites is described by John Cayley as "one of the most thorough and beautiful" poetic responses to machine learning. The work's influence on the field of electronic literature was acknowledged in 2022, when the work won the Electronic Literature Organization's Robert Coover Award for a Work of Electronic Literature. The jury described ReRites as particularly poignant in the time of the pandemic, as it was "a documentation of the performance of the private ritual of writing and the obsessive-compulsive need for writers to communicate — even when no one else is reading". The question of authorship and voice in ReRites has been raised by several critics. Although generated poetry is an established genre in electronic literature, Cayley notes that unlike the combinatory poems created by authors like Nick Montfort, where the author explicitly defines which words and phrases will be recombined, ReRites has "not been directed by literary preconceptions inscribed in the program itself, but only by patterns and rhythms pre-existing in the corpora". In an essay for the Australian journal TEXT, David Thomas Henry Wright asks how to understand authorship and authority in ReRites: "Who or what is the authority of the work? The original data fed into the machine, that is not currently retrievable or discernible from the final works? The code that was taken and adapted for his purposes? Or Jhave, the human editor?" Wright concludes that Jhave is the only actor with any intentionality and therefore the authority of the work. The centrality of the human editor is also emphasised by other scholars. In a chapter analysing ReRites Malthe Stavning Erslev argues that the machine learning misrepresents the dataset it is trained on. While ReRites uses 21st century neural networks, it has been compared to earlier literary traditions. Poet Victoria Stanton, who read at one of the ReRites performances, has compared ReRites to found poetry, while David Thomas Henry Wright compares it to the Oulipo movement and Mark Amerika to the cut-up technique. Scholars also position ReRites firmly within the long tradition of generative poetry both in electronic literature and print, stretching from the I Ching, Queneau's Cent Mille Milliards de Poemes and Nabokov's Pale Fire to computer-generated poems like Christopher Strachey's Love Letter Generator (1952) and more contemporary examples. Jhave describes the process of working with the output from the neural network as "carving". In his book My Life as an Artificial Creative Intelligence, Mark Amerika writes that the "method of carving the digital outputs provided by the language model as part of a collaborative remix jam session with GPT-2, where the language artist and the language model play off each other’s unexpected outputs as if caught in a live postproduction set, is one I share with electronic literature composer David Jhave Johnston, whose AI poetry experiments precede my own investigations."

    Read more →
  • Collocation

    Collocation

    In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, a collocation is a type of compositional phraseme, meaning that it can be understood from the words that make it up. This contrasts with an idiom, where the meaning of the whole cannot be inferred from its parts, and may be completely unrelated. There are about seven main types of collocations: adjective + noun, noun + noun (such as collective nouns), noun + verb, verb + noun, adverb + adjective, verbs + prepositional phrase (phrasal verbs), and verb + adverb. Collocation extraction is a computational technique that finds collocations in a document or corpus, using various computational linguistics elements resembling data mining. == Expanded definition == Collocations are partly or fully fixed expressions that become established through repeated context-dependent use. Such terms as crystal clear, middle management, nuclear family, and cosmetic surgery are examples of collocated pairs of words. Collocations can be in a syntactic relation (such as verb–object: make and decision), lexical relation (such as antonymy), or they can be in no linguistically defined relation. Knowledge of collocations is vital for the competent use of a language: a grammatically correct sentence will stand out as awkward if collocational preferences are violated. This makes collocation a common focus for language teaching. Corpus linguists specify a key word in context (KWIC) and identify the words immediately surrounding them, to illustrate the way words are used in practice. The processing of collocations involves a number of parameters, the most important of which is the measure of association, which evaluates whether the co-occurrence is purely by chance or statistically significant. Due to the non-random nature of language, most collocations are classed as significant, and the association scores are simply used to rank the results. Commonly used measures of association include mutual information, t scores, and log-likelihood. Rather than select a single definition, Gledhill proposes that collocation involves at least three different perspectives: co-occurrence, a statistical view, which sees collocation as the recurrent appearance in a text of a node and its collocates; construction, which sees collocation either as a correlation between a lexeme and a lexical-grammatical pattern, or as a relation between a base and its collocative partners; and expression, a pragmatic view of collocation as a conventional unit of expression, regardless of form. These different perspectives contrast with the usual way of presenting collocation in phraseological studies. Traditionally speaking, collocation is explained in terms of all three perspectives at once, in a continuum: == In dictionaries == In 1933, Harold Palmer's Second Interim Report on English Collocations highlighted the importance of collocation as a key to producing natural-sounding language, for anyone learning a foreign language. Thus from the 1940s onwards, information about recurrent word combinations became a standard feature of monolingual learner's dictionaries. As these dictionaries became "less word-centred and more phrase-centred", more attention was paid to collocation. This trend was supported, from the beginning of the 21st century, by the availability of large text corpora and intelligent corpus-querying software, making it possible to provide a more systematic account of collocation in dictionaries. Using these tools, dictionaries such as the Macmillan English Dictionary and the Longman Dictionary of Contemporary English included boxes or panels with lists of frequent collocations. There are also a number of specialized dictionaries devoted to describing the frequent collocations in a language. These include (for Spanish) Redes: Diccionario combinatorio del español contemporaneo (2004), (for French) Le Robert: Dictionnaire des combinaisons de mots (2007), and (for English) the LTP Dictionary of Selected Collocations (1997) and the Macmillan Collocations Dictionary (2010). == Statistically significant collocation == Student's t-test can be used to determine whether the occurrence of a collocation in a corpus is statistically significant. For a bigram w 1 w 2 {\displaystyle w_{1}w_{2}} , let P ( w 1 ) = # w 1 N {\displaystyle P(w_{1})={\frac {\#w_{1}}{N}}} be the unconditional probability of occurrence of w 1 {\displaystyle w_{1}} in a corpus with size N {\displaystyle N} , and let P ( w 2 ) = # w 2 N {\displaystyle P(w_{2})={\frac {\#w_{2}}{N}}} be the unconditional probability of occurrence of w 2 {\displaystyle w_{2}} in the corpus. The t-score for the bigram w 1 w 2 {\displaystyle w_{1}w_{2}} is calculated as: where x ¯ = # w i w j N {\displaystyle {\bar {x}}={\frac {\#w_{i}w_{j}}{N}}} is the sample mean of the occurrence of w 1 w 2 {\displaystyle w_{1}w_{2}} , # w 1 w 2 {\displaystyle \#w_{1}w_{2}} is the number of occurrences of w 1 w 2 {\displaystyle w_{1}w_{2}} , μ = P ( w i ) P ( w j ) {\displaystyle \mu =P(w_{i})P(w_{j})} is the probability of w 1 w 2 {\displaystyle w_{1}w_{2}} under the null-hypothesis that w 1 {\displaystyle w_{1}} and w 2 {\displaystyle w_{2}} appear independently in the text, and s 2 = x ¯ ( 1 − x ¯ ) ≈ x ¯ {\displaystyle s^{2}={\bar {x}}(1-{\bar {x}})\approx {\bar {x}}} is the sample variance. With a large N {\displaystyle N} , the t-test is equivalent to a Z-test.

    Read more →
  • How to Choose an AI Humanizer

    How to Choose an AI Humanizer

    In search of the best AI humanizer? An AI humanizer is software that uses machine learning to help you get more done — it turns a rough idea into a polished result in seconds. When choosing one, weigh output quality, pricing, export formats, and how well it fits the tools you already use. Whether you are a beginner or a pro, the right AI humanizer slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.

    Read more →
  • The Best Free Conversational AI Platform for Beginners

    The Best Free Conversational AI Platform for Beginners

    Curious about the best conversational AI platform? An conversational AI platform is software that uses machine learning to help you get more done — it combines speed, accuracy, and an interface that just works. Hands-on testing shows real-world results vary, so a short free trial is the smartest way to decide. Whether you are a beginner or a pro, the right conversational AI platform slots into your workflow and pays for itself fast. Read on for hands-on impressions, pricing tiers, and the standout features that matter.

    Read more →
  • Retrieval-augmented generation

    Retrieval-augmented generation

    Retrieval-augmented generation (RAG) is a technique that enables large language models (LLMs) to retrieve and incorporate new information from external data sources. With RAG, LLMs first refer to a specified set of documents, then respond to user queries. These documents supplement information from the LLM's pre-existing training data. This allows LLMs to use domain-specific and/or updated information that is not available in the training data. For example, this enables LLM-based chatbots to access internal company data or generate responses based on authoritative sources. RAG improves LLMs by incorporating information retrieval before generating responses. Unlike LLMs that rely on static training data, RAG pulls relevant text from databases, uploaded documents, or web sources. According to Ars Technica, "RAG is a way of improving LLM performance, in essence by blending the LLM process with a web search or other document look-up process to help LLMs stick to the facts." This method helps reduce AI hallucinations, which have caused chatbots to describe policies that don't exist, or recommend nonexistent legal cases to lawyers that are looking for citations to support their arguments. RAG also reduces the need to retrain LLMs with new data, saving on computational and financial costs. Beyond efficiency gains, RAG also allows LLMs to include sources in their responses, so users can verify the cited sources. This provides greater transparency, as users can cross-check retrieved content to ensure accuracy and relevance. The term retrieval-augmented generation (RAG) was introduced in a 2020 paper that described combining a parametric language model with a non-parametric external memory accessed through retrieval at inference time. == RAG and LLM limitations == LLMs can provide incorrect information. For example, when Google first demonstrated its LLM tool "Google Bard" (later re-branded to Gemini), the LLM provided incorrect information about the James Webb Space Telescope. This error contributed to a $100 billion decline in Google's stock value. RAG is used to prevent these errors, but it does not solve all the problems. For example, LLMs can generate misinformation even when pulling from factually correct sources if they misinterpret the context. MIT Technology Review gives the example of an AI-generated response stating, "The United States has had one Muslim president, Barack Hussein Obama." The model retrieved this from an academic book rhetorically titled Barack Hussein Obama: America's First Muslim President? The LLM did not "know" or "understand" the context of the title, generating a false statement. LLMs with RAG are programmed to prioritize new information. This technique has been called "prompt stuffing." Without prompt stuffing, the LLM's input is generated by a user; with prompt stuffing, additional relevant context is added to this input to guide the model's response. This approach provides the LLM with key information early in the prompt, encouraging it to prioritize the supplied data over pre-existing training knowledge. == Process == Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating an information-retrieval mechanism that allows models to access and utilize additional data beyond their original training set. Ars Technica notes that "when new information becomes available, rather than having to retrain the model, all that's needed is to augment the model's external knowledge base with the updated information" ("augmentation"). IBM states that "in the generative phase, the LLM draws from the augmented prompt and its internal representation of its training data to synthesize" an answer. === RAG key stages === Typically, the data to be referenced is converted into LLM embeddings, numerical representations in the form of a large vector space. RAG can be used on unstructured (usually text), semi-structured, or structured data (for example knowledge graphs). These embeddings are then stored in a vector database to allow for document retrieval. Given a user query, a document retriever is first called to select the most relevant documents that will be used to augment the query. This comparison can be done using a variety of methods, which depend in part on the type of indexing used. The model feeds this relevant retrieved information into the LLM via prompt engineering of the user's original query. Newer implementations (as of 2023) can also incorporate specific augmentation modules with abilities such as expanding queries into multiple domains and using memory and self-improvement to learn from previous retrievals. Finally, the LLM can generate output based on both the query and the retrieved documents. Some models incorporate extra steps to improve output, such as the re-ranking of retrieved information, context selection, and fine-tuning. == Applications == Retrieval-augmented generation is used in applications where generated responses need to be grounded in external or frequently updated information. Commonly cited use cases include search engines, question-answering systems, customer support chatbots, enterprise knowledge assistants, content generation, recommendation systems, retail and e-commerce, and industrial or manufacturing workflows. In healthcare, RAG has been studied as a way to ground large language model outputs in external medical knowledge sources, although reviews have noted continuing challenges around evaluation, ethics, and clinical reliability. == Improvements == Improvements to the basic process above can be applied at different stages in the RAG flow. === Encoder === These methods focus on the encoding of text as either dense or sparse vectors. Sparse vectors, which encode the identity of a word, are typically dictionary-length and contain mostly zeros. Dense vectors, which encode meaning, are more compact and contain fewer zeros. Various enhancements can improve the way similarities are calculated in the vector stores (databases). Performance improves by optimizing how vector similarities are calculated. Dot products enhance similarity scoring, while approximate nearest neighbor (ANN) searches improve retrieval efficiency over K-nearest neighbors (KNN) searches. Accuracy may be improved with Late Interactions, which allow the system to compare words more precisely after retrieval. This helps refine document ranking and improve search relevance. Hybrid vector approaches may be used to combine dense vector representations with sparse one-hot vectors, taking advantage of the computational efficiency of sparse dot products over dense vector operations. Other retrieval techniques focus on improving accuracy by refining how documents are selected. Some retrieval methods combine sparse representations, such as SPLADE, with query expansion strategies to improve search accuracy and recall. === Retriever-centric methods === These methods aim to enhance the quality of document retrieval in vector databases: Pre-training the retriever using the Inverse Cloze Task (ICT), a technique that helps the model learn retrieval patterns by predicting masked text within documents. Supervised retriever optimization aligns retrieval probabilities with the generator model's likelihood distribution. This involves retrieving the top-k vectors for a given prompt, scoring the generated response's perplexity, and minimizing KL divergence between the retriever's selections and the model's likelihoods to refine retrieval. Reranking techniques can refine retriever performance by prioritizing the most relevant retrieved documents during training. === Language model === By redesigning the language model with the retriever in mind, a 25-time smaller network can get comparable perplexity as its much larger counterparts. Because it is trained from scratch, this method (Retro) incurs the high cost of training runs that the original RAG scheme avoided. The hypothesis is that by giving domain knowledge during training, Retro needs less focus on the domain and can devote its smaller weight resources only to language semantics. The redesigned language model is shown here. It has been reported that Retro is not reproducible, so modifications were made to make it so. The more reproducible version is called Retro++ and includes in-context RAG. === Chunking === Chunking involves various strategies for breaking up the data into vectors so the retriever can find details in it. Three types of chunking strategies are: Fixed length with overlap. This is fast and easy. Overlapping consecutive chunks helps to maintain semantic context across chunks. Syntax-based chunks can break the document up into sentences. Libraries such as spaCy or NLTK can also help. File format-based chunking. Certain file types have natural chunks built in, and it's best to respect them. For example, code files are best chunked and vectorized as whole functions or classes. HTML files should leave

    or base64 encoded elements

    Read more →
  • Baum–Welch algorithm

    Baum–Welch algorithm

    In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step. The Baum–Welch algorithm, the primary method for inference in hidden Markov models, is numerically unstable due to its recursive calculation of joint probabilities. As the number of variables grows, these joint probabilities become increasingly small, leading to the forward recursions rapidly approaching values below machine precision. == History == The Baum–Welch algorithm was named after its inventors Leonard E. Baum and Lloyd R. Welch. The algorithm and the Hidden Markov models were first described in a series of articles by Baum and his peers at the IDA Center for Communications Research, Princeton in the late 1960s and early 1970s. One of the first major applications of HMMs was to the field of speech processing. In the 1980s, HMMs were emerging as a useful tool in the analysis of biological systems and information, and in particular genetic information. They have since become an important tool in the probabilistic modeling of genomic sequences. == Description == A hidden Markov model describes the joint probability of a collection of "hidden" and observed discrete random variables. It relies on the assumption that the i-th hidden variable given the (i − 1)-th hidden variable is independent of previous hidden variables, and the current observation variables depend only on the current hidden state. The Baum–Welch algorithm uses the well known EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observed feature vectors. Let X t {\displaystyle X_{t}} be a discrete hidden random variable with N {\displaystyle N} possible values (i.e. We assume there are N {\displaystyle N} states in total). We assume the P ( X t ∣ X t − 1 ) {\displaystyle P(X_{t}\mid X_{t-1})} is independent of time t {\displaystyle t} , which leads to the definition of the time-independent stochastic transition matrix A = { a i j } = P ( X t = j ∣ X t − 1 = i ) . {\displaystyle A=\{a_{ij}\}=P(X_{t}=j\mid X_{t-1}=i).} The initial state distribution (i.e. when t = 1 {\displaystyle t=1} ) is given by π i = P ( X 1 = i ) . {\displaystyle \pi _{i}=P(X_{1}=i).} The observation variables Y t {\displaystyle Y_{t}} can take one of K {\displaystyle K} possible values. We also assume the observation given the "hidden" state is time independent. The probability of a certain observation y i {\displaystyle y_{i}} at time t {\displaystyle t} for state X t = j {\displaystyle X_{t}=j} is given by b j ( y i ) = P ( Y t = y i ∣ X t = j ) . {\displaystyle b_{j}(y_{i})=P(Y_{t}=y_{i}\mid X_{t}=j).} Taking into account all the possible values of Y t {\displaystyle Y_{t}} and X t {\displaystyle X_{t}} , we obtain the N × K {\displaystyle N\times K} matrix B = { b j ( y i ) } {\displaystyle B=\{b_{j}(y_{i})\}} where b j {\displaystyle b_{j}} belongs to all the possible states and y i {\displaystyle y_{i}} belongs to all the observations. An observation sequence is given by Y = ( Y 1 = y 1 , Y 2 = y 2 , … , Y T = y T ) {\displaystyle Y=(Y_{1}=y_{1},Y_{2}=y_{2},\ldots ,Y_{T}=y_{T})} . Thus we can describe a hidden Markov chain by θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} . The Baum–Welch algorithm finds a local maximum for θ ∗ = a r g m a x θ ⁡ P ( Y ∣ θ ) {\displaystyle \theta ^{}=\operatorname {arg\,max} _{\theta }P(Y\mid \theta )} (i.e. the HMM parameters θ {\displaystyle \theta } that maximize the probability of the observation). === Algorithm === Set θ = ( A , B , π ) {\displaystyle \theta =(A,B,\pi )} with random initial conditions. They can also be set using prior information about the parameters if it is available; this can speed up the algorithm and also steer it toward the desired local maximum. ==== Forward procedure ==== Let α i ( t ) = P ( Y 1 = y 1 , … , Y t = y t , X t = i ∣ θ ) {\displaystyle \alpha _{i}(t)=P(Y_{1}=y_{1},\ldots ,Y_{t}=y_{t},X_{t}=i\mid \theta )} , the probability of seeing the observations y 1 , y 2 , … , y t {\displaystyle y_{1},y_{2},\ldots ,y_{t}} and being in state i {\displaystyle i} at time t {\displaystyle t} . This is found recursively: α i ( 1 ) = π i b i ( y 1 ) , {\displaystyle \alpha _{i}(1)=\pi _{i}b_{i}(y_{1}),} α i ( t + 1 ) = b i ( y t + 1 ) ∑ j = 1 N α j ( t ) a j i . {\displaystyle \alpha _{i}(t+1)=b_{i}(y_{t+1})\sum _{j=1}^{N}\alpha _{j}(t)a_{ji}.} Since this series converges exponentially to zero, the algorithm will numerically underflow for longer sequences. However, this can be avoided in a slightly modified algorithm by scaling α {\displaystyle \alpha } in the forward and β {\displaystyle \beta } in the backward procedure below. ==== Backward procedure ==== Let β i ( t ) = P ( Y t + 1 = y t + 1 , … , Y T = y T ∣ X t = i , θ ) {\displaystyle \beta _{i}(t)=P(Y_{t+1}=y_{t+1},\ldots ,Y_{T}=y_{T}\mid X_{t}=i,\theta )} that is the probability of the ending partial sequence y t + 1 , … , y T {\displaystyle y_{t+1},\ldots ,y_{T}} given starting state i {\displaystyle i} at time t {\displaystyle t} . We calculate β i ( t ) {\displaystyle \beta _{i}(t)} as, β i ( T ) = 1 , {\displaystyle \beta _{i}(T)=1,} β i ( t ) = ∑ j = 1 N β j ( t + 1 ) a i j b j ( y t + 1 ) . {\displaystyle \beta _{i}(t)=\sum _{j=1}^{N}\beta _{j}(t+1)a_{ij}b_{j}(y_{t+1}).} ==== Update ==== We can now calculate the temporary variables, according to Bayes' theorem: γ i ( t ) = P ( X t = i ∣ Y , θ ) = P ( X t = i , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) β i ( t ) ∑ j = 1 N α j ( t ) β j ( t ) , {\displaystyle \gamma _{i}(t)=P(X_{t}=i\mid Y,\theta )={\frac {P(X_{t}=i,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)\beta _{i}(t)}{\sum _{j=1}^{N}\alpha _{j}(t)\beta _{j}(t)}},} which is the probability of being in state i {\displaystyle i} at time t {\displaystyle t} given the observed sequence Y {\displaystyle Y} and the parameters θ {\displaystyle \theta } ξ i j ( t ) = P ( X t = i , X t + 1 = j ∣ Y , θ ) = P ( X t = i , X t + 1 = j , Y ∣ θ ) P ( Y ∣ θ ) = α i ( t ) a i j β j ( t + 1 ) b j ( y t + 1 ) ∑ k = 1 N ∑ w = 1 N α k ( t ) a k w β w ( t + 1 ) b w ( y t + 1 ) , {\displaystyle \xi _{ij}(t)=P(X_{t}=i,X_{t+1}=j\mid Y,\theta )={\frac {P(X_{t}=i,X_{t+1}=j,Y\mid \theta )}{P(Y\mid \theta )}}={\frac {\alpha _{i}(t)a_{ij}\beta _{j}(t+1)b_{j}(y_{t+1})}{\sum _{k=1}^{N}\sum _{w=1}^{N}\alpha _{k}(t)a_{kw}\beta _{w}(t+1)b_{w}(y_{t+1})}},} which is the probability of being in state i {\displaystyle i} and j {\displaystyle j} at times t {\displaystyle t} and t + 1 {\displaystyle t+1} respectively given the observed sequence Y {\displaystyle Y} and parameters θ {\displaystyle \theta } . The denominators of γ i ( t ) {\displaystyle \gamma _{i}(t)} and ξ i j ( t ) {\displaystyle \xi _{ij}(t)} are the same ; they represent the probability of making the observation Y {\displaystyle Y} given the parameters θ {\displaystyle \theta } . The parameters of the hidden Markov model θ {\displaystyle \theta } can now be updated: π i ∗ = γ i ( 1 ) , {\displaystyle \pi _{i}^{}=\gamma _{i}(1),} which is the expected frequency spent in state i {\displaystyle i} at time 1 {\displaystyle 1} . a i j ∗ = ∑ t = 1 T − 1 ξ i j ( t ) ∑ t = 1 T − 1 γ i ( t ) , {\displaystyle a_{ij}^{}={\frac {\sum _{t=1}^{T-1}\xi _{ij}(t)}{\sum _{t=1}^{T-1}\gamma _{i}(t)}},} which is the expected number of transitions from state i to state j compared to the expected total number of transitions starting in state i, including from state i to itself. The number of transitions starting in state i is equivalent to the number of times state i is observed in the sequence from t = 1 to t = T − 1. b i ∗ ( v k ) = ∑ t = 1 T 1 y t = v k γ i ( t ) ∑ t = 1 T γ i ( t ) , {\displaystyle b_{i}^{}(v_{k})={\frac {\sum _{t=1}^{T}1_{y_{t}=v_{k}}\gamma _{i}(t)}{\sum _{t=1}^{T}\gamma _{i}(t)}},} where 1 y t = v k = { 1 if y t = v k , 0 otherwise {\displaystyle 1_{y_{t}=v_{k}}={\begin{cases}1&{\text{if }}y_{t}=v_{k},\\0&{\text{otherwise}}\end{cases}}} is an indicator function, and b i ∗ ( v k ) {\displaystyle b_{i}^{}(v_{k})} is the expected number of times the output observations have been equal to v k {\displaystyle v_{k}} while in state i {\displaystyle i} over the expected total number of times in state i {\displaystyle i} . These steps are now repeated iteratively until a desired level of convergence. Note: It is possible to over-fit a particular data set. That is, P ( Y ∣ θ final ) > P ( Y ∣ θ true ) {\displaystyle P(Y\mid \theta _{\text{final}})>P(Y\mid \theta _{\text{true}})} . The algorithm also does not guarantee a global maximum. ==== Multiple sequences ==== The algorithm described thus far assumes a single observed sequence Y = y 1 , … , y T {\displaystyle Y=y_{1},\ldots ,y_{T}} . However, in many situations, there are several sequences observed: Y 1 ,

    Read more →
  • Julia Hirschberg

    Julia Hirschberg

    Julia Hirschberg is an American computer scientist noted for her research on computational linguistics and natural language processing. She received her first PhD in history from the University of Michigan and the second from the University of Pennsylvania in computer science doing research in Natural Language Processing. She worked at Bell Labs and AT&T Bell Labs from 1985 to 2002 and from 2002 at Columbia University where she is currently the Percy K. and Vida L. W. Hudson Professor of Computer Science. == Biography == Julia Linn Bell Hirschberg received her first Ph.D. degree in history (16th-century Mexico) from University of Michigan in 1976. She served on the History faculty of Smith College from 1974 to 1982. She subsequently shifted to Computer Science studies, receiving her M.S. in Computer and Information Science from University of Pennsylvania in 1982 and a Ph.D. in Computer and Information Science from University of Pennsylvania in 1985. Upon graduation from University of Pennsylvania in 1985, Hirschberg joined AT&T Bell Labs as a Member of Technical staff in the Linguistics Research Department, where she worked on improving prosody assignment for Text-to-Speech Synthesis (TTS) in the Bell Labs TTS system. She was promoted to Department Head in 1994 when she created a new Human Computer Interface Research Lab. She and her department remained at Bell Labs until 1996 when they moved to AT&T Labs Research as part of a corporate reorganization. In 2002, she joined the Columbia University faculty as a professor in the Department of Computer Science. She served as Chair of the Computer Science Department from 2012 to 2018. She still leads classes at Columbia in speech and natural language research and supervises PhD students and a large number of research project students. == Research == Hirschberg's research has included prosody, discourse structure, conversational implicature, text-to-speech synthesis, speech summarization, spoken dialogue systems, emotional speech, deceptive speech, charismatic speech, entrainment, empathetic speech and code-switching. Hirschberg was among the first to combine Natural Language Processing (NLP) approaches to discourse and dialogue with speech research. She pioneered techniques in text analysis for prosody assignment in Text-to-Speech synthesis at Bell laboratories in the 1980s and 1990s, developing corpus-based statistical models based upon syntactic and discourse information which are in general use today in TTS systems. With Janet Pierrehumbert, she developed a theoretical model of intonational meaning. She was a leader in the development of the ToBI conventions for intonational description, which have been extended to numerous languages and which today are the most widely used standard for intonational annotation. Hirschberg has been a pioneer together with Gregory Ward in much experimental work on intonational sources of language meaning and how these interact with pragmatic phenomena, particularly on the meaning of accent (intonational prominent) items and the meaning of intonational contours. She also has innovated in numerous other areas involving prosody and meaning, including the role of grammatical function and surface position in pitch accent location, the use of prosody in disambiguating cue phrases (discourse markers) with Diane Litman, the role of prosody in disambiguation in English, Italian, and Spanish with Cinzia Avesani and Pilar Prieto, and the automatic identification of speech recognition errors using prosodic information, At AT&T Labs she worked with Fernando Pereira, Steve Whittaker, and others on speech search and developing new interfaces for speech navigation. At Columbia, she and her students have continued and extended research on spoken dialogue systems (automatically detecting speech recognition errors and inappropriate system queries, modeling turn-taking behavior, dialogue entrainment, modeling and generating clarification dialogues); on the automatic classification of trust, charisma, deception and emotion from speech; on speech summarization; prosody translation, hedging behavior in text and speech, text-to-speech synthesis, and speech search in low resource languages. She also holds several patents in TTS and in speech search. Corpora she and collaborators have collected include the Boston Directions Corpus, the Columbia SRI Colorado Deception Corpus, and the Columbia Games Corpus. She has served on numerous technical boards and editorial committees. She has served as a member of the Computing Research Association's (CRA) Board of Directors and as co-chair of CRA-W. She is also noted for her leadership in broadening participation in computing. == Awards == Hirschberg's notable honors and awards include: Elected as a member of the National Academy of Artificial Intelligence Academy of Sciences and recipient of the NAAI Artificial Intelligence Exploration Award, 2025 Elected as a Fellow of Asia-Pacific Artificial Intelligence Association (AAIA), 2024. 2020 ISCA Special Service Medal Honorary Doctorate (eredoctoraat) from Tilburg University, Netherlands, 2018. American Academy of Arts and Sciences, 2018. IEEE Fellow, 2017 National Academy of Engineering, 2017 ACM Fellow in 2015 Elected member, American Philosophical Society, 2014. Honorary member, Association for Laboratory Phonology, 2014. Association for Computational Linguistics (ACL) (Founding) Fellow, 2011. International Speech Communication Association (ISCA) Medal for Scientific Achievement, 2011. IEEE James L. Flanagan Speech and Audio Processing Award, 2011. Honorary Doctorate (Hedersdoktorer), KTH (Royal Institute of Technology) Stockholm, Sweden, 2007. AAAI Fellow, 1994. == Publications == A social history of Puebla de Los Ángeles, 1531-60, 1976 Empirical studies on the disambiguation of cue phrases, 1991 Prosody and conversation, 1998 Most recent publications and other information, https://www.cs.columbia.edu/speech/.

    Read more →