AI Data Warehouse

AI Data Warehouse — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Machine translation software usability

    Machine translation software usability

    The sections below give objective criteria for evaluating the usability of machine translation software output. == Stationarity or canonical form == Do repeated translations converge on a single expression in both languages? I.e. does the translation method show stationarity or produce a canonical form? Does the translation become stationary without losing the original meaning? This metric has been criticized as not being well correlated with BLEU (BiLingual Evaluation Understudy) scores. == Adaptive to colloquialism, argot or slang == Is the system adaptive to colloquialism, argot or slang? The French language has many rules for creating words in the speech and writing of popular culture. Two such rules are: (a) The reverse spelling of words such as femme to meuf. (This is called verlan.) (b) The attachment of the suffix -ard to a noun or verb to form a proper noun. For example, the noun faluche means "student hat". The word faluchard formed from faluche colloquially can mean, depending on context, "a group of students", "a gathering of students" and "behavior typical of a student". The Google translator as of 28 December 2006 doesn't derive the constructed words as for example from rule (b), as shown here: Il y a une chorale falucharde mercredi, venez nombreux, les faluchards chantent des paillardes! ==> There is a choral society falucharde Wednesday, come many, the faluchards sing loose-living women! French argot has three levels of usage: familier or friendly, acceptable among friends, family and peers but not at work grossier or swear words, acceptable among friends and peers but not at work or in family verlan or ghetto slang, acceptable among lower classes but not among middle or upper classes The United States National Institute of Standards and Technology conducts annual evaluations [1] Archived 2009-03-22 at the Wayback Machine of machine translation systems based on the BLEU-4 criterion [2]. A combined method called IQmt which incorporates BLEU and additional metrics NIST, GTM, ROUGE and METEOR has been implemented by Gimenez and Amigo [3]. == Well-formed output == Is the output grammatical or well-formed in the target language? Using an interlingua should be helpful in this regard, because with a fixed interlingua one should be able to write a grammatical mapping to the target language from the interlingua. Consider the following Arabic language input and English language translation result from the Google translator as of 27 December 2006 [4]. This Google translator output doesn't parse using a reasonable English grammar: وعن حوادث التدافع عند شعيرة رمي الجمرات -التي كثيرا ما يسقط فيها العديد من الضحايا- أشار الأمير نايف إلى إدخال "تحسينات كثيرة في جسر الجمرات ستمنع بإذن الله حدوث أي تزاحم". ==> And incidents at the push Carbuncles-throwing ritual, which often fall where many of the victims - Prince Nayef pointed to the introduction of "many improvements in bridge Carbuncles God would stop the occurrence of any competing." == Semantics preservation == Do repeated re-translations preserve the semantics of the original sentence? For example, consider the following English input passed multiple times into and out of French using the Google translator as of 27 December 2006: Better a day earlier than a day late. ==> Améliorer un jour plus tôt qu'un jour tard. ==> To improve one day earlier than a day late. ==> Pour améliorer un jour plus tôt qu'un jour tard. ==> To improve one day earlier than a day late. As noted above and in, this kind of round-trip translation is a very unreliable method of evaluation. == Trustworthiness and security == An interesting peculiarity of Google Translate as of 24 January 2008 (corrected as of 25 January 2008) is the following result when translating from English to Spanish, which shows an embedded joke in the English-Spanish dictionary which has some added poignancy given recent events: Heath Ledger is dead ==> Tom Cruise está muerto This raises the issue of trustworthiness when relying on a machine translation system embedded in a Life-critical system in which the translation system has input to a Safety Critical Decision Making process. Conjointly it raises the issue of whether in a given use the software of the machine translation system is safe from hackers. It is not known whether this feature of Google Translate was the result of a joke/hack or perhaps an unintended consequence of the use of a method such as statistical machine translation. Reporters from CNET Networks asked Google for an explanation on January 24, 2008; Google said only that it was an "internal issue with Google Translate". The mistranslation was the subject of much hilarity and speculation on the Internet. If it is an unintended consequence of the use of a method such as statistical machine translation, and not a joke/hack, then this event is a demonstration of a potential source of critical unreliability in the statistical machine translation method. In human translations, in particular on the part of interpreters, selectivity on the part of the translator in performing a translation is often commented on when one of the two parties being served by the interpreter knows both languages. This leads to the issue of whether a particular translation could be considered verifiable. In this case, a converging round-trip translation would be a kind of verification.

    Read more →
  • Captions (app)

    Captions (app)

    Mirage (formerly known as Captions) is a video-generating, video-editing and AI research company headquartered in New York City. Their first app, Captions, is available on iOS, Android, and Web and offers a suite of tools aimed at streamlining the creation and editing of videos. Their enterprise platform, Mirage Studio, generates AI actors and videos for marketing assets and video campaigns. == History == Mirage was co-founded by Gaurav Misra and Dwight Churchill. During Misra's time leading design engineering at Snap Inc., he followed the rise of a new category of video, the "talking video." In 2021, Misra left Snap to found Mirage with his former colleague Churchill. Later that year, the Captions app launched with early backing from venture capital firms Sequoia Capital and Andreessen Horowitz as well as individual investors. In 2023, the company released Lipdub, an Al dubbing app which translates any video with spoken audio into 28 languages. In October 2023, Captions shared that it maintained over 100,000 daily active users with "about a million" videos being created monthly. In November 2024, Captions acquired AlpacaML, a generative AI company that focused on art and other images. In June 2025, Captions launched Mirage Studio, for marketers and advertising agencies. In September 2025, Captions rebranded their company to Mirage. This change reflects the company's focus on developing their proprietary foundation model and future video products. == Products == The Captions app offers features to automate common production tasks including captioning, editing, dubbing, script creation, and music integration. Mirage Studio allows users to generate AI avatars and create short-form videos from prompts or audio. == Awards == In 2023, the company was recognized as part of Fast Company's "Next Big Things In Tech" series. In 2024, the company won 2 Webby Awards for Best Use of AI & Machine Learning and Creative Production.

    Read more →
  • Non-separable wavelet

    Non-separable wavelet

    Non-separable wavelets are multi-dimensional wavelets that are not directly implemented as tensor products of wavelets on some lower-dimensional space. They have been studied since 1992. They offer a few important advantages. Notably, using non-separable filters leads to more parameters in design, and consequently better filters. The main difference, when compared to the one-dimensional wavelets, is that multi-dimensional sampling requires the use of lattices (e.g., the quincunx lattice). The wavelet filters themselves can be separable or non-separable regardless of the sampling lattice. Thus, in some cases, the non-separable wavelets can be implemented in a separable fashion. Unlike separable wavelet, the non-separable wavelets are capable of detecting structures that are not only horizontal, vertical or diagonal (show less anisotropy). == Examples == Red-black wavelets Contourlets Shearlets Directionlets Steerable pyramids Non-separable schemes for tensor-product wavelets

    Read more →
  • Group of Governmental Experts on Lethal Autonomous Weapons Systems

    Group of Governmental Experts on Lethal Autonomous Weapons Systems

    The Group of Governmental Experts on Lethal Autonomous Weapons Systems, commonly known as the GGE on LAWS, refers to a group of governmental experts established under the framework of the Convention on Certain Conventional Weapons (CCW), a United Nations arms control framework. The group examines legal, ethical, societal and moral questions that arise from the increased use of autonomous robots to carry weapons and to be programmed to engage in combat in various situations that might arise, including battles between countries, or in patrolling border areas or sensitive areas, or other similar roles. As of 18 March 2025, the Convention on Certain Conventional Weapons had 128 High Contracting Parties. In the Geneva Conventions, the term "High Contracting Parties" refers to the states that have joined the conventions and are therefore bound to uphold them. Among the countries that have joined are states with tense relations or ongoing armed conflict with one another, including Russia and Ukraine, Israel and the State of Palestine, and Pakistan and Afghanistan. == Background == In 2013, the Meeting of State Parties to the Convention on Certain Conventional Weapons agreed on a mandate on lethal autonomous weapon systems and tasked its chairperson with convening an informal Meeting of Experts to discuss issues related to emerging technologies in the area of LAWS. Those informal Meetings of Experts were then held in 2014, 2015 and 2016, and their reports fed into subsequent meetings of the High Contracting Parties. At the Fifth CCW Review Conference in 2016, the High Contracting Parties decided to establish an open-ended Group of Governmental Experts on emerging technologies in the area of LAWS, building on the earlier expert meetings. Since then, the group has been reconvened annually. In 2023, the Meeting of the High Contracting Parties to the CCW decided that the GGE on LAWS would continue its work in 2024 and 2025. The group was tasked with developing, by consensus, elements of a possible instrument, without predetermining its form, as well as other measures addressing lethal autonomous weapon systems, drawing on existing CCW protocols, earlier recommendations, state proposals, and legal, military, and technological expertise. == 2024 == In 2024, the GGE met twice, and the group was chaired by Robert in den Bosch, the Netherlands' disarmament ambassador. The 2024 Meeting of the High Contracting Parties decided that the group would meet for 10 days in 2025, in two five-day sessions, and reaffirmed its mandate to continue work by consensus on possible elements of an instrument and other measures addressing lethal autonomous weapon systems. == 2025 == At its first 2025 session, held in Geneva from 3 to 7 March 2025, the Group of Governmental Experts on Lethal Autonomous Weapon Systems discussed revisions to the chair's rolling text. The text was structured into five sections, or "boxes", though delegates held differing views on whether headings were useful or appropriate. Broadly, the discussions covered the characterization of lethal autonomous weapon systems, the application of international humanitarian law, possible prohibitions and regulations, legal review, and questions of accountability and responsibility. At its second session, held from 1 to 5 September 2025, delegations continued work on the chair's rolling text, which set out elements of a possible instrument and was organized into five thematic "boxes". == 2026 == === Developments before the 2026 session === A few weeks before the meeting, autonomous weapons drew renewed attention when the United States pressured Anthropic to revise the terms of use for its AI model Claude. Anthropic prohibited the model's use for mass domestic surveillance and for fully autonomous weapons operating without human oversight, while reports also emerged that OpenAI had reached an agreement with the U.S. Department of War for the use of its AI models, reportedly stipulating that they would not independently direct autonomous weapons where human control was required. The U.S. military nevertheless continued to use Claude during its war on Iran, and there was increasing alarm about the use of AI-assisted semi-autonomous weapons in conflicts including those in Ukraine, Sudan, Gaza, and Iran. Before the start of the sessions, Robert in den Bosch, as chair, warned that progress was urgent because technological developments were moving quickly. At the same time, although states agreed that international humanitarian law applied to LAWS, specific internationally binding standards governing such systems remained largely absent. A key divide before the session was that Russia and the United States opposed new legally binding instruments, while other states argued that new rules were necessary. According to Robert in den Bosch, the talks could lead to new rules, amendments to an existing convention, or a new treaty. === First session === From 2 to 6 March 2026, the group held its penultimate session under the group's three-year mandate. Delegations discussed the chair's rolling draft text, circulated in December 2025, on elements of a possible instrument or other measures concerning lethal autonomous weapon systems. In revised text circulated by the chair on 5 March 2026, a lethal autonomous weapon system was characterized as "a functionally integrated combination of one or more weapons and technological components, that can identify, select, and engage a target, without intervention by a human operator in the execution of these tasks". The text was divided into five boxes to structure discussion. During the session, delegates conducted a first reading of the draft text, and the chair later circulated revised language for several sections. Informal consultations were also held. According to campaign groups and participating observers, support grew during the week for moving to negotiations on the basis of the rolling text, with more than 70 states said to support that step by the end of the session, though some participants warned that attempts to bridge differences risked blurring the group's core purpose. The International Committee of the Red Cross argued that the text should not only restate existing international humanitarian law, but also clarify how those rules apply to autonomous weapons and set out additional measures tailored to the specific challenges such systems raise. Stop Killer Robots likewise emphasized the need to preserve meaningful human judgment and control over increasingly autonomous systems. During the discussions, the U.S. delegation opposed the term "human control" and reportedly proposed the alternative phrase "good faith human judgment and care". Other delegations rejected that wording as too weak, while many states continued to insist that meaningful human control over weapon systems remained essential.

    Read more →
  • Deep Learning Super Sampling

    Deep Learning Super Sampling

    Deep Learning Super Sampling (DLSS) is a suite of real-time deep learning image enhancement and upscaling technologies developed by Nvidia that are available in a number of video games. The goal of these technologies is to allow the majority of the graphics pipeline to run at a lower resolution for increased performance, and then infer a higher resolution image from this that approximates the same level of detail as if the image had been rendered at this higher resolution. This allows for higher graphical settings or frame rates for a given output resolution, depending on user preference. All generations of DLSS are available on all RTX-branded cards from Nvidia in supported titles. However, the Frame Generation feature is only supported on RTX 40 series GPUs or newer and Multi Frame Generation is only available on 50 series GPUs. == History == Nvidia advertised DLSS as a key feature of GeForce RTX 20 series GPUs when they launched in September 2018. At that time, the results were limited to a few video games, namely Battlefield V, or Metro Exodus, because the algorithm had to be trained specifically on each game on which it was applied and the results were usually not as good as simple resolution upscaling. In 2019, Control shipped with ray tracing and an image processing algorithm that approximated DLSS, which did not use the Tensor Cores. In April 2020, Nvidia advertised and shipped an improved version of DLSS named DLSS 2 with driver version 445.75. DLSS 2.0 was available for a few existing games including Control and Wolfenstein: Youngblood, and would later be added to many newly released games and game engines such as Unreal Engine and Unity. This time Nvidia said that it used the Tensor Cores again, and that the AI did not need to be trained specifically on each game. Despite sharing the DLSS branding, the two iterations of DLSS differ significantly and are not backwards-compatible. In January 2025, Nvidia stated that there are over 540 games and apps supporting DLSS, and that over 80% of Nvidia RTX users activate DLSS. In March 2025, there were more than 100 games that support DLSS 4, according to Nvidia. By May 2025, over 125 games supported DLSS 4. The first video game console to use DLSS, the Nintendo Switch 2, was released on June 5, 2025. Nvidia announced DLSS 4.5 at CES 2026. In January 2026, Nvidia stated that over 250 games and applications support Multi Frame Generation. On March 16, 2026, at GTC 2026, Nvidia CEO Jensen Huang presented DLSS 5, a real-time AI model based on neural rendering that realistically enhances lighting and material surfaces at up to 4K resolution while retaining the developer's intended art style. It is planned to release in fall of 2026. In a blog post on its website, Nvidia has announced that DLSS 5 will be available in such games as Assassin's Creed Shadows, Delta Force, Hogwarts Legacy, Naraka: Bladepoint, Phantom Blade Zero, Resident Evil Requiem, Starfield, The Elder Scrolls IV: Oblivion Remastered, and more. On May 31, 2026, Nvidia announced an updated version of Ray Reconstruction for DLSS 4.5 in a blog post, scheduled for release on all RTX GPUs in August of the same year. They said it is designed to better embed spatial awareness into scenes and analyze engine data on movements and lighting conditions, resulting in a sharper, more stable, and less noisy image. === Release timeline === == Technology == === DLSS 1 === The first iteration of DLSS is a predominantly spatial image upscaler with two stages, both relying on convolutional auto-encoder neural networks. The first step is an image enhancement network which uses the current frame and motion vectors to perform edge enhancement, and spatial anti-aliasing. The second stage is an image upscaling step which uses the single raw, low-resolution frame to upscale the image to the desired output resolution. Using just a single frame for upscaling means the neural network itself must generate a large amount of new information to produce the high-resolution output, which can result in slight hallucinations such as leaves that differ in style to the source content. The neural networks are trained on a per-game basis by generating a "perfect frame" using traditional supersampling to 64 samples per pixel, as well as the motion vectors for each frame. The data collected must be as comprehensive as possible, including as many levels, times of day, graphical settings, resolutions, etc. as possible. This data is also augmented using common augmentations such as rotations, colour changes, and random noise to help generalize the test data. Training is performed on Nvidia's Saturn V supercomputer. This first iteration received a mixed response, with many criticizing the often soft appearance and artifacts along with glitches in certain situations; likely a side effect of the limited data from only using a single frame input to the neural networks which could not be trained to perform optimally in all scenarios and edge-cases. Nvidia also demonstrated the ability for the auto-encoder networks to learn the ability to recreate depth-of-field and motion blur, although this functionality has never been included in a publicly released product. === DLSS 2 === DLSS 2 is a temporal anti-aliasing upsampling (TAAU) implementation, using data from previous frames extensively through sub-pixel jittering to resolve fine detail and reduce aliasing. The data DLSS 2 collects includes: the raw low-resolution input, motion vectors, depth buffers, and exposure / brightness information. It can also be used as a simpler TAA implementation where the image is rendered at 100% resolution, rather than being upsampled by DLSS, Nvidia brands this as DLAA (Deep Learning Anti-Aliasing). TAA(U) is used in many modern video games and game engines; however, all previous implementations have used some form of manually written heuristics to prevent temporal artifacts such as ghosting and flickering. One example of this is neighborhood clamping which forcefully prevents samples collected in previous frames from deviating too much compared to nearby pixels in newer frames. This helps to identify and fix many temporal artifacts, but deliberately removing fine details in this way is analogous to applying a blur filter, and thus the final image can appear blurry when using this method. DLSS 2 uses a convolutional auto-encoder neural network trained to identify and fix temporal artifacts, instead of manually programmed heuristics as mentioned above. Because of this, DLSS 2 can generally resolve detail better than other TAA and TAAU implementations, while also removing most temporal artifacts. This is why DLSS 2 can sometimes produce a sharper image than rendering at higher, or even native resolutions using traditional TAA. However, no temporal solution is perfect, and artifacts (ghosting in particular) are still visible in some scenarios when using DLSS 2. Because temporal artifacts occur in most art styles and environments in broadly the same way, the neural network that powers DLSS 2 does not need to be retrained when being used in different games. Despite this, Nvidia does frequently ship new minor revisions of DLSS 2 with new titles, so this could suggest some minor training optimizations may be performed as games are released, although Nvidia does not provide changelogs for these minor revisions to confirm this. The main advancements compared to DLSS 1 include: Significantly improved detail retention, a generalized neural network that does not need to be re-trained per-game, and ~2x less overhead (~1–2 ms vs ~2–4 ms). It should also be noted that forms of TAAU such as DLSS 2 are not upscalers in the same sense as techniques such as ESRGAN or DLSS 1, which attempt to create new information from a low-resolution source; instead, TAAU works to recover data from previous frames, rather than creating new data. In practice, this means low resolution textures in games will still appear low-resolution when using current TAAU techniques. This is why Nvidia recommends game developers use higher resolution textures than they would normally for a given rendering resolution by applying a mip-map bias when DLSS 2 is enabled. === DLSS 3 === Augments DLSS 2 with improved image quality and the introduction of a new motion interpolation feature, called Frame Generation. The DLSS Frame Generation algorithm takes two rendered frames from the rendering pipeline and generates a new frame that smoothly transitions between them. For every frame rendered, one additional frame is generated. DLSS 3.0 makes use of a new generation Optical Flow Accelerator (OFA) included in the Ada Lovelace architecture of GeForce RTX 40 series GPUs and with that is exclusive to them. The new OFA is said to be faster and more accurate than the one already available in previous Turing and Ampere RTX GPUs. === DLSS 3.5 === DLSS 3.5 adds Ray Reconstruction, replacing multiple denoising algorithms with a single AI model trained o

    Read more →
  • Ulead MediaStudio Pro

    Ulead MediaStudio Pro

    Ulead MediaStudio Pro (MSP) is real-time, timeline based prosumer level video editing software by Ulead Systems. It is a suite of 5 digital video and audio applications, including: Video Capture, Video Paint, CG Infinity, Audio Editor and Video Editor. MSP is only available on the Windows platform. Since version 8.0, CG Infinity and Video Paint are separate from the MSP suite, and are being sold as a combination product called VideoGraphics Lab (VGL). On June 18, 2008, Corel formally announced that MediaStudio Pro would be discontinued. The final MediaStudio Pro version was 8.10.0039 (Pro 8 Service Pack 1) released June 2, 2006. Corel discontinued support for MediaStudio Pro in June 2009. Version 6.0 is last version to support Windows 95, although recent versions are not compatible with Windows Vista or Windows 7. == Modules == There are 5 stand-alone modules in MSP before version 8.0, they are: Video Capture – The video capturing module in MSP. Video Paint – A frame-by-frame editor which can let user to make some image or hand-drawing effects on video frames. CG Infinity – A vector-based video editing tool which allows user to create logo animation or vector graphics on video frames. Audio Editor – The audio editing tool in MSP. It can utilize DirectX audio filters and Ulead audio filters to do audio effect processing. Video Editor – The module that users do video editing with audio/video effects. It can also utilize DirectX audio filters and 3rd party video filters to do the video editing. Since version 8.0, CG Infinity and Video Paint have been separated from the MSP suite and are being sold as a combination product called VideoGraphics Lab (VGL). == Editions == Ulead MediaStudio Pro had several editions before version 7.0. They are: Full edition: this edition includes all 5 modules. Director's Cut edition: this edition has 3 modules including Video Capture, Video Editor and Audio Editor. SE edition: SE means Simple Edition or Special Edition and is an OEM bundle version. It also includes the 3 modules as Director's Cut, however, is feature limited. Sometimes it will be given freely in video magazines. After version 7.0 only Full edition is available in the MSP suite. On June 18, 2008, Corel formally announced that MediaStudio Pro would be discontinued. == Release history ==

    Read more →
  • Speech recognition

    Speech recognition

    Speech recognition (automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT)) is a sub-field of computational linguistics concerned with methods and technologies that translate spoken language into text or other interpretable forms. Speech recognition applications include voice user interfaces, where the user speaks to a device, which "listens" and processes the audio. Common voice applications include interpreting commands for calling, call routing, home automation, and aircraft control. These applications are called direct voice input. Productivity applications include searching audio recordings, creating transcripts, and dictation. Speech recognition can be used to analyse speaker characteristics, such as identifying native language using pronunciation assessment. Voice recognition (speaker identification) refers to identifying the speaker, rather than speech contents. Recognizing the speaker can simplify the task of translating speech in systems trained on a specific person's voice. It can also be used to authenticate the speaker as part of a security process. == History == Applications for speech recognition developed over many decades, with progress accelerated due to advances in deep learning and the use of big data. These advances are reflected in an increase in academic papers, and greater system adoption. Key areas of growth include vocabulary size, more accurate recognition for unfamiliar speakers (speaker independence), and faster processing speed. === Pre-1970 === 1952 – Bell Labs researchers, Stephen Balashek, R. Biddulph, and K. H. Davis, built Audrey for single-speaker digit recognition. Their system located the formants in the power spectrum of each utterance. 1960 – Gunnar Fant developed and published the source–filter model of speech production. 1962 – IBM's 16-word "Shoebox" machine's speech recognition debuted at the 1962 World's Fair. 1966 – Linear predictive coding, a speech coding method, was proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone. 1969 – Funding at Bell Labs came to a halt for several years after the company's head engineer, John R. Pierce, wrote an open letter criticizing speech recognition research. This defunding lasted until Pierce retired and James L. Flanagan took over. Raj Reddy was the first person to work on continuous speech recognition, as a graduate student at Stanford University in the late 1960s. Previous systems required users to pause after each word. Reddy's system issued spoken commands for playing chess. Around this time, Soviet researchers invented the dynamic time warping (DTW) algorithm and used it to create a recognizer capable of operating on a 200-word vocabulary. DTW processed speech by dividing it into short frames (e.g. 10 ms segments) and treating each frame as a unit. Speaker independence, however, remained unsolved. === 1970–1990 === 1971 – DARPA funded a five-year speech recognition research project, Speech Understanding Research, seeking a minimum vocabulary size of 1,000 words. The project considered speech understanding a key to achieving progress in speech recognition, which was later disproved. BBN, IBM, Carnegie Mellon (CMU), and Stanford Research Institute participated. 1972 – The IEEE Acoustics, Speech, and Signal Processing group held a conference in Newton, Massachusetts. 1976 – The first ICASSP was held in Philadelphia, which became a major venue for publishing on speech recognition. During the late 1960s, Leonard Baum developed the mathematics of Markov chains at the Institute for Defense Analysis. A decade later, at CMU, Raj Reddy's students James Baker and Janet M. Baker began using the hidden Markov model (HMM) for speech recognition. James Baker had learned about HMMs while at the Institute for Defense Analysis. HMMs enabled researchers to combine sources of knowledge, such as acoustics, language, and syntax, in a unified probabilistic model. By the mid-1980s, Fred Jelinek's team at IBM created a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary. Jelinek's statistical approach placed less emphasis on emulating human brain processes in favor of statistical modelling. (Jelinek's group independently discovered the application of HMMs to speech.) This was controversial among linguists since HMMs are too simplistic to account for many features of human languages. However, the HMM proved to be a highly useful way for modelling speech and replaced dynamic time warping as the dominant speech recognition algorithm in the 1980s. 1982 – Dragon Systems, founded by James and Janet M. Baker, was one of IBM's few competitors. === Practical speech recognition === The 1980s also saw the introduction of the n-gram language model. 1987 – The back-off model enabled language models to use multiple-length n-grams, and CSELT used HMM to recognize languages (in software and hardware, e.g. RIPAC). At the end of the DARPA program in 1976, the best computer available to researchers was the PDP-10 with 4 MB of RAM. It could take up to 100 minutes to decode 30 seconds of speech. Practical products included: 1984 – the Apricot Portable was released with up to 4096 words support, of which only 64 could be held in RAM at a time. 1987 – a recognizer from Kurzweil Applied Intelligence 1990 – Dragon Dictate, a consumer product released in 1990. AT&T deployed the Voice Recognition Call Processing service in 1992 to route telephone calls without a human operator. The technology was developed by Lawrence Rabiner and others at Bell Labs. By the early 1990s, the vocabulary of the typical commercial speech recognition system had exceeded the average human vocabulary. Reddy's former student, Xuedong Huang, developed the Sphinx-II system at CMU. Sphinx-II was the first to do speaker-independent, large vocabulary, continuous speech recognition, and it won DARPA's 1992 evaluation. Handling continuous speech with a large vocabulary was a major milestone. Huang later founded the speech recognition group at Microsoft in 1993. Reddy's student Kai-Fu Lee joined Apple, where, in 1992, he helped develop the Casper speech interface prototype. Lernout & Hauspie, a Belgium-based speech recognition company, acquired other companies, including Kurzweil Applied Intelligence in 1997 and Dragon Systems in 2000. L&H was used in Windows XP. L&H was an industry leader until an accounting scandal destroyed it in 2001. L&H speech technology was bought by ScanSoft, which became Nuance in 2005. Apple licensed Nuance software for its digital assistant Siri. ==== 2000s ==== In the 2000s, DARPA sponsored two speech recognition programs: Effective Affordable Reusable Speech-to-Text (EARS) in 2002, followed by Global Autonomous Language Exploitation (GALE) in 2005. Four teams participated in EARS: IBM; a team led by BBN with LIMSI and the University of Pittsburgh; Cambridge University; and a team composed of ICSI, SRI, and the University of Washington. EARS funded the collection of the Switchboard telephone speech corpus, which contained 260 hours of recorded conversations from over 500 speakers. The GALE program focused on Arabic and Mandarin broadcast news. Google's first effort at speech recognition came in 2007 after recruiting Nuance researchers. Its first product, GOOG-411, was a telephone-based directory service. Since at least 2006, the U.S. National Security Agency has employed keyword spotting, allowing analysts to index large volumes of recorded conversations and identify speech containing "interesting" keywords. Other government research programs focused on intelligence applications, such as DARPA's EARS program and IARPA's Babel program. In the early 2000s, speech recognition was dominated by hidden Markov models combined with feed-forward artificial neural networks (ANN). Later, speech recognition was taken over by long short-term memory (LSTM), a recurrent neural network (RNN) published by Sepp Hochreiter & Jürgen Schmidhuber in 1997. LSTM RNNs avoid the vanishing gradient problem and can learn "Very Deep Learning" tasks that require memories of events that happened thousands of discrete time steps earlier, which is important for speech. Around 2007, LSTMs trained with Connectionist Temporal Classification (CTC) began to outperform. In 2015, Google reported a 49 percent error‑rate reduction in its speech recognition via CTC‑trained LSTM. Transformers, a type of neural network based solely on attention, were adopted in computer vision and language modelling, and then to speech recognition. Deep feed-forward (non-recurrent) networks for acoustic modelling were introduced in 2009 by Geoffrey Hinton and his students at the University of Toronto, and by Li Deng and colleagues at Microsoft Research. In contrast to the prioer incremental improvements, deep learning decreased error rates by 30%. Both shallow and deep forms (e.g., recurrent nets) of ANNs had been explored since the 1980s. Howev

    Read more →
  • Quantum image processing

    Quantum image processing

    Quantum image processing (QIMP) is using quantum computing or quantum information processing to create and work with quantum images. Due to some of the properties inherent to quantum computation, notably entanglement and parallelism, it is hoped that QIMP technologies will offer capabilities and performances that surpass their traditional equivalents, in terms of computing speed, security, and minimum storage requirements. == Background == A. Y. Vlasov's work in 1997 focused on using a quantum system to recognize orthogonal images. This was followed by efforts using quantum algorithms to search specific patterns in binary images and detect the posture of certain targets. Notably, more optics-based interpretations for quantum imaging were initially experimentally demonstrated in and formalized in after seven years. In 2003, Salvador Venegas-Andraca and S. Bose presented Qubit Lattice, the first published general model for storing, processing and retrieving images using quantum systems. Later on, in 2005, Latorre proposed another kind of representation, called the Real Ket, whose purpose was to encode quantum images as a basis for further applications in QIMP. Furthermore, in 2010 Venegas-Andraca and Ball presented a method for storing and retrieving binary geometrical shapes in quantum mechanical systems in which it is shown that maximally entangled qubits can be used to reconstruct images without using any additional information. Technically, these pioneering efforts with the subsequent studies related to them can be classified into three main groups: Quantum-assisted digital image processing (QDIP): These applications aim at improving digital or classical image processing tasks and applications. Optics-based quantum imaging (OQI) Classically inspired quantum image processing (QIMP) A survey of quantum image representation has been published in. Furthermore, the recently published book Quantum Image Processing provides a comprehensive introduction to quantum image processing, which focuses on extending conventional image processing tasks to the quantum computing frameworks. It summarizes the available quantum image representations and their operations, reviews the possible quantum image applications and their implementation, and discusses the open questions and future development trends. == Quantum image representations == There are various approaches for quantum image representation, that are usually based on the encoding of color information. A common representation is FRQI (Flexible Representation for Quantum Images), that captures the color and position at every pixel of the image, and defined as: | I ⟩ = 1 2 n ∑ i = 0 2 2 n − 1 | c i ⟩ ⊗ | i ⟩ {\displaystyle \vert I\rangle ={\frac {1}{2^{n}}}\sum _{i=0}^{2^{2n-1}}\vert c_{i}\rangle \otimes \vert i\rangle } where | i ⟩ {\textstyle |i\rangle } is the position and | c i ⟩ = c o s θ i | 0 ⟩ + s i n θ i | 1 ⟩ {\textstyle \vert c_{i}\rangle =cos\theta _{i}\vert 0\rangle +sin\theta _{i}\vert 1\rangle } the color with a vector of angles θ i ∈ [ 0 , π / 2 ] {\textstyle \theta _{i}\in \left[0,\pi /2\right]} . As it can be seen, | c i ⟩ {\textstyle \vert c_{i}\rangle } is a regular qubit state of the form | ψ ⟩ = α | 0 ⟩ + β | 1 ⟩ {\displaystyle \vert \psi \rangle =\alpha \vert 0\rangle +\beta \vert 1\rangle } , with basis states | 0 ⟩ = ( 1 0 ) {\textstyle \vert 0\rangle ={\begin{pmatrix}1\\0\end{pmatrix}}} and | 1 ⟩ = ( 0 1 ) {\textstyle \vert 1\rangle ={\begin{pmatrix}0\\1\end{pmatrix}}} , as well as amplitudes α {\textstyle \alpha } and β {\textstyle \beta } that satisfy | α | 2 + | β | 2 = 1 {\textstyle \left|\alpha \right|^{2}+\left|\beta \right|^{2}=1} . Another common representation is MCQI (Multi-Channel Representation for Quantum Images), that uses the RGB channels with quantum states and following FRQI definition: | I ⟩ = 1 2 n + 1 ∑ i = 0 2 2 n − 1 | C R G B i ⟩ ⊗ | i ⟩ {\displaystyle \vert I\rangle ={\frac {1}{2^{n+1}}}\sum _{i=0}^{2^{2n-1}}\vert C_{RGB}^{i}\rangle \otimes \vert i\rangle } | C R G B i ⟩ = cos ⁡ θ R i | 000 ⟩ + cos ⁡ θ G i | 001 ⟩ + cos ⁡ θ B i | 010 ⟩ + sin ⁡ θ R i | 100 ⟩ + sin ⁡ θ G i | 101 ⟩ + sin ⁡ θ B i | 110 ⟩ + cos ⁡ θ α | 011 ⟩ + sin ⁡ θ α | 111 ⟩ {\displaystyle {\begin{aligned}{\begin{aligned}\vert C_{RGB}^{i}\rangle &={\cos \theta _{R}^{i}\vert 000\rangle }+{\cos \theta _{G}^{i}\vert 001\rangle }+{\cos \theta _{B}^{i}\vert 010\rangle }\\&\quad +{\sin \theta _{R}^{i}\vert 100\rangle }+{\sin \theta _{G}^{i}\vert 101\rangle }+{\sin \theta _{B}^{i}\vert 110\rangle }\\&\quad +{\cos {\theta _{\alpha }}\vert 011\rangle }+{\sin \theta _{\alpha }\vert 111\rangle }\end{aligned}}\end{aligned}}} Departing from the angle-based approach of FRQI and MCQI, and using a qubit sequence, NEQR (Novel Enhanced Representation for Quantum Images) is another representation approach, that uses a function f ( y , x ) = C y x q − 1 C y x q − 2 … C y x 1 C y x 0 {\textstyle f\left(y,x\right)=C_{yx}^{q-1}C_{yx}^{q-2}\ldots C_{yx}^{1}C_{yx}^{0}} to encode color values for a 2 n × 2 n {\displaystyle 2^{n}\times 2^{n}} image: | I ⟩ = 1 2 n ∑ y = 0 2 n − 1 ∑ x = 0 2 n − 1 | f ( y , x ) ⟩ | y x ⟩ {\displaystyle \vert I\rangle ={\frac {1}{2^{n}}}\sum _{y=0}^{2^{n}-1}\sum _{x=0}^{2^{n}-1}\vert f\left(y,x\right)\rangle \vert yx\rangle } == Quantum image manipulations == A lot of the effort in QIMP has been focused on designing algorithms to manipulate the position and color information encoded using flexible representation of quantum images (FRQI) and its many variants. For instance, FRQI-based fast geometric transformations including (two-point) swapping, flip, (orthogonal) rotations and restricted geometric transformations to constrain these operations to a specified area of an image were initially proposed. Recently, NEQR-based quantum image translation to map the position of each picture element in an input image into a new position in an output image and quantum image scaling to resize a quantum image were discussed. While FRQI-based general form of color transformations were first proposed by means of the single qubit gates such as X, Z, and H gates. Later, Multi-Channel Quantum Image-based channel of interest (CoI) operator to entail shifting the grayscale value of the preselected color channel and the channel swapping (CS) operator to swap the grayscale values between two channels have been fully discussed. To illustrate the feasibility and capability of QIMP algorithms and application, researchers always prefer to simulate the digital image processing tasks on the basis of the QIRs that we already have. By using the basic quantum gates and the aforementioned operations, so far, researchers have contributed to quantum image feature extraction, quantum image segmentation, quantum image morphology, quantum image comparison, quantum image filtering, quantum image classification, quantum image stabilization, among others. In particular, QIMP-based security technologies have attracted extensive interest of researchers as presented in the ensuing discussions. Similarly, these advancements have led to many applications in the areas of watermarking, encryption, and steganography etc., which form the core security technologies highlighted in this area. In general, the work pursued by the researchers in this area are focused on expanding the applicability of QIMP to realize more classical-like digital image processing algorithms; propose technologies to physically realize the QIMP hardware; or simply to note the likely challenges that could impede the realization of some QIMP protocols. == Quantum image transform == By encoding and processing the image information in quantum-mechanical systems, a framework of quantum image processing is presented, where a pure quantum state encodes the image information: to encode the pixel values in the probability amplitudes and the pixel positions in the computational basis states. Given an image F = ( F i , j ) M × L {\displaystyle F=(F_{i,j})_{M\times L}} , where F i , j {\displaystyle F_{i,j}} represents the pixel value at position ( i , j ) {\displaystyle (i,j)} with i = 1 , … , M {\displaystyle i=1,\dots ,M} and j = 1 , … , L {\displaystyle j=1,\dots ,L} , a vector f → {\displaystyle {\vec {f}}} with M L {\displaystyle ML} elements can be formed by letting the first M {\displaystyle M} elements of f → {\displaystyle {\vec {f}}} be the first column of F {\displaystyle F} , the next M {\displaystyle M} elements the second column, etc. A large class of image operations is linear, e.g., unitary transformations, convolutions, and linear filtering. In the quantum computing, the linear transformation can be represented as | g ⟩ = U ^ | f ⟩ {\displaystyle |g\rangle ={\hat {U}}|f\rangle } with the input image state | f ⟩ {\displaystyle |f\rangle } and the output image state | g ⟩ {\displaystyle |g\rangle } . A unitary transformation can be implemented as a unitary evolution. Some basic and commonly used image transforms (e.g., the Fourier, Hadamard, an

    Read more →
  • Cognitive philology

    Cognitive philology

    Cognitive philology is the science that studies written and oral texts as the product of human mental processes. Studies in cognitive philology compare documentary evidence emerging from textual investigations with results of experimental research, especially in the fields of cognitive and ecological psychology, neurosciences and artificial intelligence. "The point is not the text, but the mind that made it". Cognitive Philology aims to foster communication between literary, textual, philological disciplines on the one hand and researches across the whole range of the cognitive, evolutionary, ecological and human sciences on the other. Cognitive philology: investigates transmission of oral and written text, and categorization processes which lead to classification of knowledge, mostly relying on the information theory; studies how narratives emerge in so called natural conversation and selective process which lead to the rise of literary standards for storytelling, mostly relying on embodied semantics; explores the evolutive and evolutionary role played by rhythm and metre in human ontogenetic and phylogenetic development and the pertinence of the semantic association during processing of cognitive maps; Provides the scientific ground for multimedia critical editions of literary texts. Among the founding thinkers and noteworthy scholars devoted to such investigations are: Alan Richardson: Studies Theory of Mind in early-modern and contemporary literature. Anatole Pierre Fuksas Benoît de Cornulier David Herman: Professor of English at North Carolina State University and an adjunct professor of linguistics at Duke University. He is the author of "Universal Grammar and Narrative Form" and the editor of "Narratologies: New Perspectives on Narrative Analysis". Domenico Fiormonte François Recanati Gilles Fauconnier, a professor in Cognitive science at the University of California, San Diego. He was one of the founders of cognitive linguistics in the 1970s through his work on pragmatic scales and mental spaces. His research explores the areas of conceptual integration and compressions of conceptual mappings in terms of the emergent structure in language. Julián Santano Moreno Luca Nobile Manfred Jahn in Germany Mark Turner Paolo Canettieri

    Read more →
  • WebPlus

    WebPlus

    Serif WebPlus was a website design program for Microsoft Windows, developed by the software company, Serif. It allows users to design, create and upload their website onto the internet without any knowledge of HTML or other web technologies. Much like Microsoft Word, WebPlus uses WYSIWYG drag and drop editing to add and position text, images and links as they would appear on the finished web page. Once a user has designed their site, WebPlus can preview the site in a web browser before uploading the site using the in-built FTP. The software comes with a variety of pre-designed sample websites containing Filler text like Lorem ipsum, which can be used as a template for quickly designing a site. It also provides drawing tools for creating and editing buttons and web graphics. == Free WebPlus Starter Edition == Previously Serif had made available feature limited Starter Editions of their software, based on older versions, which could be obtained and used free of charge. For WebPlus the final free edition was based on version X5 and this was released in September 2012. This continued to be available from Serif's server until it was withdrawn around March 2016. WebPlus was then only available as a paid-for version X8. == Program Withdrawal == In March 2016, Serif announced that WebPlus X8 would be the final version, and that there were no current plans to design an application to replace it. Sales of WebPlus X8 by Serif were ended around December 2016. In early 2018, Serif announced that Serif Web Resources, hosted on Serif servers and required to implement some advanced web-site functionality in WebPlus created sites, would no longer work after 31 August 2018. In 2018, Serif also shutdown the servers that generated the "Plus" software registration numbers on-line from the product version and the individual generated installation number. Serif revealed the alternative was to use a universal master registration number, which is 881887. This is known to work with post 2003 Serif "Plus" software (e.g. verified to work with PagePlus v5.02). However, later Serif "Plus" software still registers itself automatically if within a certain recent period of a previous Serif software registration on the same PC. == Supported platforms == WebPlus was developed for Microsoft Windows "Win32" graphical desktop interface and is fully compatible with Windows XP, Windows Vista (32/64bit), Windows 7 (32/64bit) and Windows 8. == Features == Web hosting to upload websites to the internet with the address www.sitename.webplus.net and email [email protected]. E-Commerce tool to create online stores with providers such as PayPal. Form wizard generates online forms to collect information from website visitors. Add blogs, forums, hit counters, online polls and content management systems to websites using Smart Objects. Google Maps tool embeds maps and optional navigation markers within a website. Site navigation bars adopt a website's structure providing a tool for navigating around the website. Photo gallery groups a collection of images together and displays them as an animated slideshow. Search engine optimization (SEO) tools optimise a websites search ranking with the likes of Google, Yahoo! and Bing. Collect website metrics such as page popularity and number of website hits using Google Analytics. WebPlus X5 introduced a button studio for creating button graphics. Restrict access to specific pages on a website with a secure member's area. WebPlus automatically converts images and graphics into a web targeted format, optimising them for fast download. Embed YouTube videos within a web page. Add animated effects to a website with Animated GIFs, Animated Marquees or by importing Flash videos. Stream news and information feeds to a website using RSS and podcasts. Automated Site Checker analyses and corrects potential problems with a website. AdSense tool incorporates Google AdSense advertisements into a website In-built FTP transfers files onto a web server, uploading a website to the internet. In-built Basic Photo Editor the PhotoLab can make automatic adjustments and "Quick Fix's" to photos. From X5, WebPlus offers image editing and filters, through its PhotoLab and also provides a dedicated background-removal tool in the form of Cutout Studio. Display images, Flash videos and web pages using animated Lightboxes. Filter Effects can be applied to the graphical objects, giving convincing, realistic effects such as glass, metallic, plastic and other 2D/3D filters. WebPlus also provides QuickShapes for creating button and web graphics. These predefined shapes can be quickly modified with sliders to adjust certain parameters, for example creating rounded rectangles, etc. Shapes include: rectangles, ellipses, stars, spirals, cogs, petals, etc.

    Read more →
  • VGACAD

    VGACAD

    VGACAD was the parent of a suite of shareware graphic utilities made for the MS-DOS operating system used in the IBM PC and clones. It was popular for editing and capturing images using BSAVE (graphics image format) and provided an early graphic editing suite compatible with multiple graphic cards and resolutions, used on the IBM PC. == Usage == Written by Lawrence Gozum in 1987, it was the genesis of multiple versions and improvements over 10 years. Ran with his brother, Marvin initially helped with design ideas, strategic focus, technical support calls, and managing the early shareware business. The growth of the VGACAD suite grew quickly to preoccupy most of their time. Lawrence then focused more of his efforts on software and formed Applied Insights, to manage VGACAD and its offspring, VidFun, and Ai Picture Explorer. At its peak, its users ranged from individuals, Federal government offices, museums and major newspapers. == Features == VGACAD was a misnomer, and meant VGA-Computer Assisted Drawing, rather than computer-aided design, as CAD is commonly referred to today. Its longevity was due to its color accuracy, speed, small size, and that its suite of small utilities often worked stand-alone. One called VGACAP, for 'capture', dumped video memory into a file that could later be converted to popular graphic image formats, later made commonplace when Microsoft Windows programmed the print screen key to dump graphics into the clipboard. However, VGACAP ran insulated apart from early versions of Windows, and thus could capture screens were applications prohibited such function.

    Read more →
  • FMLLR

    FMLLR

    In signal processing, Feature space Maximum Likelihood Linear Regression (fMLLR) is a global feature transform that are typically applied in a speaker adaptive way, where fMLLR transforms acoustic features to speaker adapted features by a multiplication operation with a transformation matrix. In some literature, fMLLR is also known as the Constrained Maximum Likelihood Linear Regression (cMLLR). == Overview == fMLLR transformations are trained in a maximum likelihood sense on adaptation data. These transformations may be estimated in many ways, but only maximum likelihood (ML) estimation is considered in fMLLR. The fMLLR transformation is trained on a particular set of adaptation data, such that it maximizes the likelihood of that adaptation data given a current model-set. This technique is a widely used approach for speaker adaptation in HMM-based speech recognition. Later research also shows that fMLLR is an excellent acoustic feature for DNN/HMM hybrid speech recognition models. The advantage of fMLLR includes the following: the adaptation process can be performed within a pre-processing phase, and is independent of the ASR training and decoding process. this type of adapted feature can be applied to deep neural networks (DNN) to replace traditionally used mel-spectrogram in end-to-end speech recognition models. fMLLR's speaker adaptation process leads to a significant performance boost for ASR models, hence outperforming other transform or features like MFCCs (Mel-Frequency Cepstral Coefficients) and FBANKs (Filter bank) coefficients. fMLLR features can be efficiently realized with speech toolkits like Kaldi. Major problem and disadvantage of fMLLR: when the amount of adaptation data is limited, the transformation matrices tends to easily overfit the given data. == Computing fMLLR transform == Feature transform of fMLLR can be easily computed with the open source speech tool Kaldi, the Kaldi script uses the standard estimation scheme described in Appendix B of the original paper, in particular the section Appendix B.1 "Direct method over rows". In the Kaldi formulation, fMLLR is an affine feature transform of the form x {\displaystyle x} → A {\displaystyle A} x {\displaystyle x} + b {\displaystyle +b} , which can be written in the form x {\displaystyle x} →W x ^ {\displaystyle {\hat {x}}} , where x ^ {\displaystyle {\hat {x}}} = [ x 1 ] {\displaystyle {\begin{bmatrix}x\\1\end{bmatrix}}} is the acoustic feature x {\displaystyle x} with a 1 appended. Note that this differs from some of the literature where the 1 comes first as x ^ {\displaystyle {\hat {x}}} = [ 1 x ] {\displaystyle {\begin{bmatrix}1\\x\end{bmatrix}}} . The sufficient statistics stored are: K = ∑ t , j , m γ j , m ( t ) Σ j m − 1 μ j m x ( t ) + {\displaystyle K=\sum _{t,j,m}\gamma _{j,m}(t)\textstyle \Sigma _{jm}^{-1}\mu _{jm}x(t)^{+}\displaystyle } where Σ j m − 1 {\displaystyle \textstyle \Sigma _{jm}^{-1}\displaystyle } is the inverse co-variance matrix. And for 0 ≤ i ≤ D {\displaystyle 0\leq i\leq D} where D {\displaystyle D} is the feature dimension: G ( i ) = ∑ t , j , m γ j , m ( t ) ( 1 σ j , m 2 ( i ) ) x ( t ) + x ( t ) + T {\displaystyle G^{(i)}=\sum _{t,j,m}\gamma _{j,m}(t)\left({\frac {1}{\sigma _{j,m}^{2}(i)}}\right)x(t)^{+}x(t)^{+T}\displaystyle } For a thorough review that explains fMLLR and the commonly used estimation techniques, see the original paper "Maximum likelihood linear transformations for HMM-based speech recognition ". Note that the Kaldi script that performs the feature transforms of fMLLR differs with by using a column of the inverse in place of the cofactor row. In other words, the factor of the determinant is ignored, as it does not affect the transform result and can causes potential danger of numerical underflow or overflow. == Comparing with other features or transforms == Experiment result shows that by using the fMLLR feature in speech recognition, constant improvement is gained over other acoustic features on various commonly used benchmark datasets (TIMIT, LibriSpeech, etc). In particular, fMLLR features outperform MFCCs and FBANKs coefficients, which is mainly due to the speaker adaptation process that fMLLR performs. In, phoneme error rate (PER, %) is reported for the test set of TIMIT with various neural architectures: As expected, fMLLR features outperform MFCCs and FBANKs coefficients despite the use of different model architecture. Where MLP (multi-layer perceptron) serves as a simple baseline, on the other hand RNN, LSTM, and GRU are all well known recurrent models. The Li-GRU architecture is based on a single gate and thus saves 33% of the computations over a standard GRU model, Li-GRU thus effectively address the gradient vanishing problem of recurrent models. As a result, the best performance is obtained with the Li-GRU model on fMLLR features. == Extract fMLLR features with Kaldi == fMLLR can be extracted as reported in the s5 recipe of Kaldi. Kaldi scripts can certainly extract fMLLR features on different dataset, below are the basic example steps to extract fMLLR features from the open source speech corpora Librispeech. Note that the instructions below are for the subsets train-clean-100,train-clean-360,dev-clean, and test-clean, but they can be easily extended to support the other sets dev-other, test-other, and train-other-500. These instruction are based on the codes provided in this GitHub repository, which contains Kaldi recipes on the LibriSpeech corpora to execute the fMLLR feature extraction process, replace the files under $KALDI_ROOT/egs/librispeech/s5/ with the files in the repository. Install Kaldi. Install Kaldiio. If running on a single machine, change the following lines in $KALDI_ROOT/egs/librispeech/s5/cmd.sh to replace queue.pl to run.pl: Change the data path in run.sh to your LibriSpeech data path, the directory LibriSpeech/ should be under that path. For example: Install flac with: sudo apt-get install flac Run the Kaldi recipe run.sh for LibriSpeech at least until Stage 13 (included), for simplicity you can use the modified run.sh. Copy exp/tri4b/trans. files into exp/tri4b/decode_tgsmall_train_clean_/ with the following command: Compute the fMLLR features by running the following script, the script can also be downloaded here: Compute alignments using: Apply CMVN and dump the fMLLR features to new .ark files, the script can also be downloaded here: Use the Python script to convert Kaldi generated .ark features to .npy for your own dataloader, an example Python script is provided:

    Read more →
  • Fling (social network)

    Fling (social network)

    Fling was a social media app available for IOS and Android. It was founded in 2014 by Marco Nardone and was taken offline in August 2016. == Overview == In 2012, Marco Nardone founded the startup Unii and launched Unii.com, a social network intended for students in the UK. While working on this service, Nardone had the idea for a messaging service where pictures could be sent to strangers in January 2014. The app Fling was then developed and released between March and July 2014. After a month, it already had 375,000 downloads and 180,000 active users on iOS. Users were able to take pictures inside the app and send them to 50 random people all over the world. The recipient could then choose to answer via chat or reply by sending a picture themselves. The app was used by many users as a medium to exchange sexually explicit pictures and for sexting with strangers. This led to the app being removed from the App Store in June 2015. In the 19 days that followed, flings developers rewrote the App almost completely from scratch, working around the clock. The feature to message random strangers was removed, and the app was readmitted into the App Store as a messenger App resembling Snapchat. But the redesigned Application did not have the success of its predecessor. The funding ran out and the parent company Unii went bankrupt. The company was not able to pay their content moderation team anymore, leading to a new surge of pornographic content on the App. Shortly after that, the Social Network was taken offline in August 2016. It has been inactive since. During the 2 years Fling was online, $21 million was raised from investors while generating no revenue at all. Of this $21 million (£16.5m), £5 million came from Nardone's father. == Allegations against CEO == Former employees made multiple allegations against Marco Nardone, the Founder and CEO of Unii and Fling. According to these claims, he behaved erratic and abusive, throwing "things across the office". He hired his girlfriend as the head of human resources to handle issues between him and his staff. Employees who left the company often had "some part of their pay held back". According to the reports, he also spent the money raised from investors irresponsibly, having no clear concept of a budget. Some of that money was used on expensive restaurants in London, a luxurious office for CEO Nardone and advertisements for Fling on Twitter and Facebook. Nardone also spent time partying in Ibiza with two employees, while the developer team in London frantically tried to get Fling back online after it being removed from the App Store. In December 2017 he pleaded guilty to assaulting his girlfriend at a domestic violence court.

    Read more →
  • Wispr

    Wispr

    Wispr AI is a software company founded in 2021 by Tanay Kothari and Sahaj Garg that develops voice-based interfaces for computers and other devices. The company’s main product, Wispr Flow, is an AI-powered speech-to-text application available on macOS, Windows and iOS. == History == Wispr was founded in 2021 with the goal of building a non-invasive wearable device that would allow users to control smartphones without touch input. The device was intended to translate neurological signals into actions and to enable silent text entry by mouthing words, drawing on techniques similar to brain–computer interfaces. Early funding was directed toward this hardware-focused effort. After around three years of development, Wispr concluded that contemporary AI systems were not sufficient for the requirements of the wearable device. The company shifted its focus to Flow voice dictation software, the software layer originally built for the wearable, and in 2024 released a macOS application based on this platform. == Wispr Flow == Wispr Flow (often referred to as Flow) is a speech-to-text application for macOS, Windows and iOS. It provides real-time dictation and transcription in more than 100 languages and can operate across applications, including email clients, messaging platforms and chatbots. In June 2025 Wispr released an iOS version that functions as a third-party keyboard, allowing voice input in any app. == Technology == Wispr Flow is based on automatic speech recognition (ASR) and other AI models. The system adapts to individual users over time, learning their vocabulary and preferred style with the aim of reducing manual editing. Flow operates through configurable “Flow Sessions”, defined as time windows during which the app has access to the microphone; users can set session timeouts or disable automatic time limits. == Users and Adoption == Wispr initially targeted users such as venture capitalists, entrepreneurs and executives who process large volumes of text and often work in private or flexible environments. The user base later expanded via platforms such as Product Hunt to students, software developers, writers, lawyers and consultants. Flow has also been adopted by users with conditions such as ADHD, dyslexia, paralysis and carpal tunnel syndrome. About 40% of users are in the United States, 30% in Europe and the remaining 30% in other regions. More than 30% of users come from non-technical backgrounds. Flow supports 104 languages, with approximately 40% of dictations in English and 60% in other languages, including Spanish, French, German, Dutch, Hindi and Mandarin. Wispr has reported monthly user growth above 50%, a six-month active-user retention rate of about 80%, a payment rate around 19%, and revenue of approximately US$3.8 million between July 2024 and July 2025. == Development == Wispr has announced plans for an Android application and maintains waiting lists for Android, Linux and web versions of Flow. The company is developing shared-context features for teams so that the software can recognize common terminology within organizations and has stated that it aims to evolve Flow into a broader AI assistant for tasks such as messaging, note-taking and reminders. Wispr has also reported working with unnamed AI hardware partners on interaction layers for future devices. == Funding == In 2025 Wispr raised US$30 million in a Series A funding round led by Menlo Ventures, with participation from NEA, 8VC and several individual investors, including Evan Sharp and Henry Ward. Earlier investors include Neo, MVP Ventures and AIX Ventures. In November of that same year, the company raised a US$25 million Series A extension led by Notable Capital, with participation from Flight Fund, bringing its total funding to US$81 million. Wispr competes with other AI-based dictation and voice-input tools, including Aqua, Talktastic, Superwhisper and Betterdication.

    Read more →
  • Conference app

    Conference app

    A conference app, also known as an event app or meeting app, is a mobile app developed to help attendees and meeting planners manage their conference experience. It typically includes conference proceedings and venue information, allowing users to create personalized schedules and engage with other users. A conference app can be a native app or web-based. In recent years, conference apps have gained in popularity as a sustainable solution for event management by reducing paper produced by printed materials. Advanced features often include real-time notifications for updates or changes, integration with virtual meeting platforms for hybrid or fully online events, and analytics tools for organizers to measure attendance and engagement. Additionally, some apps support sponsorship and exhibitor features, enabling businesses to showcase their products or services directly within the app.

    Read more →