AI Chatbot Questionnaire

AI Chatbot Questionnaire — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Friendica

    Friendica

    Friendica (formerly Friendika, originally Mistpark) is a free and open-source software distributed social network. It forms one part of the Fediverse, an interconnected and decentralized network of independently operated servers. == Features == Friendica users can connect with others via their own Friendica server, but may also fully integrate contacts from other platforms including Diaspora, Pump.io, GNU social, email, Discourse and more recently ActivityPub (including Mastodon, Pleroma and Pixelfed) and Bluesky into their 'newsfeed'. In addition to these two way connections, users can also use Friendica as a publishing platform to post content to WordPress, Tumblr, Insanejournal and Libertree. Posting to Google+ was also supported until that service was shut down. In addition, RSS feeds can be ingested. Because users are distributed across many servers, their "addresses" consist of a username, the "@" symbol, and the domain name of the Friendica instance in the same manner email addresses are formed. Twitter support was available but was deprecated due to API changes under Elon Musk's leadership rendering it unusable. Most of the functionality from major microblogging and social networking platforms are available in Friendica; for example, tagging users and groups via "@ mentions"; direct messages; hashtags; photo albums; "likes"; "dislikes"; comments; and re-shares of publicly visible posts. Published items can be edited and updated across the network. Comprehensive settings for privacy and the public visibility of posts allow users to regulate who can read which contributions, or see specific information about the user. Users can also create multiple profiles, allowing different groups of people (such as friends, or work mates) to see a different profile entirely when viewing the same page. User accounts can be downloaded or deleted, and can be imported to a different Friendica server if so required. Public forums can be created under different accounts, which can be switched between if the accounts are registered with the same email address. == Development == There is no corporation behind Friendica. The developers work on a voluntary basis and the project is run informally; the platform itself is used for the communication between the developers. There are different forums within Friendica, such as "Friendica Developers" and "Friendica Support". The source code of Friendica is hosted on GitHub. == Installation == The developers aim to make installation of the software as simple as possible for technical laymen. They argue that decentralization on small servers is a key condition for the freedom of users and their self-determination. The difficulty level is similar to an installation of WordPress. However, the installing on shared hosting is sometimes difficult because of missing PHP5 modules. Some volunteers also run public servers so that newcomers can also avoid the installation of their own software. == List of clients == Friendica implements multiple client-server API variants simultaneously. Along with endpoints needed to use enhanced Friendica features, it also implements the API used by GNU social, Twitter and since version 2021.06 also the one used by Mastodon. As a result, most GNU social and Mastodon clients can be used for Friendica. Examples of Friendica compatible clients include: Raccoon for Friendica, Friendiqa, Fedilab, AndStatus, Twidere and DiCa for Android, friendly for Sailfish OS, friclicli (CLI client), choqok and Friendiqa for Linux and Friendica Mobile for Windows 10. == Reception == Friendica was cited in January 2012 by Infoshop News as an "alternative to Google+ and Facebook" to be used on the Occupy Nigeria movement. In January 2012 Free Software Foundation Europe's blog cited Friendica as a reasonable alternative to centralized and controlled social networks such as Facebook or Google+. Biblical Notes writer J. Randal Matheny described Friendica in January 2012 as "One social networking option flying under the radar until recently deserves consideration as an already stable platform with a wide range of options, applications, plug-ins, and possibilities for opening up the Internet." In February 2012, the German computer magazine c't wrote: "Friendica demonstrates how decentralized social networks can become widely accepted." Another German publication, the professional magazine t3n listed Friendica as a Facebook rival in an online article in March 2012 about Facebook alternatives. It compared Friendica with similar social networks like Diaspora and identi.ca. MSN Tech & Gadgets contributor Emma Boyes wrote about Friendica in May 2012: "why you'll love it: you can use it to access all the other social networks and get recommendations of new friends and groups to join. Friendica is open source and decentralised. There's no corporation behind it and there are extensive privacy settings. You can choose from a variety of user interfaces and it boasts some cool features—for instance, being able to key in a list of your interests and use the 'profile match' feature to recommend other users who share them with you. A word of warning, though, the site is not as user-friendly as the others on this list, so it may be this one is one for the geeks." == Later reviews == Acquisition of Twitter by Elon Musk had revitalized public interest in Fediverse technologies in April 2022. Friendica received favorable reviews, with a PCMag article describing it as "mostly comparable to Facebook", drawing a parallel to Google+ and highlighting using it "for planning events, and its multiple profile feature means you can show a different face to your friends, coworkers, and family". The September 2022 issue of Linux Magazine contains a detailed comparison and walk-through of registering to and using basic functions of Diaspora, Friendica and Mastodon. They describe Friendica as "intuitive" and highlight the "huge choice of account settings" and that "Friendica does not require any specific hardware, so you can use an old computer system as a server." == Vulnerabilities == In September 2020, a hotfix was released to patch a security vulnerability that could leak sensitive information from the server environment since versions released in April 2019 (develop branch) and June 2019 (stable).

    Read more →
  • Direct voice input

    Direct voice input

    Direct voice input (DVI), sometimes called voice input control (VIC), is a style of human–machine interaction "HMI" in which the user makes voice commands to issue instructions to the machine through speech recognition. In the field of military aviation, DVI has been introduced into the cockpits of several modern military aircraft, such as the Eurofighter Typhoon, the Lockheed Martin F-35 Lightning II, the Dassault Rafale, the KF-21 Boramae and the Saab JAS 39 Gripen. Such systems have also been used for various other purposes, including industry control systems and speech recognition assistance for impaired individuals. == Overview == DVI systems can be divided into two major categories of functionality: "user-dependent" or "user-independent". A user-dependent system requires that a personal voice template to be generated for a specific person; the template for this individual has to be loaded onto their assigned machine prior to use of the DVI system for it to function properly. In contrast, a user-independent system does not require any personal voice template, being intended to respond correctly to the voice of any user. They can also be categorised between "discrete recognition" and "continuous recognition". Users of a discrete recognition system must pause between each word so that the DVI system can identify the separations between each word, while a continuous speech recognition system is capable of understanding a normal rate of speech. During the mid-2000s, researchers at the National Aerospace Laboratory in the Netherlands examined the use of DVI in the "GRACE" simulator; a total of twelve pilots participated in the ensuing experiment. The tests performed reportedly revealed that, while the hardware itself functioned well, several improvements were desirable prior to real-world deployment on aircraft since DVI operations actually consumed more time in comparison to traditional existing methods. Recommendations for improvements included the adoption of simpler syntax, the achievement of a greater recognition rate, and a decrease in response times; all of the issues encountered were determined to be of a technological nature, and were deemed feasible to resolve. The researchers concluded that in cockpits, especially during emergencies where pilots have to operate entirely on their own, a DVI system could be highly relevant, but that it was not of crucial importance during most other conceivable scenarios. Around the same time, evaluations of DVI systems for civil aviation purposes were conducted within the framework of Project SafeSound, coordinated by the European Union. It involved the observation of pilot workloads in real-world cockpits and contrasting them against pilot activity in flight simulators using both conventional systems and DVI assistance. The project aimed to enhance aviation safety and to decrease the workload in both ground and flight operations via the application of enhanced audio functions. == Applications == === Aviation === Prior to its widespread deployment, a handful of conventional military aircraft were converted to trial DVI systems; examples include the Harrier AV-8B and F-16 VISTA. In another case, a General Dynamics F-16 Fighting Falcon simulator was modified with DVI for a voice control study that was undertaken by the Royal Netherlands Air Force. DVI trials have also been conducted on helicopters, including the Boeing AH-64 Apache, showing the potential to improve flight safety and mission effectiveness. Numerous modern fighter aircraft have been outfitted with DVI systems, often in combination with various other man-machine interface schemes, such as HOTAS-compliant controls and other advanced control technologies. The combination of Voice and HOTAS control schemes has sometimes been referred to as the "V-TAS" concept. A prominent fighter aircraft to be furnished with a V-TAS cockpit is the Eurofighter Typhoon. The Lockheed Martin F-35 Lightning II also features a DVI system, which was developed by Adacel. Other examples includes the Dassault Rafale and the Saab JAS 39 Gripen. Numerous aircraft have been planned to use DVI. At one stage, the United States Air Force had sought to integrate DVI upon the Lockheed Martin F-22 Raptor; however, the technology was eventually judged to pose too many technical risks at that point in time, and thus such efforts were abandoned. === Personal === By 1990, working prototypes of speech recognition systems were being demonstrated; these were being promoted for the purpose of providing an effective man-machine interface for individuals with impaired speech. Techniques employed included time-encoded digital speech and automatic token set selection. Investigations of these early DVI systems reportedly included the use of automatic diagnostic routines and limited-scale trials using volunteers. During the 2010s, various companies were offering voice recognition systems to the general public in the form of personal digital assistants. One example is the Google Voice service, which allows users to pose questions via a DVI package installed on either a personal computer, tablet, or mobile phone. Numerous digital assistants have been developed, such as Amazon Echo, Siri, and Cortana, that use DVI to interact with users. === Commercial === DVI technology has enabled automated telephone systems to be widely deployed. Many companies commonly use centralised phone systems that route callers to the correct department via such methods. Various car manufacturers have also furnished their road vehicles with DVI systems; these typically allow drivers to control infotainment systems and interact with mobile phones with more convenience than legacy methods. During the late 1980s, investigations into the use of DVI systems for controlling CNC machines and other manufacturing apparatus were underway. During the 2010s, such systems were being used for logistics and warehouse management purposes.

    Read more →
  • Digital Darkroom

    Digital Darkroom

    Digital Darkroom was a graphics program for editing gray-scale photos, published by Silicon Beach Software for the Macintosh in 1987. It was programmed by Ed Bomke and Don Cone. Digital Darkroom was the first Macintosh program to incorporate a plug-in architecture. Silicon Beach and Ed Bomke are credited with having coined the term "plug-in". Another innovation of Digital Darkroom was the Magic Wand tool, which also appeared later in Photoshop. When Silicon Beach Software was acquired by Aldus Corporation, Digital Darkroom continued to be published by the Aldus Consumer Division, but was never updated to include color. The trademark "Digital Darkroom" was acquired by MicroFrontier in 1997 and used for a completely new image-editing program that does work with color. The software was acquired by Digimage Arts in 2002 and was sold for both Windows and Mac systems.

    Read more →
  • GamePigeon

    GamePigeon

    GamePigeon is a mobile app for iOS devices, developed by Vitalii Zlotskii and released on September 13, 2016. The game takes advantage of the iOS 10 update, which expanded how users could interact with Apple's Messages app. GamePigeon is only available through the Messages app, which allows players to start and respond to different party games in conversations. == Release == The app was first released on September 13, 2016, coinciding with the launch of iOS 10. The app was released for free, although it includes in-app purchases to unlock additional items, such as cosmetic skins, avatar items, new game modes, and an option to remove ads. == Games in the app == The following is a list of games that users can play within GamePigeon: Sources: Poker was one of the games included in GamePigeon at launch, although it has since been removed and is no longer listed on the game's App Store description. == Reception == GamePigeon has enjoyed commercial success, with VentureBeat noting that GamePigeon was ranked number-one in the "Top Free" category of the iMessage App Store, six months after its release. Critically, GamePigeon has been generally well received, being highlighted by online media publications early on shortly after the iOS 10 launch. It has since been included on many "best iMessage apps" lists. Based on over 162,000 ratings, the game holds a 4.0 out of 5 rating on the App Store. Julian Chokkattu of Digital Trends wrote "GamePigeon should be like the pre-installed versions of Solitaire and Minesweeper that used to come with older iterations of Windows." On its launch day, Boy Genius Report included it on a list of "10 of the best iMessage apps, games and stickers for iOS 10 on launch day." The Daily Dot wrote, "GamePigeon is easily the best current gaming option within iMessages." 8-ball and cup pong have been particularly well received by media outlets. The Daily Dot had specific praise for the app's billiards game: "8-Ball controls shockingly smoothly with your fingers, and there’s nothing quite like destroying a dear friend in poker." During his 2020 U.S. presidential campaign, Cory Booker was cited as playing the game with his family. In 2017, CNBC cited one teenager who expressed that GamePigeon was one of just a few reasons that those in her age range use the iMessage app. The game has received particular positive reception for allowing introverted individuals to exercise a form social activity; similarly, the game was highlighted as a way to maintain social distancing guidelines during the COVID-19 pandemic. As an April Fools' Day joke in 2020, The Chronicle, a Duke University newspaper, published that Duke's athletic program adopted GamePigeon's Cup Pong as an official varsity sport.

    Read more →
  • 17LIVE

    17LIVE

    17LIVE is an international entertainment platform. As of 2024, 17LIVE is the #3 live broadcasting platform globally, formed by its flagship live stream app 17LIVE (LIVIT in English markets), MEME Live and live stream e-commerce platforms HandsUP and OrderPally. == History == 17LIVE was first founded in Taiwan in 2015 by Jeffery Huang. The company has maintained its leading position since its entry into the Japan market in 2017, becoming the biggest platform for live entertainment in Japan, Taiwan, Hong Kong, and other countries. In 2017, 17 closed out US$33M in series B round to merge with dating software Paktor, with Joseph Phua (Co-founder of Paktor) taking over the leadership of 17LIVE as CEO and Co-founder, as well as to enter the Japan and Hong Kong market. Within one year, 17 Media became the #1 market leader in Japan. In 2018, the company raised $25M in series C round as it got ready for US IPO, which failed to materialize. 17LIVE had an unsuccessful US IPO attempt in 2018. Since then, the company reformed and transformed the business. Some key initiatives include the hiring of current CEO Hirofumi Ono, spin-off of Paktor (dating software business unit), full buy-out of founder Jeffery Huang, acquisition of MEME and HandsUp, and more. Despite the failed IPO attempt, the company continued to push for international expansion, including creating ‘LIVIT’ for the English-speaking markets to enter US, India, and North Africa. In 2019, 17's flagship live streaming app reached 10M downloads in Japan, and the business continues to push for both organic and inorganic expansion. Some key M&A highlights in the year include the acquisition of MEME Live in Southeast Asia, as well as HandsUp, a live e-commerce platform. In 2020, M17 closed out $26.5M in Series D round to continue organic growth in Japan, US and Middle East. In the same year, the company also sold its dating app business, Parktor, to rationalise M17 into a live-stream pure play business, followed by the appointment of its current Chairman, Joseph Phua, and previous Global CEO, Hirofumi Ono. With the buy-out and departure of founder Jeff Huang, the parent holding company M17 Entertainment Limited was officially renamed as 17 LIVE Group. An estimated 60 million users registered in 154 countries and territories in April 2022. In 2022, September, 17LIVE announced Group CEO Hirofumi Ono steps down. Alex Lien takes over the leadership as new Group COO; Jing Shen Ng appointed Group CTO. In 2023, March, 17LIVE announced Alex Lien promoted to Global CEO. Kenta Masuda appointed as Global CFO. === Collaboration with Ayumi Hamasaki === To celebrate its 4th anniversary, 17LIVE collaborated with Japanese singer-songwriter Ayumi Hamasaki, who led the 17LIVE 4th Anniversary meets Ayumi Hamasaki series starting October 18, 2021. Along with composer and arranger Yuta Nakano, Hamasaki judged auditioning artists competing for the chance to work with her and her production team for a debut single. The series was streamed live on the 17LIVE website, the final airing on November 11. The eventual winner was named as Yoshitaka_song. When asked why she collaborated with 17LIVE as a producer, Hamasaki commented: "Although the world has become like this (during COVID-19), I believe that the art of entertainment can give people dreams, hope, courage, and strength. I hope that kind of light will continue to shine through the entertainment industry." == Features == On 17LIVE, artists (LIVERs) are able to broadcast live, and post photos and videos from their album. The app has been designed for LIVERs to simply open the App, and start sharing contents without the need to edit or professionally curate their videos. The platform cultivates LIVERs, supports them with a local content management team, and provides artists with various functions, such as real time chatting, gifting, fan clubs, interactive competition and events. Today, 17LIVE has 46 thousands contracted artists and more than 2.3 million MAU, who spend 44 minutes on the platform every day. 17LIVE continues to advocate content-driven philosophy and delivers diverse topics, from politics and music to entertainment, to broaden its audience groups. 17LIVE also hosts offline flash events and concerts to attract new users and support LIVERs better connect with their fans. == Operation == 17LIVE has over 700 employees globally. The app provides few monetization models for LIVERs on the platform, including: Gifting: user / fans buy virtual gifts on the app to send to their favored LIVERs. Subscription: monthly subscription fan club service for access to exclusive content Pay-per-view: ticket service for online streaming concerts E-commerce: live e-commerce platform In the past, 17LIVE has encountered some regulatory headwinds with reported incidents of inappropriate livestream content on the platform. The incidents were direct results of the lack of oversight and supervision capability in place in the business at the time. Over the years, 17LIVE claims to have put in tremendous manpower and effort into improving, monitoring and maintaining control over both the live stream content and the KYC procedures and systems.

    Read more →
  • Transcription software

    Transcription software

    Transcription software assists in the conversion of human speech into a text transcript. Audio or video files can be transcribed manually or automatically. Transcriptionists can replay a recording several times in a transcription editor and type what they hear. By using transcription hot keys, the manual transcription can be accelerated, the sound filtered, equalized or have the tempo adjusted when the clarity is not great. With speech recognition technology, transcriptionists can automatically convert recordings to text transcripts by opening recordings in a PC and uploading them to a cloud for automatic transcription, or transcribe recordings in real-time by using digital dictation. Depending on quality of recordings, machine generated transcripts may still need to be manually verified. The accuracy rate of the automatic transcription depends on several factors such as background noises, speakers' distance to the microphone, and accents. Transcription software, as with transcription services, is often used for business, legal, or medical purposes. Compared with audio content, a text transcript is searchable, takes up less computer memory, and can be used as an alternate method of communication, such as for subtitles and closed captions. Some clinical environments also use digital tools to support transcription workflows, including ambient documentation systems that employ Speech recognition to capture portions of clinical encounters and generate draft notes for later review. These tools are typically used alongside conventional transcription methods. The definition of transcription "software", as compared with transcription "service", is that the former is sufficiently automated that a user can run the entire system without engaging outside personnel. New software-as-a-service and cloud computing models use artificial intelligence, machine learning and natural language processing to convert speech to text and continuously learn new phrases and accents. AI transcription can, however, lead to hallucinations and other errors. == Development == Research at Google released a free android app Google Live Transcribe, it runs on Google Cloud. Google Chrome developed and has an available built in English Live Caption. Google Docs, Google Translate, Google Assistant, GBoard Google Text to Speech engine support transcription tool too. OpenAI launched Whisper, an open-source speech recognition deep learning model in September 2022. In 2024, an AI-powered transcription platform, Transkriptor, was launched, enabling the automatic conversion of audio and video recordings into text using speech recognition technology, with support for transcription in 100 languages and processing of content uploaded via a web interface as well as mobile and browser extensions. It is part of the Tor.app suite of AI-based language processing tools.

    Read more →
  • Motor theory of speech perception

    Motor theory of speech perception

    The motor theory of speech perception is the hypothesis that people perceive spoken words by identifying the vocal tract gestures with which they are pronounced rather than by identifying the sound patterns that speech generates. It originally claimed that speech perception is done through a specialized module that is innate and human-specific. Though the idea of a module has been qualified in more recent versions of the theory, the idea remains that the role of the speech motor system is not only to produce speech articulations but also to detect them. The hypothesis has gained more interest outside the field of speech perception than inside. This has increased particularly since the discovery of mirror neurons that link the production and perception of motor movements, including those made by the vocal tract. The theory was initially proposed in the Haskins Laboratories in the 1950s by Alvin Liberman and Franklin S. Cooper, and developed further by Donald Shankweiler, Michael Studdert-Kennedy, Ignatius Mattingly, Carol Fowler and Douglas Whalen. == Origins and development == The hypothesis has its origins in research using pattern playback to create reading machines for the blind that would substitute sounds for orthographic letters. This led to a close examination of how spoken sounds correspond to the acoustic spectrogram of them as a sequence of auditory sounds. This found that successive consonants and vowels overlap in time with one another (a phenomenon known as coarticulation). This suggested that speech is not heard like an acoustic "alphabet" or "cipher," but as a "code" of overlapping speech gestures. === Associationist approach === Initially, the theory was associationist: infants mimic the speech they hear and that this leads to behavioristic associations between articulation and its sensory consequences. Later, this overt mimicry would be short-circuited and become speech perception. This aspect of the theory was dropped, however, with the discovery that prelinguistic infants could already detect most of the phonetic contrasts used to separate different speech sounds. === Cognitivist approach === The behavioristic approach was replaced by a cognitivist one in which there was a speech module. The module detected speech in terms of hidden distal objects rather than at the proximal or immediate level of their input. The evidence for this was the research finding that speech processing was special such as duplex perception. === Changing distal objects === Initially, speech perception was assumed to link to speech objects that were both the invariant movements of speech articulators the invariant motor commands sent to muscles to move the vocal tract articulators This was later revised to include the phonetic gestures rather than motor commands, and then the gestures intended by the speaker at a prevocal, linguistic level, rather than actual movements. === Modern revision === The "speech is special" claim has been dropped, as it was found that speech perception could occur for nonspeech sounds (for example, slamming doors for duplex perception). === Mirror neurons === The discovery of mirror neurons has led to renewed interest in the motor theory of speech perception, and the theory still has its advocates, although there are also critics. == Support == === Nonauditory gesture information === If speech is identified in terms of how it is physically made, then nonauditory information should be incorporated into speech percepts even if it is still subjectively heard as "sounds". This is, in fact, the case. The McGurk effect shows that seeing the production of a spoken syllable that differs from an auditory cue synchronized with it affects the perception of the auditory one. In other words, if someone hears "ba" but sees a video of someone pronouncing "ga", what they hear is different—some people believe they hear "da". People find it easier to hear speech in noise if they can see the speaker. People can hear syllables better when their production can be felt haptically. === Categorical perception === Using a speech synthesizer, speech sounds can be varied in place of articulation along a continuum from /bɑ/ to /dɑ/ to /ɡɑ/, or in voice onset time on a continuum from /dɑ/ to /tɑ/ (for example). When listeners are asked to discriminate between two different sounds, they perceive sounds as belonging to discrete categories, even though the sounds vary continuously. In other words, 10 sounds (with the sound on one extreme being /dɑ/ and the sound on the other extreme being /tɑ/, and the ones in the middle varying on a scale) may all be acoustically different from one another, but the listener will hear all of them as either /dɑ/ or /tɑ/. Likewise, the English consonant /d/ may vary in its acoustic details across different phonetic contexts (the /d/ in /du/ does not technically sound the same as the one in /di/, for example), but all /d/'s as perceived by a listener fall within one category (voiced alveolar plosive) and that is because "linguistic representations are abstract, canonical, phonetic segments or the gestures that underlie these segments." This suggests that humans identify speech using categorical perception, and thus that a specialized module, such as that proposed by the motor theory of speech perception, may be on the right track. === Speech imitation === If people can hear the gestures in speech, then the imitation of speech should be very fast, as in when words are repeated that are heard in headphones as in speech shadowing. People can repeat heard syllables more quickly than they would be able to produce them normally. === Speech production === Hearing speech activates vocal tract muscles, and the motor cortex and premotor cortex. The integration of auditory and visual input in speech perception also involves such areas. Disrupting the premotor cortex disrupts the perception of speech units such as plosives. The activation of the motor areas occurs in terms of the phonemic features which link with the vocal track articulators that create speech gestures. The perception of a speech sound is aided by pre-emptively stimulating the motor representation of the articulators responsible for its pronunciation . Auditory and motor cortical coupling is restricted to a specific range of neuronal firing frequency. === Perception-action meshing === Evidence exists that perception and production are generally coupled in the motor system. This is supported by the existence of mirror neurons that are activated both by seeing (or hearing) an action and when that action is carried out. Another source of evidence is that for common coding theory between the representations used for perception and action. == Criticisms == The motor theory of speech perception is not widely held in the field of speech perception, though it is more popular in other fields, such as theoretical linguistics. As three of its advocates have noted, "it has few proponents within the field of speech perception, and many authors cite it primarily to offer critical commentary".p. 361 Several critiques of it exist. === Multiple sources === Speech perception is affected by nonproduction sources of information, such as context. Individual words are hard to understand in isolation but easy when heard in sentence context. It therefore seems that speech perception uses multiple sources that are integrated together in an optimal way. === Production === The motor theory of speech perception would predict that speech motor abilities in infants predict their speech perception abilities, but in actuality it is the other way around. It would also predict that defects in speech production would impair speech perception, but they do not. However, this only affects the first and already superseded behaviorist version of the theory, where infants were supposed to learn all production-perception patterns by imitation early in childhood. This is no longer the mainstream view of motor-speech theorists. === Speech module === Several sources of evidence for a specialized speech module have failed to be supported. Duplex perception can be observed with door slams. The McGurk effect can also be achieved with nonlinguistic stimuli, such as showing someone a video of a basketball bouncing but playing the sound of a ping-pong ball bouncing. As for categorical perception, listeners can be sensitive to acoustic differences within single phonetic categories. As a result, this part of the theory has been dropped by some researchers. === Sublexical tasks === The evidence provided for the motor theory of speech perception is limited to tasks such as syllable discrimination that use speech units not full spoken words or spoken sentences. As a result, "speech perception is sometimes interpreted as referring to the perception of speech at the sublexical level. However, the ultimate goal of these studies is presumably to understand the neural processes supporting the ability to process spee

    Read more →
  • Vegas Pro

    Vegas Pro

    Vegas Pro (formerly known as Sony Vegas) is a professional video editing software package for non-linear editing (NLE), designed to run on the Microsoft Windows operating system. The first release of Vegas Beta was on June 11, 1999. Vegas was originally developed as a non-linear audio editing application. Version 2.0 would split the program into audio and video editing variants, with the former being dropped by version 4.0, making the video offering the only variant available to consumers. Vegas Pro features real-time multi-track video and audio editing on unlimited tracks, resolution-independent video sequencing, complex effects, compositing tools, 24-bit/192 kHz audio support, VST and DirectX plug-in effect support, and Dolby Digital surround sound mixing. The software was originally published by Sonic Foundry until May 2003, when Sony purchased Sonic Foundry and formed Sony Creative Software. On May 24, 2016, Sony announced that Vegas was sold to MAGIX, which formed VEGAS Creative Software, to continue support and development of the software. As of the end of March 2026, it was publicly announced that Boris FX had taken ownership of Vegas Pro. Each release of Vegas is sold standalone; however, upgrade discounts are sometimes provided. == Features == Vegas does not require any specialized hardware to run properly, allowing it to operate on any Windows computer that meets the system requirements. == History == Vegas 1.0 was released after a brief public beta by Sonic Foundry on July 23, 1999 at the NAMM Show in Nashville, Tennessee as an audio-only tool with a particular focus on re-scaling and resampling audio. It supported formats like DivX and Real Networks RealSystem G2 file formats. Martin Walker from Sound on Sound described working in Vegas 1.0 as a "very pleasurable experience, especially since so many functions are highly intuitive" though also criticizing some features as hard to figure out due to the lack of a central help file. Later, on June 12, 2000, Vegas Video and Audio 2.0 (also referred to as just Vegas 2.0) was released, with its beta releasing earlier that year on April 10. This was the first version of Vegas to include video-editing tools and was also the first to have a low-cost "LE" version alongside the regular release. The LE releases would continue through version 3.0 of Vegas but would be discontinued by the release of Vegas 4.0. Vegas 3.0 was released the next year on December 3, and added new video effects, features for ease-of-use with DV, and support for editing Windows Media files. Vegas 4.0 was released on 6 February 2003 and added application scripting, advanced color correction, 5.1 surround sound mixing, and Steinberg ASIO support. This was the last release under the Sonic Foundry name after it sold much of its software suite, including Sound Forge and Acid Pro, to Sony Pictures Digital for $18 million later in 2003. Under Sony's ownership, Vegas 5.0 was released on April 19, 2004, bringing 3D track motion, compositing, reversing, envelope automation, etc. 7.0 also added an improved video preview, enhanced layout management, improved snapping, and more customization. With the release of 8.0, Sony opted to go back to the original "Vegas Pro" branding that the first version released with. It added the ability to burn Blu-ray and DVD optical media, support for 32-bit floating point audio, support for tempo-based audio effects, and more. It also moved the timeline to the bottom of the window by default with the option of moving it back to the top if the user wished to. Sony was also experimenting with 64-bit at this time and ported Vegas Pro 8.0 to 64-bit systems under the name "Vegas Pro 8.1". Vegas Pro 9.0 added support for 4K resolution and pro camcorder formats like Red and XDCAM EX. In 2009, Sony Creative Software purchased the Velvetmatter Radiance suite of video FX plug-ins which were included in Sony Vegas Pro 9.0. As a result, they were no longer available as a separate product from Velvetmatter. Vegas Pro 10 was released in 2010 with stereoscopic 3D editing, image stabilization, OpenFX plugin support, real-time audio event effects, and a few UI changes. This was the last release to include support for Windows XP. Vegas Pro 11 was released the next year on 17 October, with GPGPU video acceleration, enhanced text tools, enhanced stereoscopic/3D features, RAW photo support, and new event synchronization mechanisms. In addition, Vegas Pro 11 comes pre-loaded with "NewBlue" Titler Pro, a 2D and 3D titling plug-in. Vegas Pro 12 would add two new configurations: Vegas Pro 12 Edit, for "Professional Video and Audio Production"; and Vegas Pro 12 Suite, for "Professional Editing, Disc Authoring, and Visual Effects Design". Vegas Pro 13 would be the last version released with Sony branding after the acquisition of much of Sony Creative Software's library by Magix. After they acquired Vegas, Magix released version 14 on September 20, 2016. It featured advanced 4K upscaling as well as many bug fixes, a higher video velocity limit, RED camera support, and a variety of other features. This was also the last version to have the light theme enabled by default. Released on August 28, 2017, Vegas Pro 15 features major UI changes that claim to bring usability improvements and customization. It was the first version of VEGAS Pro to have a dark theme; it also allows more efficient editing speeds, including adding new shortcuts to speed the video editing process. Vegas Pro 15 includes support for Intel Quick Sync Video (QSV) and other technologies, as well as various other features. It introduced a new VEGAS Pro icon as a V. Vegas Pro 16 has some new features including file backup, motion tracking, improved video stabilization, 360° editing and HDR support. Magix has continued to improve Vegas through version 21 with support for reading Matroska files, a more detailed render dialogue, live streaming, VST3 support, a VST 32-bit bridge, and a selective Paste Event Attributes menu. Magix would later release a subscription model for using Vegas named "Vegas Pro 365" on January 17, 2018, although the perpetual licence is still an option for customers. This version includes cloud-based speech synthesis among other features not included in the mainline Vegas release. == Version history == Each release of Vegas is sold standalone, however upgrade discounts are sometimes provided. === Vegas Beta === Sonic Foundry introduced a sneak preview version of Vegas Pro on June 11, 1999. It is called a "Multitrack Media Editing System". === Vegas 1.0 === Released on July 23, 1999 at the NAMM Show in Nashville, Tennessee, Vegas was an audio-only tool with a particular focus on rescaling and resampling audio. It supported formats like DivX and Real Networks RealSystem G2 file formats. Version 1.0 is the final Vegas release to include Windows 95 support. === Vegas Video beta (Vegas 2.0 beta) === Released on April 10, 2000, this was the first version of Vegas to include video-editing tools. === Vegas Video (Vegas 2.0) === Released on June 12, 2000. Version 2.0 is the final Vegas Video release to include Windows NT 4.0 support. === Vegas Video 3.0 === Released on December 3, 2001. This release added: New Video Effects – Lens Flare, Light Rays, Film FX, Color Curves, Mirror, Remap, Deform, Convolution, Linear Blur, Black Restore, Levels, Unsharp Mask, Color Grading, and Timecode Burn filter. Batch Capture with Automatic Scene Detection – Captures DV with automatic scene detection, batch capture, tape logging, still image capture and thumbnail previews. Red Book Audio CD Mastering with CD Architect (TM) Technology – Used for burning Red Book audio CD masters directly from the Vegas timeline with ISRC, UPC, and PQ list support. New Sonic Foundry DV Codec – Introduces a DV codec developed by Sonic Foundry that offers artifact-free compositing and DV chromakeying. DV Print-to-Tape from the Timeline – Prints projects to DV cameras and decks from the Vegas timeline. Windows Media (TM) File Editing – Creates and edits Windows Media (TM) files. New MPEG Encoding Tools – Used for producing MPEG-2 files for DVD productions. Dynamic RAM Previewing – Temporary RAM/render-free previews for analysis and tweaking of complex video FX without rendering. VideoCD and Data CD Burning – Burning projects directly to VideoCD for playback on most DVD players or data CDs for playback computers' CD-ROMs. === Vegas 4.0 === Released on February 6, 2003. This release added: Advanced Color Correction Tools Searchable Media Pool Bins Vectorscope, Histogram, Parade and Waveform Monitoring Application Scripting Improved Ripple Editing Motion Blur and Super-Sampling Envelopes 5.1 Surround Mixing Dolby® Digital AC-3 Encoding certified and tested by Dolby Laboratories DirectX® Audio Plug-In Effects Automation ASIO Driver Support Windows Media™ 9 Support, including Surround Encoding DVD Authoring with AC-3 File Import Capabilities Integration with DVD Architect via Chap

    Read more →
  • Pixel-art scaling algorithms

    Pixel-art scaling algorithms

    Pixel art scaling algorithms are graphical filters that attempt to enhance the appearance of hand-drawn 2D pixel art graphics. These algorithms are a form of automatic image enhancement. Pixel art scaling algorithms employ methods significantly different than the common methods of image rescaling, which have the goal of preserving the appearance of images. As pixel art graphics are commonly used at very low resolutions, they employ careful coloring of individual pixels. This results in graphics that rely on a high amount of stylized visual cues to define complex shapes. Several specialized algorithms have been developed to handle re-scaling of such graphics. These specialized algorithms can improve the appearance of pixel-art graphics, but in doing so they introduce changes. Such changes may be undesirable, especially if the goal is to faithfully reproduce the original appearance. Since a typical application of this technology is improving the appearance of fourth-generation and earlier video games on arcade and console emulators, many pixel art scaling algorithms are designed to run in real-time for sufficiently small input images at 60-frames per second. This places constraints on the type of programming techniques that can be used for this sort of real-time processing. Many work only on specific scale factors. 2× is the most common scale factor, while 3×, 4×, 5×, and 6× exist but are less used. == Algorithms == === SAA5050 'Diagonal Smoothing' === The Mullard SAA5050 Teletext character generator chip (1980) used a primitive pixel scaling algorithm to generate higher-resolution characters on the screen from a lower-resolution representation from its internal ROM. Internally, each character shape was defined on a 5 × 9 pixel grid, which was then interpolated by smoothing diagonals to give a 10 × 18 pixel character, with a characteristically angular shape, surrounded to the top and the left by two pixels of blank space. The algorithm only works on monochrome source data, and assumes the source pixels will be logically true or false depending on whether they are 'on' or 'off'. Pixels 'outside the grid pattern' are assumed to be off. The algorithm works as follows: A B C --\ 1 2 D E F --/ 3 4 1 = B | (A & E & !B & !D) 2 = B | (C & E & !B & !F) 3 = E | (!A & !E & B & D) 4 = E | (!C & !E & B & F) Note that this algorithm, like the Eagle algorithm below, has a flaw: If a pattern of 4 pixels in a hollow diamond shape appears, the hollow will be obliterated by the expansion. The SAA5050's internal character ROM carefully avoids ever using this pattern. The degenerate case: becomes: === EPX/Scale2×/AdvMAME2× === Eric's Pixel Expansion (EPX) is an algorithm developed by Eric Johnston at LucasArts around 1992, when porting the SCUMM engine games from the IBM PC (which ran at 320 × 200 × 256 colors) to the early color Macintosh computers, which ran at more or less double that resolution. The algorithm works as follows, expanding P into 4 new pixels based on P's surroundings: 1=P; 2=P; 3=P; 4=P; IF C==A => 1=A IF A==B => 2=B IF D==C => 3=C IF B==D => 4=D IF of A, B, C, D, three or more are identical: 1=2=3=4=P Later implementations of this same algorithm (as AdvMAME2× and Scale2×, developed around 2001) are slightly more efficient but functionally identical: 1=P; 2=P; 3=P; 4=P; IF C==A AND C!=D AND A!=B => 1=A IF A==B AND A!=C AND B!=D => 2=B IF D==C AND D!=B AND C!=A => 3=C IF B==D AND B!=A AND D!=C => 4=D AdvMAME2× is available in DOSBox via the scaler=advmame2x dosbox.conf option. The AdvMAME4×/Scale4× algorithm is just EPX applied twice to get 4× resolution. ==== Scale3×/AdvMAME3× and ScaleFX ==== The AdvMAME3×/Scale3× algorithm (available in DOSBox via the scaler=advmame3x dosbox.conf option) can be thought of as a generalization of EPX to the 3× case. The corner pixels are calculated identically to EPX. 1=E; 2=E; 3=E; 4=E; 5=E; 6=E; 7=E; 8=E; 9=E; IF D==B AND D!=H AND B!=F => 1=D IF (D==B AND D!=H AND B!=F AND E!=C) OR (B==F AND B!=D AND F!=H AND E!=A) => 2=B IF B==F AND B!=D AND F!=H => 3=F IF (H==D AND H!=F AND D!=B AND E!=A) OR (D==B AND D!=H AND B!=F AND E!=G) => 4=D 5=E IF (B==F AND B!=D AND F!=H AND E!=I) OR (F==H AND F!=B AND H!=D AND E!=C) => 6=F IF H==D AND H!=F AND D!=B => 7=D IF (F==H AND F!=B AND H!=D AND E!=G) OR (H==D AND H!=F AND D!=B AND E!=I) => 8=H IF F==H AND F!=B AND H!=D => 9=F There is also a variant improved over Scale3× called ScaleFX, developed by Sp00kyFox, and a version combined with Reverse-AA called ScaleFX-Hybrid. === Eagle === Eagle works as follows: for every in pixel, we will generate 4 out pixels. First, set all 4 to the color of the pixel we are currently scaling (as nearest-neighbor). Next look at the three pixels above, to the left, and diagonally above left: if all three are the same color as each other, set the top left pixel of our output square to that color in preference to the nearest-neighbor color. Work similarly for all four pixels, and then move to the next one. Assume an input matrix of 3 × 3 pixels where the centermost pixel is the pixel to be scaled, and an output matrix of 2 × 2 pixels (i.e., the scaled pixel) first: |Then . . . --\ CC |S T U --\ 1 2 . C . --/ CC |V C W --/ 3 4 . . . |X Y Z | IF V==S==T => 1=S | IF T==U==W => 2=U | IF V==X==Y => 3=X | IF W==Z==Y => 4=Z Thus if we have a single black pixel on a white background it will vanish. This is a bug in the Eagle algorithm but is solved by other algorithms such as EPX, 2xSaI, and HQ2x. === 2×SaI === 2×SaI, short for 2× Scale and Interpolation engine, was inspired by Eagle. It was designed by Derek Liauw Kie Fa, also known as Kreed, primarily for use in console and computer emulators, and it has remained fairly popular in this niche. Many of the most popular emulators, including ZSNES and VisualBoyAdvance, offer this scaling algorithm as a feature. Several slightly different versions of the scaling algorithm are available, and these are often referred to as Super 2×SaI and Super Eagle. The 2xSaI family works on a 4 × 4 matrix of pixels where the pixel marked A below is scaled: I E F J G A B K --\ W X H C D L --/ Y Z M N O P For 16-bit pixels, they use pixel masks which change based on whether the 16-bit pixel format is 565 or 555. The constants colorMask, lowPixelMask, qColorMask, qLowPixelMask, redBlueMask, and greenMask are 16-bit masks. The lower 8 bits are identical in either pixel format. Two interpolation functions are described: INTERPOLATE(uint32 A, UINT32 B). -- linear midpoint of A and B if (A == B) return A; return ( ((A & colorMask) >> 1) + ((B & colorMask) >> 1) + (A & B & lowPixelMask) ); Q_INTERPOLATE(uint32 A, uint32 B, uint32 C, uint32 D) -- bilinear interpolation; A, B, C, and D's average x = ((A & qColorMask) >> 2) + ((B & qColorMask) >> 2) + ((C & qColorMask) >> 2) + ((D & qColorMask) >> 2); y = (A & qLowPixelMask) + (B & qLowPixelMask) + (C & qLowPixelMask) + (D & qLowPixelMask); y = (y >> 2) & qLowPixelMask; return x + y; The algorithm checks A, B, C, and D for a diagonal match such that A==D and B!=C, or the other way around, or if they are both diagonals or if there is no diagonal match. Within these, it checks for three or four identical pixels. Based on these conditions, the algorithm decides whether to use one of A, B, C, or D, or an interpolation among only these four, for each output pixel. The 2xSaI arbitrary scaler can enlarge any image to any resolution and uses bilinear filtering to interpolate pixels. Since Kreed released the source code under the GNU General Public License, it is freely available to anyone wishing to utilize it in a project released under that license. Developers wishing to use it in a non-GPL project would be required to rewrite the algorithm without using any of Kreed's existing code. It is available in DOSBox via scaler=2xsai option. === hqnx family === Maxim Stepin's hq2x, hq3x, and hq4x are for scale factors of 2:1, 3:1, and 4:1 respectively. Each work by comparing the color value of each pixel to those of its eight immediate neighbors, marking the neighbors as close or distant, and using a pre-generated lookup table to find the proper proportion of input pixels' values for each of the 4, 9 or 16 corresponding output pixels. The hq3x family will perfectly smooth any diagonal line whose slope is ±0.5, ±1, or ±2 and which is not anti-aliased in the input; one with any other slope will alternate between two slopes in the output. It will also smooth very tight curves. Unlike 2xSaI, it anti-aliases the output. hqnx was initially created for the Super NES emulator ZSNES. The author of bsnes has released a space-efficient implementation of hq2x to the public domain. A port to shaders, which has comparable quality to the early versions of xBR, is available. Before the port, a shader called "scalehq" has often been confused for hqx. === xBR family === There are 6 filters in this family: xBR , xBRZ, xBR-Hybrid, Super xBR, xBR+3D and Super xBR+3D. xBR ("scale by rules"), cre

    Read more →
  • FMLLR

    FMLLR

    In signal processing, Feature space Maximum Likelihood Linear Regression (fMLLR) is a global feature transform that are typically applied in a speaker adaptive way, where fMLLR transforms acoustic features to speaker adapted features by a multiplication operation with a transformation matrix. In some literature, fMLLR is also known as the Constrained Maximum Likelihood Linear Regression (cMLLR). == Overview == fMLLR transformations are trained in a maximum likelihood sense on adaptation data. These transformations may be estimated in many ways, but only maximum likelihood (ML) estimation is considered in fMLLR. The fMLLR transformation is trained on a particular set of adaptation data, such that it maximizes the likelihood of that adaptation data given a current model-set. This technique is a widely used approach for speaker adaptation in HMM-based speech recognition. Later research also shows that fMLLR is an excellent acoustic feature for DNN/HMM hybrid speech recognition models. The advantage of fMLLR includes the following: the adaptation process can be performed within a pre-processing phase, and is independent of the ASR training and decoding process. this type of adapted feature can be applied to deep neural networks (DNN) to replace traditionally used mel-spectrogram in end-to-end speech recognition models. fMLLR's speaker adaptation process leads to a significant performance boost for ASR models, hence outperforming other transform or features like MFCCs (Mel-Frequency Cepstral Coefficients) and FBANKs (Filter bank) coefficients. fMLLR features can be efficiently realized with speech toolkits like Kaldi. Major problem and disadvantage of fMLLR: when the amount of adaptation data is limited, the transformation matrices tends to easily overfit the given data. == Computing fMLLR transform == Feature transform of fMLLR can be easily computed with the open source speech tool Kaldi, the Kaldi script uses the standard estimation scheme described in Appendix B of the original paper, in particular the section Appendix B.1 "Direct method over rows". In the Kaldi formulation, fMLLR is an affine feature transform of the form x {\displaystyle x} → A {\displaystyle A} x {\displaystyle x} + b {\displaystyle +b} , which can be written in the form x {\displaystyle x} →W x ^ {\displaystyle {\hat {x}}} , where x ^ {\displaystyle {\hat {x}}} = [ x 1 ] {\displaystyle {\begin{bmatrix}x\\1\end{bmatrix}}} is the acoustic feature x {\displaystyle x} with a 1 appended. Note that this differs from some of the literature where the 1 comes first as x ^ {\displaystyle {\hat {x}}} = [ 1 x ] {\displaystyle {\begin{bmatrix}1\\x\end{bmatrix}}} . The sufficient statistics stored are: K = ∑ t , j , m γ j , m ( t ) Σ j m − 1 μ j m x ( t ) + {\displaystyle K=\sum _{t,j,m}\gamma _{j,m}(t)\textstyle \Sigma _{jm}^{-1}\mu _{jm}x(t)^{+}\displaystyle } where Σ j m − 1 {\displaystyle \textstyle \Sigma _{jm}^{-1}\displaystyle } is the inverse co-variance matrix. And for 0 ≤ i ≤ D {\displaystyle 0\leq i\leq D} where D {\displaystyle D} is the feature dimension: G ( i ) = ∑ t , j , m γ j , m ( t ) ( 1 σ j , m 2 ( i ) ) x ( t ) + x ( t ) + T {\displaystyle G^{(i)}=\sum _{t,j,m}\gamma _{j,m}(t)\left({\frac {1}{\sigma _{j,m}^{2}(i)}}\right)x(t)^{+}x(t)^{+T}\displaystyle } For a thorough review that explains fMLLR and the commonly used estimation techniques, see the original paper "Maximum likelihood linear transformations for HMM-based speech recognition ". Note that the Kaldi script that performs the feature transforms of fMLLR differs with by using a column of the inverse in place of the cofactor row. In other words, the factor of the determinant is ignored, as it does not affect the transform result and can causes potential danger of numerical underflow or overflow. == Comparing with other features or transforms == Experiment result shows that by using the fMLLR feature in speech recognition, constant improvement is gained over other acoustic features on various commonly used benchmark datasets (TIMIT, LibriSpeech, etc). In particular, fMLLR features outperform MFCCs and FBANKs coefficients, which is mainly due to the speaker adaptation process that fMLLR performs. In, phoneme error rate (PER, %) is reported for the test set of TIMIT with various neural architectures: As expected, fMLLR features outperform MFCCs and FBANKs coefficients despite the use of different model architecture. Where MLP (multi-layer perceptron) serves as a simple baseline, on the other hand RNN, LSTM, and GRU are all well known recurrent models. The Li-GRU architecture is based on a single gate and thus saves 33% of the computations over a standard GRU model, Li-GRU thus effectively address the gradient vanishing problem of recurrent models. As a result, the best performance is obtained with the Li-GRU model on fMLLR features. == Extract fMLLR features with Kaldi == fMLLR can be extracted as reported in the s5 recipe of Kaldi. Kaldi scripts can certainly extract fMLLR features on different dataset, below are the basic example steps to extract fMLLR features from the open source speech corpora Librispeech. Note that the instructions below are for the subsets train-clean-100,train-clean-360,dev-clean, and test-clean, but they can be easily extended to support the other sets dev-other, test-other, and train-other-500. These instruction are based on the codes provided in this GitHub repository, which contains Kaldi recipes on the LibriSpeech corpora to execute the fMLLR feature extraction process, replace the files under $KALDI_ROOT/egs/librispeech/s5/ with the files in the repository. Install Kaldi. Install Kaldiio. If running on a single machine, change the following lines in $KALDI_ROOT/egs/librispeech/s5/cmd.sh to replace queue.pl to run.pl: Change the data path in run.sh to your LibriSpeech data path, the directory LibriSpeech/ should be under that path. For example: Install flac with: sudo apt-get install flac Run the Kaldi recipe run.sh for LibriSpeech at least until Stage 13 (included), for simplicity you can use the modified run.sh. Copy exp/tri4b/trans. files into exp/tri4b/decode_tgsmall_train_clean_/ with the following command: Compute the fMLLR features by running the following script, the script can also be downloaded here: Compute alignments using: Apply CMVN and dump the fMLLR features to new .ark files, the script can also be downloaded here: Use the Python script to convert Kaldi generated .ark features to .npy for your own dataloader, an example Python script is provided:

    Read more →
  • Human–robot collaboration

    Human–robot collaboration

    Human-Robot Collaboration is the study of collaborative processes in human and robot agents work together to achieve shared goals. Many new applications for robots require them to work alongside people as capable members of human-robot teams. These include robots for homes, hospitals, and offices, space exploration and manufacturing. Human-Robot Collaboration (HRC) is an interdisciplinary research area comprising classical robotics, human-computer interaction, artificial intelligence, process design, layout planning, ergonomics, cognitive sciences, and psychology. Industrial applications of human-robot collaboration involve Collaborative Robots, or cobots, that physically interact with humans in a shared workspace to complete tasks such as collaborative manipulation or object handovers. == Collaborative Activity == Collaboration is defined as a special type of coordinated activity, one in which two or more agents work jointly with each other, together performing a task or carrying out the activities needed to satisfy a shared goal. The process typically involves shared plans, shared norms and mutually beneficial interactions. Although collaboration and cooperation are often used interchangeably, collaboration differs from cooperation as it involves a shared goal and joint action where the success of both parties depend on each other. For effective human-robot collaboration, it is imperative that the robot is capable of understanding and interpreting several communication mechanisms similar to the mechanisms involved in human-human interaction. The robot must also communicate its own set of intents and goals to establish and maintain a set of shared beliefs and to coordinate its actions to execute the shared plan. In addition, all team members demonstrate commitment to doing their own part, to the others doing theirs, and to the success of the overall task. == Theories Informing Human-Robot Collaboration == Human-human collaborative activities are studied in depth in order to identify the characteristics that enable humans to successfully work together. These activity models usually aim to understand how people work together in teams, how they form intentions and achieve a joint goal. Theories on collaboration inform human-robot collaboration research to develop efficient and fluent collaborative agents. === Belief Desire Intention Model === The belief-desire-intention (BDI) model is a model of human practical reasoning that was originally developed by Michael Bratman. The approach is used in intelligent agents research to describe and model intelligent agents. The BDI model is characterized by the implementation of an agent's beliefs (the knowledge of the world, state of the world), desires (the objective to accomplish, desired end state) and intentions (the course of actions currently under execution to achieve the desire of the agent) in order to deliberate their decision-making processes. BDI agents are able to deliberate about plans, select plans and execute plans. === Shared Cooperative Activity === Shared Cooperative Activity defines certain prerequisites for an activity to be considered shared and cooperative: mutual responsiveness, commitment to the joint activity and commitment to mutual support. An example case to illustrate these concepts would be a collaborative activity where agents are moving a table out the door, mutual responsiveness ensures that movements of the agents are synchronized; a commitment to the joint activity reassures each team member that the other will not at some point drop his side; and a commitment to mutual support deals with possible breakdowns due to one team member's inability to perform part of the plan. === Joint Intention Theory === Joint Intention Theory proposes that for joint action to emerge, team members must communicate to maintain a set of shared beliefs and to coordinate their actions towards the shared plan. In collaborative work, agents should be able to count on the commitment of other members, therefore each agent should inform the others when they reach the conclusion that a goal is achievable, impossible, or irrelevant. == Approaches to Human-Robot Collaboration == The approaches to human-robot collaboration include human emulation (HE) and human complementary (HC) approaches. Although these approaches have differences, there are research efforts to develop a unified approach stemming from potential convergences such as Collaborative Control. === Human Emulation === The human emulation approach aims to enable computers to act like humans or have human-like abilities in order to collaborate with humans. It focuses on developing formal models of human-human collaboration and applying these models to human-computer collaboration. In this approach, humans are viewed as rational agents who form and execute plans for achieving their goals and infer other people's plans. Agents are required to infer the goals and plans of other agents, and collaborative behavior consists of helping other agents to achieve their goals. === Human Complementary === The human complementary approach seeks to improve human-computer interaction by making the computer a more intelligent partner that complements and collaborates with humans. The premise is that the computer and humans have fundamentally asymmetric abilities. Therefore, researchers invent interaction paradigms that divide responsibility between human users and computer systems by assigning distinct roles that exploit the strengths and overcome the weaknesses of both partners. == Key Aspects == Specialization of Roles: Based on the level of autonomy and intervention, there are several human-robot relationships including master-slave, supervisor–subordinate, partner–partner, teacher–learner and fully autonomous robot. In addition to these roles, homotopy (a weighting function that allows a continuous change between leader and follower behaviors) was introduced as a flexible role distribution. Establishing shared goal(s): Through direct discussion about goals or inference from statements and actions, agents must determine the shared goals they are trying to achieve. Allocation of Responsibility and Coordination: Agents must decide how to achieve their goals, determine what actions will be done by each agent, and how to coordinate the actions of individual agents and integrate their results. Shared context: Agents must be able to track progress toward their goals. They must keep track of what has been achieved and what remains to be done. They must evaluate the effects of actions and determine whether an acceptable solution has been achieved. Communication: Any collaboration requires communication to define goals, negotiate over how to proceed and who will do what, and evaluate progress and results. Adaptation and learning: Collaboration over time require partners to adapt themselves to each other and learn from one's partner both directly or indirectly. Time and space: The time-space taxonomy divides human-robot interaction into four categories based on whether the humans and robots are using computing systems at the same time (synchronous) or different times (asynchronous) and while in the same place (collocated) or in different places (non-collocated). Ergonomics: Human factors and ergonomics are one of the key aspects for a sustainable human-robot collaboration. The robot control system can use biomechanical models and sensors to optimize various ergonomic metrics, such as muscle fatigue.

    Read more →
  • Video renderer

    Video renderer

    A video renderer is software that processes a video file and sends it sequentially to the video display controller card for display on a computer screen. An example of a video renderer, is the VMR-7 that was used by Microsoft's DirectShow. An example of a UNIX video renderer is the one container within GStreamer. Commonly used video renderers are: Enhanced Video Renderer VMR9 Renderless Haali's Video Renderer Madvr Video Renderer JRVR, a part of JRiver Media Center

    Read more →
  • AFNLP

    AFNLP

    AFNLP (Asian Federation of Natural Language Processing Associations) is the organization for coordinating the natural language processing related activities and events in the Asia-Pacific region. == Foundation == AFNLP was founded on 4 October 2000. == Member Associations == ALTA – Australasian Language Technology Association ANLP Japan Association of Natural Language Processing ROCLING Taiwan ROC Computational Linguistics Society SIG-KLC Korea SIG-Korean Language Computing of Korea Information Science Society == Existing Asian Initiatives == NLPRS: Natural Language Processing Pacific Rim Symposium IRAL: International Workshop on Information Retrieval with Asian Languages PACLING: Pacific Association for Computational Linguistics PACLIC: Pacific Asia Conference on Language, Information and Computation PRICAI: Pacific Rim International Conference on AI ICCPOL: International Conference on Computer Processing of Oriental Languages ROCLING: Research on Computational Linguistics Conference == Conferences == IJCNLP-04: The 1st International Joint Conference on Natural Language Processing in Hainan Island, China IJCNLP-05: The 2nd International Joint Conference on Natural Language Processing in Jeju Island, Korea IJCNLP-08: The 3rd International Joint Conference on Natural Language Processing in Hyderabad, India ACL-IJCNLP-2009: Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics (ACL) and 4th International Joint Conference on Natural Language Processing (IJCNLP) in Singapore IJNCLP-11: The 5th International Joint Conference on Natural Language Processing in Chiang Mai, Thailand

    Read more →
  • Multi-focus image fusion

    Multi-focus image fusion

    Multi-focus image fusion is a multiple image compression technique using input images with different focus depths to make one output image that preserves all information. == Overview == The main idea of image fusion is gathering important and the essential information from the input images into one single image which ideally has all of the information of the input images. The research history of image fusion spans over 30 years and many scientific papers. Image fusion generally has two aspects: image fusion methods and objective evaluation metrics. In visual sensor networks (VSN), sensors are cameras which record images and video sequences. In many applications of VSN, a camera can't give a perfect illustration including all details of the scene. This is because of the limited depth of focus of the optical lens of cameras. Therefore, just the object located in the focal length of camera is focused and clear, and other parts of the image are blurred. VSN captures images with different depths of focus using several cameras. Due to the large amount of data generated by cameras compared to other sensors such as pressure and temperature sensors and some limitations of bandwidth, energy consumption and processing time, it is essential to process the local input images to decrease the amount of transmitted data. == Multi-Focus image fusion in the spatial domain == Huang and Jing have reviewed and applied several focus measurements in the spatial domain for the multi-focus image fusion process, suitable for real-time applications. They mentioned some focus measurements including variance, energy of image gradient (EOG), Tenenbaum's algorithm (Tenengrad), energy of Laplacian (EOL), sum-modified-Laplacian (SML), and spatial frequency (SF). Their experiments showed that EOL gave better results than other methods like variance and spatial frequency. == Multi-Focus image fusion in multi-scale transform and DCT domain == Image fusion based on the multi-scale transform is the most commonly used and promising technique. Laplacian pyramid transform, gradient pyramid-based transform, morphological pyramid transform and the premier ones, discrete wavelet transform, shift-invariant wavelet transform (SIDWT), and discrete cosine harmonic wavelet transform (DCHWT) are some examples of image fusion methods based on multi-scale transform. These methods are complex and have some limitations e.g. processing time and energy consumption. For example, multi-focus image fusion methods based on DWT require a lot of convolution operations, so they take more time and energy to process. Therefore, most methods in multi-scale transform are not suitable for real-time applications. Moreover, these methods are not very successful along edges, due to the wavelet transform process missing the edges of the image. They create ringing artefacts in the output image and reduce its quality. Due to the aforementioned problems in the multi-scale transform methods, researchers are interested in multi-focus image fusion in the DCT domain. DCT-based methods are more efficient in terms of transmission and archiving images coded in Joint Photographic Experts Group (JPEG) standard to the upper node in the VSN agent. A JPEG system consists of a pair of an encoder and a decoder. In the encoder, images are divided into non-overlapping 8×8 blocks, and the DCT coefficients are calculated for each. Since the quantization of DCT coefficients is a lossy process, many of the small-valued DCT coefficients are quantized to zero, which corresponds to high frequencies. DCT-based image fusion algorithms work better when the multi-focus image fusion methods are applied in the compressed domain. In addition, in the spatial-based methods, the input images must be decoded and then transferred to the spatial domain. After implementation of the image fusion operations, the output fused images must again be encoded. DCT domain-based methods do not require complex and time-consuming consecutive decoding and encoding operations. Therefore, the image fusion methods based on DCT domain operate with much less energy and processing time. Recently, a lot of research has been carried out in the DCT domain. DCT+Variance, DCT+Corr_Eng, DCT+EOL, and DCT+VOL are some prominent examples of DCT based methods.

    Read more →
  • Fyuse

    Fyuse

    Fyuse is a spatial photography app which lets users capture and share interactive 3D images. By tilting or swiping one's smartphone, one can view such "fyuses" from various angles — as if one were walking around an object or subject. The app blends photography and video to create an interactive medium and was first published for iOS in April 2014. The Android version was released at the end of 2014. == The app == Fyuse lets users capture panoramas, selfies, and full 360° views of objects and allows one to view captured moments from different angles. It has its own personal gallery, social network and standalone web integration. With the app, Fyusion also created a social networking platform similar to Instagram. Fyuses can be shared, commented on, liked and re-shared to one's followers (called Echoes). One can build a network of followers and with engagement tracking, one can see how many times an image has been interacted with The images can also be saved for private, offline view, or shared to other social networks, like Facebook or Twitter, or embedded on a website where the images can be interacted with by desktop users via dragging the mouse. Furthermore, in the compass tab other fyuses can be discovered using the app's system of tags and categories. One's Fyuse feed is prepopulated with top users, and one can follow people to see when they post a new fyuse. The app will also find one's friends if one signs up with Facebook or connects it with one's Twitter account. To create a fyuse one moves around a person or object with one's phone's camera in one direction or moving/tilting one's phone around while holding one's finger on the screen. By combining photography and video the app allows one to capture moments that one may not have otherwise been able to capture by recording not one moment in time but stitched together little moments. According to Fyusion CEO Radu Rusu, a photo freezes a moment in time, while a video captures moments in a linear timeline — both still flat, when viewed. A fyuse image captures a moment in space, where one can not only see one side of something, but also around it. When it is done rendering, fyuses can also be edited – one can trim the fyuse for length and edit the brightness, contrast, exposure, saturation and sharpness. One can also add a vignette and apply a filters, with options to adjust their intensity. After editing, one can write a description, add hashtags, and tag parts of the fyuse before one can (voluntarily) publish and share it. Version 1.0 has been described as "alpha prototype" and version 2.0 was released on 17 December 2014. Version 3.0 introduced 3D tagging by which users can layer 3D graphic that animate accordingly with each interaction to add some context to the content. Version 4.0 was released on December 21, 2016 for iOS. Since January 2016 (v3.2) the app allows the export of fyuses as Live Photos. The app has also been described as a more sophisticated version of 3D stickers and flip images. == Applications == The app has many applications for e-commerce such as for fashion designers who want to showcase a garment from every angle, or real estate listings and Airbnb-type sites that want to make their rental properties seem as enticing as possible. The app can also be used for interactive art, 360° panoramas and selfies. == History == San Francisco-based Fyusion Inc.'s three founders — Radu B. Rusu, CTO Stefan Holzer, and VP of Engineering Stephen Miller — worked together at Willow Garage, the robotics research lab started by early Google employee Scott Hassan in the area of "personal robotics" — Hassan decided to turn the lab into more of an incubator, suggesting that the members spin off their technologies into consumer-facing enterprises. Rusu first set out with an open-source 3D perception software startup called Open Perception. Fyusion was officially founded in 2013, and soon after Rusu and his cofounders patented the technology for spatial photography. The company closed a seed funding round at the end of May, raising $3.35 million from investors, including an angel investment from Sun Microsystems cofounder Andreas Bechtolsheim. In 2014 the Fyuse team consisted of 13 employees, mostly engineers and designers, recruited from around the globe. In March 2015 the team displayed their app at Katy Perry's premiere for the movie "Prismatic World Tour on Epix" where Perry also took Fyuse for a test run. == Augmented reality == In September 2016 Fyusion unveiled its platform for creating augmented reality content using ones smartphone. It takes the images from ones smartphone and converts them into 3D holographic images, which one can then view on an AR headset. According to Rusu "by making it easy for people to capture their surroundings on any mobile device, [Fyusion is] revolutionizing the way that people view the world around them" and also states that for "AR to be successful, anyone should be able to create content for it" opposed to the current "small number of content creators and an even smaller number of hardware players". According to him "the applications of [Fyusion's] technology for consumers and businesses are incredibly limitless". The platform uses the company's patented 3D spatio-temporal platform that uses advanced sensor fusion, machine learning and computer vision algorithms and part of the platform is built into the Fyuse app. Before committing to releasing a separate consumer product the company intends to wait until the HoloLens device becomes available to the public. Until then any Fyuse representation created using Fyuse is AR ready and will be able to be shown in HoloLens in the future. == Fyuse - Point of No Return == Fyuse - Point of No Return is a science fiction short advert for Fyuse 3.0 in which Fyuse's digital medium is extrapolated into the future. In the film a woman uses a mini scanning-drone to 3D scan a tree with Fyuse and later recreate it as an augmented reality object at another place.

    Read more →