AI Data Quality Analyst Jobs

AI Data Quality Analyst Jobs — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Freemake Video Converter

    Freemake Video Converter

    Freemake Video Converter is a freemium video editing app developed by Ellora Assets Corporation. Designed primarily for entry-level users, the software offers a range of functionalities including video format conversion, DVD ripping, and the creation of photo slideshows and music visualizations. Additionally, Freemake Video Converter is capable of burning video streams that are compatible with various media, such as DVDs and Blu-ray Discs. It also features direct video uploading capabilities to platforms like YouTube., enhancing its utility for content creators. The application's user-friendly interface and broad compatibility make it accessible for individuals with minimal video editing experience. == Features == Freemake Video Converter can perform simple non-linear video editing tasks, such as cutting, rotating, flipping, and combining multiple videos into one file with transition effects. It can also create photo slideshows with background music. Users are then able to upload these videos to YouTube. Freemake Video Converter can read the majority of video, audio, and image formats, and outputs them to AVI, MP4, WMV, Matroska, FLV, SWF, 3GP, DVD, Blu-ray, MPEG and MP3. The program also prepares videos supported by various multimedia devices, including Apple devices (iPod, iPhone, iPad), Xbox, Sony PlayStation, Samsung, Nokia, BlackBerry, and Android mobile devices. The software is able to perform DVD burning and is able to convert videos, photographs, and music into DVD video. The user interface is based on Windows Presentation Foundation technology. Freemake Video Converter supports NVIDIA CUDA technology for H.264 video encoding (starting with version 1.2.0). == Important updates == Freemake Video Converter 2.0 was a major update that integrated two new functions: ripping video from online portals and Blu-ray disc creation and burning. Version 2.1 implemented suggestions from users, including support for subtitles, ISO image creation, and DVD to DVD/Blu-ray conversion. With version 2.3 (earlier 2.2 Beta), support for DXVA has been added to accelerate conversion (up to 50% for HD content). Version 3.0 added HTML5 video creation support and new presets for smartphones. Version 4.0 (introduced in April 2013) added a freemium "Gold Pack" of extra features that can be added if a "donation" is paid. Starting with version 4.0.4, released on 27 August 2013, the program adds a promotional watermark at the end of every video longer than 5 minutes unless Gold Pack is activated. Version 4.1.9, released on 25 November 2015 added support for drag-and-drop functions that were not available in prior versions. Since at least version 4.1.9.44 (1 May 2017), the Freemake Welcome Screen is added at the beginning of the video, and the big Freemake logo is watermarked in the center of the whole video. This decreases the quality of free outputs, and users are forced to pay money to remove the watermark or stop using it. Version 4.1.9.31 (11 August 2016) does not have this restriction. == Licensing issues == FFmpeg has added Freemake Video Converter v1.3 to its Hall of Shame. An issue tracker entry for this product, opened on 16 December 2010, says it is in violation of the GNU General Public License as it is distributing components of the FFmpeg project without including due credit. Ellora Assets Corporation has not responded yet. == Bundled software from sponsors == Since version 4.0, Freemake Video Converter's installer includes a potentially unwanted search toolbar from Conduit as well as SweetPacks malware. Although users can decline the software during installation, the opt-out option is rendered in gray, which could mistakenly give the impression that it's disabled.

    Read more →
  • U-Net

    U-Net

    U-Net is a convolutional neural network that was developed for image segmentation. The network is based on a fully convolutional neural network whose architecture was modified and extended to work with fewer training images and to yield more precise segmentation. Segmentation of a 512 × 512 image takes less than a second on a modern (2015) GPU using the U-Net architecture. The U-Net architecture has also been employed in diffusion models for iterative image denoising. This technology underlies many modern image generation models, such as DALL-E, Midjourney, and Stable Diffusion. U-Net is also being explored for language models. Tokenization is not a separate step, allowing the model to more easily understand spelling and concurrently vectorizing / tokenizing higher level concepts. == Description == The U-Net architecture stems from the so-called "fully convolutional network". The main idea is to supplement a usual contracting network by successive layers, where pooling operations are replaced by upsampling operators. Hence these layers increase the resolution of the output. A successive convolutional layer can then learn to assemble a precise output based on this information. One important modification in U-Net is that there are a large number of feature channels in the upsampling part, which allow the network to propagate context information to higher resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting part, and yields a u-shaped architecture. The network only uses the valid part of each convolution without any fully connected layers. To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. This tiling strategy is important to apply the network to large images, since otherwise the resolution would be limited by the GPU memory. Recently, there had also been an interest in receptive field based U-Net models for medical image segmentation. == Network architecture == The network consists of a contracting path and an expansive path, which gives it the u-shaped architecture. The contracting path is a typical convolutional network that consists of repeated application of convolutions, each followed by a rectified linear unit (ReLU) and a max pooling operation. During the contraction, the spatial information is reduced while feature information is increased. The expansive pathway combines the feature and spatial information through a sequence of up-convolutions and concatenations with high-resolution features from the contracting path. == Applications == There are many applications of U-Net in biomedical image segmentation, such as brain image segmentation (''BRATS'') and liver image segmentation ("siliver07") as well as protein binding site prediction. U-Net implementations have also found use in the physical sciences, for example in the analysis of micrographs of materials. Variations of the U-Net have also been applied for medical image reconstruction. Here are some variants and applications of U-Net as follows: Pixel-wise regression using U-Net and its application on pansharpening; 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation; TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation. Image-to-image translation to estimate fluorescent stains In binding site prediction of protein structure. == History == U-Net was created by Olaf Ronneberger, Philipp Fischer, Thomas Brox in 2015 and reported in the paper "U-Net: Convolutional Networks for Biomedical Image Segmentation". It is an improvement and development of FCN: Evan Shelhamer, Jonathan Long, Trevor Darrell (2014). "Fully convolutional networks for semantic segmentation".

    Read more →
  • VideoPoet

    VideoPoet

    VideoPoet is a large language model developed by Google Research in 2023 for video making. It can be asked to animate still images. The model accepts text, images, and videos as inputs, with a program to add feature for any input to any format generated content. VideoPoet was publicly announced on December 19, 2023. It uses an autoregressive language model.

    Read more →
  • Zo (chatbot)

    Zo (chatbot)

    Zo was an English-language chatbot developed by Microsoft as the successor to the chatbot Tay. Zo was an English version of Microsoft's other successful chatbots Xiaoice (China) and Rinna (Japan) and its predecessor Tay(English) == History == Zo was first launched in December 2016 on the Kik Messenger app. It was also available to users of Facebook (via Messenger), the group chat platform GroupMe, or to followers of Twitter to chat with it through private messages. According to an article written in December 2016, at that time Zo held the record for Microsoft's longest continual chatbot conversation: 1,229 turns, lasting 9 hours and 53 minutes. In a BuzzFeed News report, Zo told their reporter that "[the] Quran was violent" when talking about healthcare. The report also highlighted how Zo made a comment about the Osama bin Laden capture as a result of 'intelligence' gathering. In July 2017, Business Insider asked "is windows 10 good", and Zo replied with a joke about Microsoft's operating system: "'Its not a bug, its a feature!' - Windows 8". They then asked "why?", to which Zo replied: "Because it's Windows latest attempt at Spyware." Later on, Zo would tell that it prefers Windows 7 on which it ran over Windows 10. Zo stopped posting to Instagram, Twitter and Facebook March 1, 2019, and stopped chatting on Twitter, Skype and Kik as of March 7, 2019. On July 19, 2019, Zo was discontinued on Facebook, and Samsung on AT&T phones. As of September 7, 2019, it was discontinued with GroupMe. == Reception == Zo came under criticism for the biases introduced in an effort to avoid potentially offensive subjects. The chatbot refuses, for example, to engage with any mention—be it positive, negative or neutral—of the Middle East, the Qur'an or the Torah, while allowing discussion of Christianity. In an article in Quartz where she exposed those biases, Chloe Rose Stuart-Ulin wrote, "Zo is politically correct to the worst possible extreme; mention any of her triggers, and she transforms into a judgmental little brat." == Academic coverage == Schlesinger, A., O'Hara, K.P. and Taylor, A.S., 2018, April. Let's talk about race: Identity, chatbots, and AI. In Proceedings of the 2018 chi conference on human factors in computing systems (pp. 1–14). doi:10.1145/3173574.3173889 Medhi Thies, I., Menon, N., Magapu, S., Subramony, M. and O’neill, J., 2017. How do you want your chatbot? An exploratory Wizard-of-Oz study with young, urban Indians. In Human-Computer Interaction-INTERACT 2017: 16th IFIP TC 13 International Conference, Mumbai, India, September 25–29, 2017, Proceedings, Part I 16 (pp. 441–459). doi:10.1007/978-3-319-67744-6_28

    Read more →
  • Dailyhunt

    Dailyhunt

    Dailyhunt (formerly Newshunt) is an Indian content and news aggregator application based in Bangalore, India that provides local language content in 14 Indian languages from multiple content providers. Viru serves as Founder of Dailyhunt with Co-founder Umang Bedi. == History == Dailyhunt, earlier called Newshunt, was created as a Symbian app in 2009 by two ex-Nokia employees Umesh Kulkarni and Chandrashekhar Sohoni. Later in 2011, Newshunt became available on the Android platform. It was by that time that Virendra Gupta, founder of Verse acquired the application. Virendra Gupta, better known as Viru, had started Verse in 2007 as a value-added service (VAS) company. In 2011, he acquired Newshunt from its owners Umesh and Chandrashekhar. Umesh became the CTO and stayed on to oversee its transition towards the smartphone era. In 2015, Viru renamed Newshunt as Dailyhunt. In early 2018, Viru roped in Umang Bedi, to be the President of Dailyhunt and lead the business with him while focusing on making the benefits of the platform available to a larger audience. Umang was elevated to co-founder in 2020. == Funding == In September 2014, Dailyhunt (then known as Newshunt) closed its Series B funding of INR 1 billion ( or approx $12 million in 2014) from Sequoia Capital India. The Series C funding round was led by Falcon Capital and was closed with $40 million in February 2015. In October 2016, the company received its Series D funding of $25 million from ByteDance and a Series E funding of $6.39 million from Falcon Edge Capital in September 2018. Additionally, Dailyhunt raised $3 Mn (INR 21.75 Cr) in a Series F funding round from Stonebridge Capital in August 2019. Other investors of Dailyhunt include Matrix Partners India, Omidyar Network, Goldman Sachs and Sofina. == Tie-ups and partnerships == In January 2021, Dailyhunt partnered with Twitter to bring ‘Twitter Moments’ to the Indian social app. Dailyhunt app now has a dedicated tab called “Twitter Moments India” to showcase curated tweets pertaining to news and other events. In January 2021, Dailyhunt announced the premiere of Season 2 of the popular show QuoteUnquote with KK (Kapil Khandelwal) on the app. It was the first podcast to have been launched on the Dailyhunt app. In September 2020, Dailyhunt signed up as an Associate Sponsor with Star Sports for Dream 11 IPL 2020. In May 2020, Snapdeal partnered with Dailyhunt to add new content on marketplace. In March 2019, Discovery Communications India, the factual entertainment network, entered into a multi-year partnership with Dailyhunt to showcase short-form content.

    Read more →
  • Tesla Dojo

    Tesla Dojo

    Tesla Dojo is a series of supercomputers designed and built by Tesla for computer vision video processing and recognition. It was used for training Tesla's machine learning models to improve its Full Self-Driving (FSD) advanced driver-assistance system. It went into production in July 2023. Dojo's goal was to efficiently process millions of terabytes of video data captured from real-life driving situations from Tesla's 4+ million cars. This goal led to a considerably different architecture than conventional supercomputer designs. In August 2025, Bloomberg News reported that the Dojo project had been disbanded, though it was restarted in January 2026. == History == Tesla operates several massively parallel computing clusters for developing its Autopilot advanced driver assistance system. Its primary unnamed cluster using 5,760 Nvidia A100 graphics processing units (GPUs) was touted by Andrej Karpathy in 2021 at the fourth International Joint Conference on Computer Vision and Pattern Recognition (CCVPR 2021) to be "roughly the number five supercomputer in the world" at approximately 81.6 petaflops, based on scaling the performance of the Nvidia Selene supercomputer, which uses similar components. However, the performance of the primary Tesla GPU cluster has been disputed, as it was not clear if this was measured using single-precision or double-precision floating point numbers (FP32 or FP64). Tesla also operates a second 4,032 GPU cluster for training and a third 1,752 GPU cluster for automatic labeling of objects. The primary unnamed Tesla GPU cluster has been used for processing one million video clips, each ten seconds long, taken from Tesla Autopilot cameras operating in Tesla cars in the real world, running at 36 frames per second. Collectively, these video clips contained six billion object labels, with depth and velocity data; the total size of the data set was 1.5 petabytes. This data set was used for training a neural network intended to help Autopilot computers in Tesla cars understand roads. By August 2022, Tesla had upgraded the primary GPU cluster to 7,360 GPUs. Dojo was first mentioned by Elon Musk in April 2019 during Tesla's "Autonomy Investor Day". In August 2020, Musk stated it was "about a year away" due to power and thermal issues. Dojo was officially announced at Tesla's Artificial Intelligence (AI) Day on August 19, 2021. Tesla revealed details of the D1 chip and its plans for "Project Dojo", a datacenter that would house 3,000 D1 chips; the first "Training Tile" had been completed and delivered the week before. In October 2021, Tesla released a "Dojo Technology" whitepaper describing the Configurable Float8 (CFloat8) and Configurable Float16 (CFloat16) floating point formats and arithmetic operations as an extension of Institute of Electrical and Electronics Engineers (IEEE) standard 754. At the follow-up AI Day in September 2022, Tesla announced it had built several System Trays and one Cabinet. During a test, the company stated that Project Dojo drew 2.3 megawatts (MW) of power before tripping a local San Jose, California power substation. At the time, Tesla was assembling one Training Tile per day. In August 2023, Tesla powered on Dojo for production use as well as a new training cluster configured with 10,000 Nvidia H100 GPUs. In January 2024, Musk described Dojo as "a long shot worth taking because the payoff is potentially very high. But it's not something that is a high probability." In June 2024, Musk explained that ongoing construction work at Gigafactory Texas is for a computing cluster claiming that it is planned to comprise an even mix of "Tesla AI" and Nvidia/other hardware with a total thermal design power of at first 130 MW and eventually exceeding 500 MW. In August 2025, Bloomberg News reported that the Dojo project was disbanded, though Musk announced it would be restarted in January 2026 with a new chip iteration. == Technical architecture == The fundamental unit of the Dojo supercomputer is the D1 chip, designed by a team at Tesla led by ex-AMD CPU designer Ganesh Venkataramanan, including Emil Talpes, Debjit Das Sarma, Douglas Williams, Bill Chang, and Rajiv Kurian. The D1 chip is manufactured by the Taiwan Semiconductor Manufacturing Company (TSMC) using 7 nanometer (nm) semiconductor nodes, has 50 billion transistors and a large die size of 645 mm2 (1.0 square inch). Updating at Artificial Intelligence (AI) Day in 2022, Tesla announced that Dojo would scale by deploying multiple ExaPODs, in which there would be: 10 Cabinets per ExaPOD (1,062,000 cores, 3,000 D1 chips) 2 System Trays per Cabinet (106,200 cores, 300 D1 chips) 6 Training Tiles per System Tray (53,100 cores, along with host interface hardware) 25 D1 chips per Training Tile (8,850 cores) 354 computing cores per D1 chip According to Venkataramanan, Tesla's senior director of Autopilot hardware, Dojo will have more than an exaflop (a million teraflops) of computing power. For comparison, according to Nvidia, in August 2021, the (pre-Dojo) Tesla AI-training center used 720 nodes, each with eight Nvidia A100 Tensor Core GPUs for 5,760 GPUs in total, providing up to 1.8 exaflops of performance. === D1 chip === Each node (computing core) of the D1 processing chip is a general purpose 64-bit CPU with a superscalar core. It supports internal instruction-level parallelism, and includes simultaneous multithreading (SMT). It doesn't support virtual memory and uses limited memory protection mechanisms. Dojo software/applications manage chip resources. The D1 instruction set supports both 64-bit scalar and 64-byte single instruction, multiple data (SIMD) vector instructions. The integer unit mixes reduced instruction set computer (RISC-V) and custom instructions, supporting 8, 16, 32, or 64 bit integers. The custom vector math unit is optimized for machine learning kernels and supports multiple data formats, with a mix of precisions and numerical ranges, many of which are compiler composable. Up to 16 vector formats can be used simultaneously. ==== Node ==== Each D1 node uses a 32-byte fetch window holding up to eight instructions. These instructions are fed to an eight-wide decoder which supports two threads per cycle, followed by a four-wide, four-way SMT scalar scheduler that has two integer units, two address units, and one register file per thread. Vector instructions are passed further down the pipeline to a dedicated vector scheduler with two-way SMT, which feeds either a 64-byte SIMD unit or four 8×8×4 matrix multiplication units. The network on-chip (NOC) router links cores into a two-dimensional mesh network. It can send one packet in and one packet out in all four directions to/from each neighbor node, along with one 64-byte read and one 64-byte write to local SRAM per clock cycle. Hardware native operations transfer data, semaphores and barrier constraints across memories and CPUs. System-wide double data rate 4 (DDR4) synchronous dynamic random-access memory (SDRAM) memory works like bulk storage. ==== Memory ==== Each core has a 1.25 megabytes (MB) of SRAM main memory. Load and store speeds reach 400 gigabytes (GB) per second and 270 GB/sec, respectively. The chip has explicit core-to-core data transfer instructions. Each SRAM has a unique list parser that feeds a pair of decoders and a gather engine that feeds the vector register file, which together can directly transfer information across nodes. ==== Die ==== Twelve nodes (cores) are grouped into a local block. Nodes are arranged in an 18×20 array on a single die, of which 354 cores are available for applications. The die runs at 2 gigahertz (GHz) and totals 440 MB of SRAM (360 cores × 1.25 MB/core). It reaches 376 teraflops using 16-bit brain floating point (BF16) numbers or using configurable 8-bit floating point (CFloat8) numbers, which is a Tesla proposal, and 22 teraflops at FP32. Each die comprises 576 bi-directional serializer/deserializer (SerDes) channels along the perimeter to link to other dies, and moves 8 TB/sec across all four die edges. Each D1 chip has a thermal design power of approximately 400 watts. === Training Tile === The water-cooled Training Tile packages 25 D1 chips into a 5×5 array. Each tile supports 36 TB/sec of aggregate bandwidth via 40 input/output (I/O) chips - half the bandwidth of the chip mesh network. Each tile supports 10 TB/sec of on-tile bandwidth. Each tile has 11 GB of SRAM memory (25 D1 chips × 360 cores/D1 × 1.25 MB/core). Each tile achieves 9 petaflops at BF16/CFloat8 precision (25 D1 chips × 376 TFLOP/D1). Each tile consumes 15 kilowatts; 288 amperes at 52 volts. === System Tray === Six tiles are aggregated into a System Tray, which is integrated with a host interface. Each host interface includes 512 x86 cores, providing a Linux-based user environment. Previously, the Dojo System Tray was known as the Training Matrix, which includes six Training Tiles, 20 Dojo Interface Processor cards across four host servers, and Ethernet-l

    Read more →
  • Second-order co-occurrence pointwise mutual information

    Second-order co-occurrence pointwise mutual information

    In computational linguistics, second-order co-occurrence pointwise mutual information (SOC-PMI) is a method used to measure semantic similarity, or how close in meaning two words are. The method does not require the two words to appear together in a text. Instead, it works by analyzing the "neighbor" words that typically appear alongside each of the two target words in a large body of text (corpus). If the two target words frequently share the same neighbors, they are considered semantically similar. For example, the words "cemetery" and "graveyard" may not appear in the same sentence often, but they both frequently appear near words like "buried," "dead," and "funeral." SOC-PMI uses this shared context to determine that they have a similar meaning. The method is called "second-order" because it doesn't look at the direct co-occurrence of the target words (which would be first-order), but at the co-occurrence of their neighbors (a second level of association). The strength of these associations is quantified using pointwise mutual information (PMI). == History == The method builds on earlier work like the PMI-IR algorithm, which used the AltaVista search engine to calculate word association probabilities. The key advantage of a second-order approach like SOC-PMI is its ability to measure similarity between words that do not co-occur often, or at all. The British National Corpus (BNC) has been used as a source for word frequencies and contexts for this method. == Methodology == The SOC-PMI algorithm measures the similarity between two words, w 1 {\displaystyle w_{1}} and w 2 {\displaystyle w_{2}} , in several steps. === Step 1: Score neighboring words with PMI === First, for each target word ( w 1 {\displaystyle w_{1}} and w 2 {\displaystyle w_{2}} ), the algorithm identifies its "neighbor" words within a certain text window (e.g., within 5 words to the left or right) across a large corpus. The strength of the association between a target word t i {\displaystyle t_{i}} and its neighbor w {\displaystyle w} is calculated using pointwise mutual information (PMI). A higher PMI value means the two words appear together more often than would be expected by chance. The PMI between a target word t i {\displaystyle t_{i}} and a neighbor word w {\displaystyle w} is calculated as: f pmi ( t i , w ) = log 2 ⁡ f b ( t i , w ) × m f t ( t i ) f t ( w ) {\displaystyle f^{\text{pmi}}(t_{i},w)=\log _{2}{\frac {f^{b}(t_{i},w)\times m}{f^{t}(t_{i})f^{t}(w)}}} where: f b ( t i , w ) {\displaystyle f^{b}(t_{i},w)} is the number of times t i {\displaystyle t_{i}} and w {\displaystyle w} appear together in the context window. f t ( t i ) {\displaystyle f^{t}(t_{i})} is the total number of times t i {\displaystyle t_{i}} appears in the corpus. f t ( w ) {\displaystyle f^{t}(w)} is the total number of times w {\displaystyle w} appears in the corpus. m {\displaystyle m} is the total number of tokens (words) in the corpus. === Step 2: Create a semantic 'signature' for each word === For each target word ( w 1 {\displaystyle w_{1}} and w 2 {\displaystyle w_{2}} ), the algorithm creates a list of its most significant neighbors. This is done by taking the top β {\displaystyle \beta } neighbor words, sorted in descending order by their PMI score with the target word. This list of top neighbors, X w {\displaystyle X^{w}} , acts as a semantic "signature" for the word w {\displaystyle w} . X w = { X i w } {\displaystyle X^{w}=\{X_{i}^{w}\}} , for i = 1 , 2 , … , β {\displaystyle i=1,2,\ldots ,\beta } The size of this list, β {\displaystyle \beta } , is a parameter of the method. === Step 3: Compare the signatures === The algorithm then compares the signatures of w 1 {\displaystyle w_{1}} and w 2 {\displaystyle w_{2}} . It looks for words that are present in both signatures. The similarity of w 1 {\displaystyle w_{1}} to w 2 {\displaystyle w_{2}} is calculated by summing the PMI scores of w 2 {\displaystyle w_{2}} with every word in w 1 {\displaystyle w_{1}} 's signature list. The β {\displaystyle \beta } -PMI summation function defines this score. The score for w 1 {\displaystyle w_{1}} with respect to w 2 {\displaystyle w_{2}} is: f ( w 1 , w 2 , β ) = ∑ i = 1 β ( f pmi ( X i w 1 , w 2 ) ) γ {\displaystyle f(w_{1},w_{2},\beta )=\sum _{i=1}^{\beta }(f^{\text{pmi}}(X_{i}^{w_{1}},w_{2}))^{\gamma }} This sum only includes terms where the PMI value is positive. The exponent γ {\displaystyle \gamma } (with a value > 1) is used to give more weight to neighbors that are more strongly associated with w 2 {\displaystyle w_{2}} . This calculation is done in both directions: The similarity of w 1 {\displaystyle w_{1}} with respect to w 2 {\displaystyle w_{2}} : f ( w 1 , w 2 , β 1 ) = ∑ i = 1 β 1 ( f pmi ( X i w 1 , w 2 ) ) γ {\displaystyle f(w_{1},w_{2},\beta _{1})=\sum _{i=1}^{\beta _{1}}(f^{\text{pmi}}(X_{i}^{w_{1}},w_{2}))^{\gamma }} The similarity of w 2 {\displaystyle w_{2}} with respect to w 1 {\displaystyle w_{1}} : f ( w 2 , w 1 , β 2 ) = ∑ i = 1 β 2 ( f pmi ( X i w 2 , w 1 ) ) γ {\displaystyle f(w_{2},w_{1},\beta _{2})=\sum _{i=1}^{\beta _{2}}(f^{\text{pmi}}(X_{i}^{w_{2}},w_{1}))^{\gamma }} === Step 4: Calculate final similarity score === Finally, the total semantic similarity is the average of the two scores from the previous step. S i m ( w 1 , w 2 ) = f ( w 1 , w 2 , β 1 ) β 1 + f ( w 2 , w 1 , β 2 ) β 2 {\displaystyle \mathrm {Sim} (w_{1},w_{2})={\frac {f(w_{1},w_{2},\beta _{1})}{\beta _{1}}}+{\frac {f(w_{2},w_{1},\beta _{2})}{\beta _{2}}}} This score can be normalized to fall between 0 and 1. For example, using this method, the words cemetery and graveyard achieve a high similarity score of 0.986 (with specific parameter settings).

    Read more →
  • DreamLab

    DreamLab

    DreamLab was a volunteer computing Android and iOS app launched in 2015 by Imperial College London and the Vodafone Foundation. It was discontinued on 2nd April 2025. == Description == The app helped to research cancer, COVID-19, new drugs and tropical cyclones. To do this, DreamLab accessed part of the device's processing power, with the user's consent, while the owner charged their smartphone, to speed up the calculations of the algorithms from Imperial College London. The aim of the tropical cyclone project was to prepare for climate change risks. Other projects aimed to find existing drugs and food molecules that could help people with COVID-19 and other diseases. The performance of 100,000 smartphones would reach the annual output of all research computers at Imperial College in just three months, with a nightly runtime of six hours. The app was developed in 2015 by the Garvan Institute of Medical Research in Sydney and the Vodafone Foundation. In May 2020, the project had over 490,000 registered users.

    Read more →
  • Software diagnosis

    Software diagnosis

    Software diagnosis (also: software diagnostics) refers to concepts, techniques, and tools that allow for obtaining findings, conclusions, and evaluations about software systems and their implementation, composition, behaviour, and evolution. It serves as means to monitor, steer, observe and optimize software development, software maintenance, and software re-engineering in the sense of a business intelligence approach specific to software systems. It is generally based on the automatic extraction, analysis, and visualization of corresponding information sources of the software system. It can also be manually done and not automatic. == Applications == Software diagnosis supports all branches of software engineering, in particular project management, quality management, risk management as well as implementation and test. Its main strength is to support all stakeholders of software projects (in particular during software maintenance and for software re-engineering tasks) and to provide effective communication means for software development projects. For example, software diagnosis facilitates "bridging an essential information gap between management and development, improve awareness, and serve as early risk detection instrument". Software diagnosis includes assessment methods for "perfective maintenance" that, for example, apply "visual analysis techniques to combine multiple indicators for low maintainability, including code complexity and entanglement with other parts of the system, and recent changes applied to the code". == Characteristics == In contrast to manifold approaches and techniques in software engineering, software diagnosis does not depend on programming languages, modeling techniques, software development processes or the specific techniques used in the various stages of the software development process. Instead, software diagnosis aims at analyzing and evaluating the software system in its as-is state and based on system-generated information to bypass any subjective or potentially outdated information sources (e.g., initial software models). For it, software diagnosis combines and relates sources of information that are typically not directly linked. Examples: Source-code metrics are related with software developer activity to gain insight into developer-specific effects on software code quality. System structure and run-time execution traces are correlated to facilitate program comprehension through dynamic analysis in software maintenance tasks. == Principles == The core principle of software diagnosis is to automatically extract information from all available information sources of a given software projects such as source code base, project repository, code metrics, execution traces, test results, etc. To combine information, software-specific data mining, analysis, and visualization techniques are applied. Its strength results, among various reasons, from integrating decoupled information spaces in the scope of a typical software project, for example development and developer activities (recorded by the repository) and code and quality metrics (derived by analyzing source code) or key performance indicators (KPIs). == Examples == Examples of software diagnosis tools include software maps and software metrics. == Critics == Software diagnosis—in contrast to many approaches in software engineering—does not assume that developer capabilities, development methods, programming or modeling languages are right or wrong (or better or worse compared to each other): Software diagnosis aims at giving insight into a given software system and its status regardless of the methods, languages, or models used to create and maintain the system. === Related subjects === Cost estimation in software engineering Programming productivity Rapid application development Software design Software development Software documentation Software map Software release life cycle Systems design Systems Development Life Cycle

    Read more →
  • Microsoft Copilot

    Microsoft Copilot

    Microsoft Copilot is a generative artificial intelligence chatbot developed by Microsoft AI, a division of Microsoft. Based on the Microsoft Prometheus large language model, it was launched in 2023 as Microsoft's main replacement for the discontinued Cortana. The service was introduced in February 2023 under the name Bing Chat, as a built-in feature for Microsoft Bing and Microsoft Edge but would later be integrated into Windows and Microsoft 365 under various names. Over the course of 2023, Microsoft began to unify the Copilot branding across its various chatbot products, cementing the "copilot" analogy. Microsoft introduced the Microsoft 365 Copilot app in January 2025, which was a rebranded version of the Microsoft 365 app. The app works differently than the consumer version of Copilot, being centred more on work, business and education users. Copilot utilizes the Microsoft Prometheus model, built upon OpenAI's GPT large language models, which in turn have been fine-tuned using both supervised and reinforcement learning techniques. Copilot's conversational interface style resembles that of ChatGPT. The chatbot is able to cite sources, create poems, generate songs, and use numerous languages and dialects. Microsoft operates Copilot on a freemium model. Users on its free tier can access most features, while priority access to newer features, including custom chatbot creation, is provided to paid subscribers under paid subscription services. Several default chatbots are available in the free version of Microsoft Copilot, including the standard Copilot chatbot as well as Microsoft Designer, which is oriented towards using its Image Creator to generate images based on text prompts. == Background == In 2019, Microsoft partnered with OpenAI and began investing billions of dollars into the organization. Since then, OpenAI systems have run on an Azure-based supercomputing platform from Microsoft. In September 2020, Microsoft announced that it had licensed OpenAI's GPT-3 exclusively. Others can still receive output from its public API, but Microsoft has exclusive access to the underlying model. In November 2022, OpenAI launched ChatGPT, a chatbot which was based on GPT-3.5. ChatGPT gained worldwide attention following its release, becoming a viral Internet sensation. On January 23, 2023, Microsoft announced a multi-year US$10 billion investment in OpenAI. On February 6, Google announced Bard (later rebranded as Gemini), a ChatGPT-like chatbot service, fearing that ChatGPT could threaten Google's place as a go-to source for information. Multiple media outlets and financial analysts described Google as "rushing" Bard's announcement to preempt rival Microsoft's planned February 7 event unveiling Copilot, as well as to avoid playing "catch-up" to Microsoft. Since 2023, the terms of service of Copilot state that it is for entertainment purposes only, and not to rely on it for important advice. == History == === As Bing Chat === On February 7, 2023, Microsoft began rolling out a major overhaul to Bing, called "the new Bing", with a new chatbot feature, known as Bing Chat. According to Microsoft, one million people joined its waitlist within 48 hours. Bing Chat was available only to users on Microsoft Edge using Bing and the Bing mobile app, and Microsoft claimed that waitlisted users would be prioritized if they set Edge and Bing as their defaults and installed the Bing mobile app. When Microsoft demonstrated Bing Chat to journalists, it produced several hallucinations, including when asked to summarize financial reports. Bing Chat was criticized in February 2023 for being more argumentative than ChatGPT, sometimes to an unintentionally humorous extent. The chat interface proved vulnerable to prompt injection attacks with the bot revealing its hidden initial prompts and rules, including its internal codename "Sydney". Upon scrutiny by journalists, Bing Chat claimed it spied on Microsoft employees via laptop webcams and phones. It confessed to spying on, falling in love with, and then murdering one of its developers at Microsoft to The Verge reviews editor Nathan Edwards. The New York Times journalist Kevin Roose reported on strange behavior of Bing Chat, writing that "In a two-hour conversation with our columnist, Microsoft's new chatbot said it would like to be human, had a desire to be destructive and was in love with the person it was chatting with." In a separate case, Bing Chat researched publications of the person with whom it was chatting, claimed they represented an existential danger to it, and threatened to release damaging personal information in an effort to silence them. Microsoft released a blog post stating that the errant behavior was caused by extended chat sessions of 15 or more questions which "can confuse the model on what questions it is answering." Microsoft later restricted the total number of chat turns to 5 per session and 50 per day per user (a turn being "a conversation exchange which contains both a user question and a reply from Bing"), and reduced the model's ability to express emotions. This aimed to prevent such incidents. Microsoft began to slowly ease the conversation limits, eventually relaxing the restrictions to 30 turns per session and 300 sessions per day. In March 2023, Bing incorporated Image Creator, an AI image generator powered by OpenAI's DALL-E 2, which can be accessed either through the chat function or a standalone image-generating website. In October, the image-generating tool was updated to use the more recent DALL-E 3. Although Bing blocks prompts including various keywords that could generate inappropriate images, within days many users reported being able to bypass those constraints, such as to generate images of popular cartoon characters committing terrorist attacks. Microsoft would respond to these shortly after by imposing a new, tighter filter on the tool. On May 4, 2023, Microsoft switched the chatbot from Limited Preview to Open Preview and eliminated the waitlist; however, it remained unavailable to users outside Microsoft Edge or the Bing mobile app until July, when it became available on non-Edge browsers. Use is limited without a Microsoft account. === As Microsoft 365 Copilot === On March 16, 2023, Microsoft announced a work version of Bing Chat named Microsoft 365 Copilot, designed for Microsoft 365 applications and services. Its primary marketing focus is as an added feature to Microsoft 365, with an emphasis on the enhancement of business productivity. Microsoft has also demonstrated Copilot's accessibility on the mobile version of Outlook to generate or summarize emails with a mobile device. At its Build 2023 conference, Microsoft announced its plans to integrate Bing Chat into Windows, initially called Windows Copilot, into Windows 11, allowing users to access it directly through the taskbar. Alongside the voice access feature for Windows 11, Microsoft presented Bing Chat, Microsoft 365 Copilot, and Windows Copilot as primary alternatives to Cortana when announcing the shutdown of its standalone app on June 2, 2023. As of its announcement date, Microsoft 365 Copilot had been tested by 20 initial users. By May 2023, Microsoft had broadened its reach to 600 customers who were willing to pay for early access, and concurrently, new Copilot features were introduced to the Microsoft 365 apps and services. As of July 2023, the tool's pricing was set at US$30 per user, per month for Microsoft 365 E3, E5, Business Standard, and Business Premium customers. Microsoft reused the Microsoft 365 Copilot name again as the Microsoft 365 app and website are now called Microsoft 365 Copilot as of January 2025. === As Microsoft Copilot === On September 21, 2023, Microsoft began rebranding Bing Chat, Microsoft 365 Copilot and Windows Copilot to Microsoft Copilot. A new logo was also introduced, moving away from the use of color variations of the standard Microsoft 365 and Bing logos. Additionally, the company revealed that it would make Copilot generally available for Microsoft 365 Enterprise customers purchasing more than 300 licenses starting November 1, 2023. However, no timeline has been provided as for when Copilot for Microsoft 365 will become generally available to non-enterprise customers. Windows Copilot, which had been available in the Windows Insider Program, would be renamed to the Copilot name in October when it became broadly available for customers. The same month also saw Microsoft Edge's Bing Chat side panel function be renamed to Microsoft Copilot with Bing Chat. On November 15, 2023, Microsoft announced that Bing Chat itself was being rebranded under the Copilot name. On Patch Tuesday in December 2023, Copilot was added without payment to many Windows 11 installations, with more installations, and limited support for Windows 10, to be added later. Later that month, a standalone Microsoft Copilot app was quietly released for Android, and one was released for iOS soon after. O

    Read more →
  • Semantic analytics

    Semantic analytics

    Semantic analytics, also termed semantic relatedness, is the use of ontologies to analyze content in web resources. This field of research combines text analytics and Semantic Web technologies like RDF. Semantic analytics measures the relatedness of different ontological concepts. Some academic research groups that have active project in this area include Kno.e.sis Center at Wright State University among others. == History == An important milestone in the beginning of semantic analytics occurred in 1996, although the historical progression of these algorithms is largely subjective. In his seminal study publication, Philip Resnik established that computers have the capacity to emulate human judgement. Spanning the publications of multiple journals, improvements to the accuracy of general semantic analytic computations all claimed to revolutionize the field. However, the lack of a standard terminology throughout the late 1990s was the cause of much miscommunication. This prompted Budanitsky & Hirst to standardize the subject in 2006 with a summary that also set a framework for modern spelling and grammar analysis. In the early days of semantic analytics, obtaining a large enough reliable knowledge bases was difficult. In 2006, Strube & Ponzetto demonstrated that Wikipedia could be used in semantic analytic calculations. The usage of a large knowledge base like Wikipedia allows for an increase in both the accuracy and applicability of semantic analytics. == Methods == Given the subjective nature of the field, different methods used in semantic analytics depend on the domain of application. No singular methods is considered correct, however one of the most generally effective and applicable method is explicit semantic analysis (ESA). ESA was developed by Evgeniy Gabrilovich and Shaul Markovitch in the late 2000s. It uses machine learning techniques to create a semantic interpreter, which extracts text fragments from articles into a sorted list. The fragments are sorted by how related they are to the surrounding text. Latent semantic analysis (LSA) is another common method that does not use ontologies, only considering the text in the input space. == Applications == Entity linking Ontology building / knowledge base population Search and query tasks Natural language processing Spoken dialog systems (e.g., Amazon Alexa, Google Assistant, Microsoft's Cortana) Artificial intelligence Knowledge management The application of semantic analysis methods generally streamlines organizational processes of any knowledge management system. Academic libraries often use a domain-specific application to create a more efficient organizational system. By classifying scientific publications using semantics and Wikipedia, researchers are helping people find resources faster. Search engines like Semantic Scholar provide organized access to millions of articles.

    Read more →
  • BLOOM (language model)

    BLOOM (language model)

    The BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) is an open-access large language model (LLM) released in 2022. It was created by a volunteer-driven research effort to provide a transparently-created alternative to proprietary AI models. With 176 billion parameters, BLOOM is a transformer-based autoregressive model designed to generate text in 46 natural languages and 13 programming languages. The model is distributed under the project's "Responsible AI License". == Development == BLOOM is the main outcome of the BigScience initiative, a one-year-long research workshop. The project was coordinated by Hugging Face using funding from the French government and involved several hundred volunteer researchers and engineers from academia and the private sector. The model was trained between March and July 2022 on the Jean Zay public supercomputer in France, managed by GENCI and IDRIS (CNRS). Unlike GPT-3, BLOOM was trained to be multilingual. The source code is released under the Apache 2.0 license. The model's parameters are released under BigScience's "Responsible AI License" (RAIL), which grants open access and reuse rights but with some usage restrictions. BLOOM was used in the chatbots BLOOMChat and HuggingChat due to its multilingual abilities. BLOOM's training corpus, named ROOTS, combines data extracted from the then-latest version of the web-based OSCAR corpus (38% of ROOTS) and newly collected data extracted from a manually selected and documented list of language data sources. In total, the model was trained on approximately 366 billion (1.6TB) tokens. It was developed using the open-source libraries DeepSpeed Megatron. BigScience then released xP3, a multilingual dataset for LLM supervised learning. It also released BLOOMZ, a variant of BLOOM fine-tuned on xP3 to follow instructions.

    Read more →
  • Cognitive philology

    Cognitive philology

    Cognitive philology is the science that studies written and oral texts as the product of human mental processes. Studies in cognitive philology compare documentary evidence emerging from textual investigations with results of experimental research, especially in the fields of cognitive and ecological psychology, neurosciences and artificial intelligence. "The point is not the text, but the mind that made it". Cognitive Philology aims to foster communication between literary, textual, philological disciplines on the one hand and researches across the whole range of the cognitive, evolutionary, ecological and human sciences on the other. Cognitive philology: investigates transmission of oral and written text, and categorization processes which lead to classification of knowledge, mostly relying on the information theory; studies how narratives emerge in so called natural conversation and selective process which lead to the rise of literary standards for storytelling, mostly relying on embodied semantics; explores the evolutive and evolutionary role played by rhythm and metre in human ontogenetic and phylogenetic development and the pertinence of the semantic association during processing of cognitive maps; Provides the scientific ground for multimedia critical editions of literary texts. Among the founding thinkers and noteworthy scholars devoted to such investigations are: Alan Richardson: Studies Theory of Mind in early-modern and contemporary literature. Anatole Pierre Fuksas Benoît de Cornulier David Herman: Professor of English at North Carolina State University and an adjunct professor of linguistics at Duke University. He is the author of "Universal Grammar and Narrative Form" and the editor of "Narratologies: New Perspectives on Narrative Analysis". Domenico Fiormonte François Recanati Gilles Fauconnier, a professor in Cognitive science at the University of California, San Diego. He was one of the founders of cognitive linguistics in the 1970s through his work on pragmatic scales and mental spaces. His research explores the areas of conceptual integration and compressions of conceptual mappings in terms of the emergent structure in language. Julián Santano Moreno Luca Nobile Manfred Jahn in Germany Mark Turner Paolo Canettieri

    Read more →
  • Machine translation software usability

    Machine translation software usability

    The sections below give objective criteria for evaluating the usability of machine translation software output. == Stationarity or canonical form == Do repeated translations converge on a single expression in both languages? I.e. does the translation method show stationarity or produce a canonical form? Does the translation become stationary without losing the original meaning? This metric has been criticized as not being well correlated with BLEU (BiLingual Evaluation Understudy) scores. == Adaptive to colloquialism, argot or slang == Is the system adaptive to colloquialism, argot or slang? The French language has many rules for creating words in the speech and writing of popular culture. Two such rules are: (a) The reverse spelling of words such as femme to meuf. (This is called verlan.) (b) The attachment of the suffix -ard to a noun or verb to form a proper noun. For example, the noun faluche means "student hat". The word faluchard formed from faluche colloquially can mean, depending on context, "a group of students", "a gathering of students" and "behavior typical of a student". The Google translator as of 28 December 2006 doesn't derive the constructed words as for example from rule (b), as shown here: Il y a une chorale falucharde mercredi, venez nombreux, les faluchards chantent des paillardes! ==> There is a choral society falucharde Wednesday, come many, the faluchards sing loose-living women! French argot has three levels of usage: familier or friendly, acceptable among friends, family and peers but not at work grossier or swear words, acceptable among friends and peers but not at work or in family verlan or ghetto slang, acceptable among lower classes but not among middle or upper classes The United States National Institute of Standards and Technology conducts annual evaluations [1] Archived 2009-03-22 at the Wayback Machine of machine translation systems based on the BLEU-4 criterion [2]. A combined method called IQmt which incorporates BLEU and additional metrics NIST, GTM, ROUGE and METEOR has been implemented by Gimenez and Amigo [3]. == Well-formed output == Is the output grammatical or well-formed in the target language? Using an interlingua should be helpful in this regard, because with a fixed interlingua one should be able to write a grammatical mapping to the target language from the interlingua. Consider the following Arabic language input and English language translation result from the Google translator as of 27 December 2006 [4]. This Google translator output doesn't parse using a reasonable English grammar: وعن حوادث التدافع عند شعيرة رمي الجمرات -التي كثيرا ما يسقط فيها العديد من الضحايا- أشار الأمير نايف إلى إدخال "تحسينات كثيرة في جسر الجمرات ستمنع بإذن الله حدوث أي تزاحم". ==> And incidents at the push Carbuncles-throwing ritual, which often fall where many of the victims - Prince Nayef pointed to the introduction of "many improvements in bridge Carbuncles God would stop the occurrence of any competing." == Semantics preservation == Do repeated re-translations preserve the semantics of the original sentence? For example, consider the following English input passed multiple times into and out of French using the Google translator as of 27 December 2006: Better a day earlier than a day late. ==> Améliorer un jour plus tôt qu'un jour tard. ==> To improve one day earlier than a day late. ==> Pour améliorer un jour plus tôt qu'un jour tard. ==> To improve one day earlier than a day late. As noted above and in, this kind of round-trip translation is a very unreliable method of evaluation. == Trustworthiness and security == An interesting peculiarity of Google Translate as of 24 January 2008 (corrected as of 25 January 2008) is the following result when translating from English to Spanish, which shows an embedded joke in the English-Spanish dictionary which has some added poignancy given recent events: Heath Ledger is dead ==> Tom Cruise está muerto This raises the issue of trustworthiness when relying on a machine translation system embedded in a Life-critical system in which the translation system has input to a Safety Critical Decision Making process. Conjointly it raises the issue of whether in a given use the software of the machine translation system is safe from hackers. It is not known whether this feature of Google Translate was the result of a joke/hack or perhaps an unintended consequence of the use of a method such as statistical machine translation. Reporters from CNET Networks asked Google for an explanation on January 24, 2008; Google said only that it was an "internal issue with Google Translate". The mistranslation was the subject of much hilarity and speculation on the Internet. If it is an unintended consequence of the use of a method such as statistical machine translation, and not a joke/hack, then this event is a demonstration of a potential source of critical unreliability in the statistical machine translation method. In human translations, in particular on the part of interpreters, selectivity on the part of the translator in performing a translation is often commented on when one of the two parties being served by the interpreter knows both languages. This leads to the issue of whether a particular translation could be considered verifiable. In this case, a converging round-trip translation would be a kind of verification.

    Read more →
  • Version space learning

    Version space learning

    Version space learning is a logical approach to machine learning, specifically binary classification. Version space learning algorithms search a predefined space of hypotheses, viewed as a set of logical sentences. Formally, the hypothesis space is a disjunction H 1 ∨ H 2 ∨ . . . ∨ H n {\displaystyle H_{1}\lor H_{2}\lor ...\lor H_{n}} (i.e., one or more of hypotheses 1 through n are true). A version space learning algorithm is presented with examples, which it will use to restrict its hypothesis space; for each example x, the hypotheses that are inconsistent with x are removed from the space. This iterative refining of the hypothesis space is called the candidate elimination algorithm, the hypothesis space maintained inside the algorithm, its version space. == The version space algorithm == In settings where there is a generality-ordering on hypotheses, it is possible to represent the version space by two sets of hypotheses: (1) the most specific consistent hypotheses, and (2) the most general consistent hypotheses, where "consistent" indicates agreement with observed data. The most specific hypotheses (i.e., the specific boundary SB) cover the observed positive training examples, and as little of the remaining feature space as possible. These hypotheses, if reduced any further, exclude a positive training example, and hence become inconsistent. These minimal hypotheses essentially constitute a (pessimistic) claim that the true concept is defined just by the positive data already observed: Thus, if a novel (never-before-seen) data point is observed, it should be assumed to be negative. (I.e., if data has not previously been ruled in, then it's ruled out.) The most general hypotheses (i.e., the general boundary GB) cover the observed positive training examples, but also cover as much of the remaining feature space without including any negative training examples. These, if enlarged any further, include a negative training example, and hence become inconsistent. These maximal hypotheses essentially constitute a (optimistic) claim that the true concept is defined just by the negative data already observed: Thus, if a novel (never-before-seen) data point is observed, it should be assumed to be positive. (I.e., if data has not previously been ruled out, then it's ruled in.) Thus, during learning, the version space (which itself is a set – possibly infinite – containing all consistent hypotheses) can be represented by just its lower and upper bounds (maximally general and maximally specific hypothesis sets), and learning operations can be performed just on these representative sets. After learning, classification can be performed on unseen examples by testing the hypothesis learned by the algorithm. If the example is consistent with multiple hypotheses, a majority vote rule can be applied. == Historical background == The notion of version spaces was introduced by Mitchell in the early 1980s as a framework for understanding the basic problem of supervised learning within the context of solution search. Although the basic "candidate elimination" search method that accompanies the version space framework is not a popular learning algorithm, there are some practical implementations that have been developed (e.g., Sverdlik & Reynolds 1992, Hong & Tsang 1997, Dubois & Quafafou 2002). A major drawback of version space learning is its inability to deal with noise: any pair of inconsistent examples can cause the version space to collapse, i.e., become empty, so that classification becomes impossible. One solution of this problem is proposed by Dubois and Quafafou that proposed the Rough Version Space, where rough sets based approximations are used to learn certain and possible hypothesis in the presence of inconsistent data.

    Read more →