AI Data Bay

AI Data Bay — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • StyleGAN

    StyleGAN

    The Style Generative Adversarial Network, or StyleGAN for short, is an extension to the GAN architecture introduced by Nvidia researchers in December 2018, and made source available in February 2019. StyleGAN depends on Nvidia's CUDA software, GPUs, and Google's TensorFlow, or Meta AI's PyTorch, which supersedes TensorFlow as the official implementation library in later StyleGAN versions. The second version of StyleGAN, called StyleGAN2, was published on February 5, 2020. It removes some of the characteristic artifacts and improves the image quality. Nvidia introduced StyleGAN3, described as an "alias-free" version, on June 23, 2021, and made source available on October 12, 2021. == History == A direct predecessor of the StyleGAN series is the Progressive GAN, published in 2017. In December 2018, Nvidia researchers distributed a preprint with accompanying software introducing StyleGAN, a GAN for producing an unlimited number of (often convincing) portraits of fake human faces. StyleGAN was able to run on Nvidia's commodity GPU processors. In February 2019, Uber engineer Phillip Wang used the software to create the website This Person Does Not Exist, which displayed a new face on each web page reload. Wang himself has expressed amazement, given that humans are evolved to specifically understand human faces, that nevertheless StyleGAN can competitively "pick apart all the relevant features (of human faces) and recompose them in a way that's coherent." In September 2019, a website called Generated Photos published 100,000 images as a collection of stock photos. The collection was made using a private dataset shot in a controlled environment with similar light and angles. Similarly, two faculty at the University of Washington's Information School used StyleGAN to create Which Face is Real?, which challenged visitors to differentiate between a fake and a real face side by side. 
The faculty stated the intention was to "educate the public" about the existence of this technology so they could be wary of it, "just like eventually most people were made aware that you can Photoshop an image". The second version of StyleGAN, called StyleGAN2, was published on February 5, 2020. It removes some of the characteristic artifacts and improves the image quality. In 2021, a third version was released, improving consistency between fine and coarse details in the generator. Dubbed "alias-free", this version was implemented with PyTorch. === Illicit use === In December 2019, Facebook took down a network of accounts with false identities, and mentioned that some of them had used profile pictures created with machine learning techniques. == Architecture == === Progressive GAN === Progressive GAN is a method for stably training a GAN for large-scale image generation, by growing the GAN generator from small to large scale in a pyramidal fashion. Like SinGAN, it decomposes the generator as $G = G_1 \circ G_2 \circ \cdots \circ G_N$, and the discriminator as $D = D_N \circ D_{N-1} \circ \cdots \circ D_1$. During training, at first only $G_N, D_N$ are used in a GAN game to generate 4x4 images. Then $G_{N-1}, D_{N-1}$ are added to reach the second stage of the GAN game, to generate 8x8 images, and so on, until we reach a GAN game to generate 1024x1024 images. To avoid discontinuity between stages of the GAN game, each new layer is "blended in" (Figure 2 of the paper). For example, this is how the second stage GAN game starts: Just before, the GAN game consists of the pair $G_N, D_N$ generating and discriminating 4x4 images.
Just after, the GAN game consists of the pair $((1-\alpha) + \alpha \cdot G_{N-1}) \circ u \circ G_N,\; D_N \circ d \circ ((1-\alpha) + \alpha \cdot D_{N-1})$ generating and discriminating 8x8 images. Here, the functions $u, d$ are image up- and down-sampling functions, and $\alpha$ is a blend-in factor (much like an alpha in image compositing) that smoothly glides from 0 to 1. === StyleGAN === StyleGAN is designed as a combination of Progressive GAN with neural style transfer. The key architectural choice of StyleGAN-1 is a progressive growth mechanism, similar to Progressive GAN. Each generated image starts as a constant $4 \times 4 \times 512$ array, and is repeatedly passed through style blocks. Each style block applies a "style latent vector" via an affine transform ("adaptive instance normalization"), similar to how neural style transfer uses the Gramian matrix. It then adds noise and normalizes (subtracts the mean, then divides by the standard deviation). At training time, usually only one style latent vector is used per image generated, but sometimes two ("mixing regularization") in order to encourage each style block to independently perform its stylization without expecting help from other style blocks (since they might receive an entirely different style latent vector). After training, multiple style latent vectors can be fed into each style block. Those fed to the lower layers control the large-scale styles, and those fed to the higher layers control the fine-detail styles. Style-mixing between two images $x, x'$ can be performed as well. First, run a gradient descent to find $z, z'$ such that $G(z) \approx x$ and $G(z') \approx x'$. This is called "projecting an image back to style latent space".
Then, $z$ can be fed to the lower style blocks, and $z'$ to the higher style blocks, to generate a composite image that has the large-scale style of $x$ and the fine-detail style of $x'$. Multiple images can also be composed this way. === StyleGAN2 === StyleGAN2 improves upon StyleGAN in two ways. One, it applies the style latent vector to transform the convolution layer's weights instead, thus solving the "blob" problem. Roughly speaking, the "blob" problem arises because using the style latent vector to normalize the generated image destroys useful information. Consequently, the generator learned to create a "distraction" in the form of a large blob, which absorbs most of the effect of normalization (somewhat similar to using flares to distract a heat-seeking missile). Two, it uses residual connections, which help it avoid the phenomenon where certain features are stuck at intervals of pixels. For example, the seam between two teeth may be stuck at pixels divisible by 32, because the generator learned to generate teeth during stage N-5, and consequently could only generate primitive teeth at that stage, before scaling up 5 times (thus intervals of 32). This was updated by StyleGAN2-ADA ("ADA" stands for "adaptive discriminator augmentation"), which uses invertible data augmentation. It also tunes the amount of data augmentation applied by starting at zero, and gradually increasing it until an "overfitting heuristic" reaches a target level, thus the name "adaptive". === StyleGAN3 === StyleGAN3 improves upon StyleGAN2 by solving the "texture sticking" problem, which can be seen in the official videos. The authors analyzed the problem using the Nyquist–Shannon sampling theorem, and argued that the layers in the generator learned to exploit the high-frequency signal in the pixels they operate upon.
To solve this, they proposed imposing strict lowpass filters between the generator's layers, so that the generator is forced to operate on the pixels in a way faithful to the continuous signals they represent, rather than operating on them as merely discrete signals. They further imposed rotational and translational invariance by using more signal filters. The resulting StyleGAN3 is able to generate images that rotate and translate smoothly, without texture sticking.
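The blend-in step from the Progressive GAN section and the normalization inside a style block can be sketched in a few lines. This is a minimal illustration in plain Python on nested lists, not Nvidia's implementation: the function names `upsample_nearest`, `blend_in`, and `adain` are hypothetical, nearest-neighbour upsampling is an assumed choice for $u$, and the style `scale` and `bias` are passed in directly rather than produced by a learned affine map.

```python
import math

def upsample_nearest(img):
    """Double the height and width of a 2D grid by repeating each pixel (the map u)."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def blend_in(img_low, new_layer, alpha):
    """Compute ((1 - alpha) * identity + alpha * new_layer) applied to u(img_low).

    new_layer stands in for the freshly added generator stage G_{N-1};
    alpha glides from 0 (new layer ignored) to 1 (new layer fully active).
    """
    up = upsample_nearest(img_low)
    refined = new_layer(up)
    return [
        [(1 - alpha) * u + alpha * r for u, r in zip(up_row, ref_row)]
        for up_row, ref_row in zip(up, refined)
    ]

def adain(channel, scale, bias, eps=1e-8):
    """Normalize one feature channel (subtract mean, divide by standard
    deviation), then apply a style-dependent scale and bias."""
    flat = [p for row in channel for p in row]
    mean = sum(flat) / len(flat)
    var = sum((p - mean) ** 2 for p in flat) / len(flat)
    std = math.sqrt(var + eps)
    return [[scale * (p - mean) / std + bias for p in row] for row in channel]
```

With `alpha = 0` the output of `blend_in` is just the upsampled low-resolution image, so the new stage can be introduced without a sudden jump in what the discriminator sees.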

    Read more →
  • How to Choose an AI Humanizer

    How to Choose an AI Humanizer

    In search of the best AI humanizer? An AI humanizer is software that uses machine learning to help you get more done — it turns a rough idea into a polished result in seconds. When choosing one, weigh output quality, pricing, export formats, and how well it fits the tools you already use. Whether you are a beginner or a pro, the right AI humanizer slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.

    Read more →
  • Permutation automaton

    Permutation automaton

    In automata theory, a permutation automaton, or pure-group automaton, is a deterministic finite automaton such that each input symbol permutes the set of states. Formally, a deterministic finite automaton A may be defined by the tuple $(Q, \Sigma, \delta, q_0, F)$, where $Q$ is the set of states of the automaton, $\Sigma$ is the set of input symbols, $\delta$ is the transition function that takes a state $q$ and an input symbol $x$ to a new state $\delta(q, x)$, $q_0$ is the initial state of the automaton, and $F$ is the set of accepting states (also: final states) of the automaton. A is a permutation automaton if and only if, for every two distinct states $q_i$ and $q_j$ in $Q$ and every input symbol $x$ in $\Sigma$, $\delta(q_i, x) \neq \delta(q_j, x)$. A formal language is p-regular (also: a pure-group language) if it is accepted by a permutation automaton. For example, the set of strings of even length forms a p-regular language: it may be accepted by a permutation automaton with two states in which every transition replaces one state by the other. == Applications == The pure-group languages were the first interesting family of regular languages for which the star height problem was proved to be computable. Another mathematical problem on regular languages is the separating words problem, which asks for the size of a smallest deterministic finite automaton that distinguishes between two given words of length at most $n$ by accepting one word and rejecting the other. The known upper bound in the general case is $O(n^{2/5}(\log n)^{3/5})$. The problem was later studied for the restriction to permutation automata. In this case, the known upper bound changes to $O(n^{1/2})$.
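The defining condition above is easy to check mechanically. The sketch below is illustrative only: the dictionary encoding of $\delta$ and the helper names `is_permutation_automaton` and `accepts` are assumptions made for this example. It tests that every input symbol maps the finite state set onto itself without collisions, then runs the two-state even-length automaton described in the text.

```python
def is_permutation_automaton(states, alphabet, delta):
    """Return True if every input symbol induces a permutation of the states.

    delta is a dict mapping (state, symbol) -> state. On a finite state set,
    a map is injective exactly when its image has as many elements as the set.
    """
    states = set(states)
    for x in alphabet:
        images = {delta[(q, x)] for q in states}
        if not images <= states or len(images) != len(states):
            return False
    return True

def accepts(delta, start, accepting, word):
    """Run the deterministic automaton on word and report acceptance."""
    q = start
    for x in word:
        q = delta[(q, x)]
    return q in accepting

# Two states; every transition swaps them. State 0 is initial and accepting,
# so exactly the strings of even length are accepted.
even_delta = {(0, "a"): 1, (1, "a"): 0}
print(is_permutation_automaton({0, 1}, {"a"}, even_delta))
print(accepts(even_delta, 0, {0}, "aaaa"))
```

An automaton in which two states move to the same target on some symbol fails the check, since that symbol no longer acts as a permutation.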

    Read more →
  • Devi Parikh

    Devi Parikh

    Devi Parikh is an American computer scientist. == Career == Parikh earned her PhD in Electrical and Computer Engineering at Carnegie Mellon University. She has served as a professor at Virginia Tech and Georgia Tech, and as of 2022 she is a research director at Meta. == Research == Parikh's research focuses on computer vision and natural language processing. In 2015, Parikh and her students at Virginia Tech worked on AI for Visual Question Answering (VQA). This technology allows users to ask questions about pictures, e.g. "Is this a vegetarian pizza?" Parikh's VQA dataset has been used to evaluate over 30 AI models. In 2017, Parikh published a conversational agent called ParlAI. In 2020, she developed an AI system that generates dance moves in sync with songs. In 2022, Parikh and a team at Meta developed Make-a-Video, a text-to-video AI model that is based on the diffusion algorithm. == Awards == 2017 IJCAI Computers and Thought Award 2011 ICCV Best-Paper Award ("Marr Prize")

    Read more →
  • Affinity (software)

    Affinity (software)

    Affinity is a graphics editor developed by Serif, a subsidiary of Canva. It is simultaneously a vector graphics editor, a raster graphics editor and a desktop publishing application. It was first released in 2025 as a successor to Serif's Affinity Designer, Affinity Photo and Affinity Publisher, uniting the three editors into one application. While the previous versions competed individually against Adobe's Illustrator, Photoshop, and InDesign, Affinity 3.0 integrates their functionality into a single application. It uses a freemium model monetized by AI features exclusive to Canva Pro subscribers. == Functionality == Affinity is divided into a number of workspaces ("studios"), which are equivalent to the previous suite of Affinity applications: "vector" for vector graphics (Designer), "pixel" for raster editing (Photo), and "layout" for desktop publishing (Publisher). Additionally, it introduces the ability to create custom workspaces. The application supports real-time previews and non-destructive editing, which are based on GPU acceleration. Supported file formats include Adobe Photoshop, InDesign and Illustrator files, PDF, SVG, and TIFF, as well as a custom .af file format. === Vector editing === === Raster editing === Affinity includes photo editing tools including adjustments, masks, blend modes, batch processing, and retouching facilities. Additionally, the application can develop RAW files, similar to Adobe Lightroom. === Desktop publishing === Publishing features include master pages, text styles, and advanced typography. === AI features === The application supports Canva's existing AI features, such as background removal and generative fill. This requires a Canva subscription. == Development == === Background and acquisition (2014–2024) === Serif launched the original Affinity suite starting with Affinity Designer in 2014, followed by Photo (2015) and Publisher (2019). 
The software gained popularity for its one-time purchase model, contrasting with Adobe's subscription-based Creative Cloud. In November 2022, Serif released Version 2 of the suite, introducing a "Universal License" that covered all three apps across all platforms. In March 2024, Canva acquired Serif for approximately A$580 million (£300 million). Following user backlash regarding a potential shift to subscriptions, Canva and Serif issued a joint "Pledge" committing to four key principles: fair pricing, no mandatory subscriptions, perpetual licenses for existing products, and continued development of Affinity as a standalone suite. === Unified release (2025) === In September 2025, Serif pulled all existing versions of Affinity Designer, Affinity Photo and Affinity Publisher from sale ahead of an upcoming announcement on 30 October; also ahead of the announcement, the iPadOS versions of the Affinity suite became free on the App Store. During a "Creative Freedom" keynote on 30 October 2025, Canva released a new version now simply branded as "Affinity" (also known as "Affinity by Canva"), and referred to internally as version 3.0. Version 3 drops the separate applications and integrates their functionality into a single application, and adds the ability to export directly to the Canva platform. It also adds a Canva AI studio, including background removal, "Expand & Edit", and generative fill. As of version 3, Affinity has switched to a freemium model; it is now available at no charge to users, although access to Canva AI features is locked behind the existing Canva Pro subscription service. Serif stated that the perpetually-licensed version 2 will remain available to existing owners, although it will no longer be actively maintained. The new version is currently available for macOS and Windows only, with an iPadOS version to be released soon.
== Reception == The change in business model by Canva in 2025 was met with mixed reception, including concerns about its incorporation of AI features. Some users were concerned that their projects would be used for machine learning purposes, or that future versions would suffer from a lack of maintenance or become adware. Additionally, some felt it turned Affinity into fundamentally subscription-based software, given the prevalence of these features in professional contexts. Affinity publicly stated on social media that it would remain "free forever", users' projects would not be used to train AI models, and that "Canva has built a sustainable business model that allows this kind of generosity. And when more professionals use Affinity, Canva can sell more seats into businesses."

    Read more →
  • The Best Free AI Copywriting Tool for Beginners

    The Best Free AI Copywriting Tool for Beginners

    Curious about the best AI copywriting tool? An AI copywriting tool is software that uses machine learning to help you get more done — it combines speed, accuracy, and an interface that just works. Hands-on testing shows real-world results vary, so a short free trial is the smartest way to decide. Whether you are a beginner or a pro, the right AI copywriting tool slots into your workflow and pays for itself fast. This guide breaks down the top picks, their pros and cons, and who each one is best for.

    Read more →
  • Larry Heck

    Larry Heck

    Larry Paul Heck is the Rhesa Screven Farmer, Jr., Advanced Computing Concepts Chair, Georgia Research Alliance Eminent Scholar, Co-Executive Director of the Machine Learning Center and Professor at the Georgia Institute of Technology. His career spans many of the sub-disciplines of artificial intelligence, including conversational AI, speech recognition and speaker recognition, natural language processing, web search, online advertising and acoustics. He is best known for his role as a co-founder of the Microsoft Cortana Personal Assistant and his early work in deep learning for speech processing. == Education and career == Larry Heck was born in Havre, Montana. After receiving the Bachelor of Science in electrical engineering at Texas Tech University, he was admitted to graduate school at the Georgia Institute of Technology in 1986. Heck received the MSEE in 1989 and the PhD in 1991 under advisor Prof. James H. McClellan. From 1992 to 1998, he was a senior research engineer at SRI International with the Acoustics and Radar Technology Lab (ARTL) and Speech Technology and Research (STAR) Lab, and in 1998 joined Nuance Communications, serving as vice president of R&D. Funded by the US government's NSA and DARPA from 1995-1998, Heck led the SRI team that was the first to successfully create large-scale deep neural network (DNN) deep learning technology in the field of speech processing. The deep learning technology was used to win the 1998 National Institute of Standards and Technology Speaker Recognition evaluation. The approach trained a 5-layer deep neural network, with the first two layers used as a (learned) feature extractor. To stabilize the training of the DNN, a weight normalization method was used (later rediscovered in 2010 by Xavier, et.al). Heck deployed this DNN in 1999 with Nuance Communications at the Home Shopping Network, representing the first major industrial application of deep learning with over 100K Nuance Verifier voiceprints. 
From 2005 to 2008, he was vice president of search & advertising quality at Yahoo!. In 2008, Heck and Ron Brachman combined search & advertising quality with Yahoo! Research to form Yahoo! Labs. Beginning in 2009, he was the chief scientist of speech products at Microsoft. In this role, he established the vision, mission and long-range plan and hired the initial team to create Microsoft’s digital-personal-assistant Cortana. Heck was named a Microsoft Distinguished Engineer in 2012 and joined Microsoft Research that same year. In 2014, he joined Google as a principal research scientist, where he founded the deep learning-based conversational AI team "Deep Dialogue". The team works on advanced research for the Google Assistant. In 2017, Heck joined Samsung as SVP and co-head of global AI Research. In 2019, he became head of Bixby (virtual assistant) North America and the CEO of Viv Labs, an independent subsidiary of Samsung. In that same year, Heck led one of the first large scale deployments of Transformer-Based LLMs as part of the Bixby Categories launch at the 2019 Samsung Developer Conference. In 2021, Heck returned to the Georgia Institute of Technology as a Professor. == Awards and honors == Larry Heck was named Fellow of the Institute of Electrical and Electronics Engineers (IEEE) in 2016 for leadership in application of machine learning to spoken and text language processing. Heck was inducted as a Fellow of the National Academy of Inventors (NAI) in 2024. Heck received the 2017 Academy of Distinguished Engineering Alumni Award from the Georgia Institute of Technology. In the same year, he also received the Texas Tech University Whitacre College of Engineering Distinguished Engineer Award. 
Larry Heck has several best papers including the 2020 IEEE Signal Processing Society (SPS) Best Paper Award: “Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding” published in the IEEE/ACM Transactions on Audio, Speech, and Language Processing in March 2015, and the 2020 ACM Conference on Information and Knowledge Management (CIKM) Test of Time Award for the paper "Learning Deep Structured Semantic Models for Web Search using Clickthrough Data".

    Read more →
  • Scott Fahlman

    Scott Fahlman

    Scott Elliott Fahlman (born March 21, 1948) is an American computer scientist and Professor Emeritus at Carnegie Mellon University's Language Technologies Institute and Computer Science Department. He is notable for early work on automated planning and scheduling in a blocks world, on semantic networks, on neural networks (especially the cascade correlation algorithm), on the programming languages Dylan, and Common Lisp (especially CMU Common Lisp), and he was one of the founders of Lucid Inc. During the period when it was standardized, he was recognized as "the leader of Common Lisp." From 2006 to 2015, Fahlman was engaged in developing a knowledge base named Scone, based in part on his thesis work on the NETL Semantic Network. He also is credited with coining the use of the emoticon. == Life and career == Fahlman was born in Medina, Ohio, the son of Lorna May (Dean) and John Emil Fahlman. He attended the Massachusetts Institute of Technology (MIT), where he received a Bachelor of Science (B.S.) and Master of Science (M.S.) degree in electrical engineering and computer science in 1973, and a Doctor of Philosophy (Ph.D.) in artificial intelligence in 1977. He has noted that his doctoral diploma says the degree was awarded for "original research as demonstrated by a thesis in the field of Artificial Intelligence" and suggested that it may be the first doctorate to use that term. He is a fellow of the American Association for Artificial Intelligence. Fahlman acted as thesis advisor for Donald Cohen, David B. McDonald, David S. Touretzky, Skef Wholey, Justin Boyan, Michael Witbrock, and Alicia Tribble Sagae. From May 1996 to July 2001, Fahlman directed the Justsystem Pittsburgh Research Center. === Boltzmann Machine (1983) === In 1983, Fahlman, Geoffrey Hinton, and Terry Sejnowski published a paper in Proceedings of the AAAI-83 Conference, Washington DC, August 1983. 
The paper was titled "Massively Parallel Architectures for AI: NETL, Thistle and Boltzmann Machines". === Emoticons === Fahlman was not the first to suggest the concept of the emoticon – a similar concept for a marker appeared in an article of Reader's Digest in May 1967, although that idea was never put into practice. In an interview printed in The New York Times in 1969, Vladimir Nabokov noted: "I often think there should exist a special typographical sign for a smile – some sort of concave mark, a supine round bracket." Fahlman is credited with originating the first smiley emoticon, which he thought would help people on a message board at Carnegie Mellon to distinguish serious posts from jokes. He proposed the use of :-) and :-( for this purpose, and the symbols caught on. The original message from which these symbols originated was posted on 19 September 1982. The message was recovered by Jeff Baird on 10 September 2002 and read: 19-Sep-82 11:44 Scott E Fahlman :-) From: Scott E Fahlman I propose that the following character sequence for joke markers: :-) Read it sideways. Actually, it is probably more economical to mark things that are NOT jokes, given current trends. For this, use :-(

    Read more →
  • Contrastive Language-Image Pre-training

    Contrastive Language-Image Pre-training

    Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective. This method has enabled broad applications across multiple domains, including cross-modal retrieval, text-to-image generation, and aesthetic ranking. == Algorithm == The CLIP method trains a pair of models contrastively. One model takes in a piece of text as input and outputs a single vector representing its semantic content. The other model takes in an image and similarly outputs a single vector representing its visual content. The models are trained so that the vectors corresponding to semantically similar text-image pairs are close together in the shared vector space, while those corresponding to dissimilar pairs are far apart. To train a pair of CLIP models, one would start by preparing a large dataset of image-caption pairs. During training, the models are presented with batches of $N$ image-caption pairs. Let the outputs from the text and image models be respectively $v_1, \ldots, v_N$ and $w_1, \ldots, w_N$. Two vectors are considered "similar" if their dot product is large. The loss incurred on this batch is the multi-class N-pair loss, which is a symmetric cross-entropy loss over similarity scores: $-\frac{1}{N}\sum_{i}\ln\frac{e^{v_i\cdot w_i/T}}{\sum_{j}e^{v_i\cdot w_j/T}} - \frac{1}{N}\sum_{j}\ln\frac{e^{v_j\cdot w_j/T}}{\sum_{i}e^{v_i\cdot w_j/T}}$. In essence, this loss function encourages the dot product between matching image and text vectors ($v_i\cdot w_i$) to be high, while discouraging high dot products between non-matching pairs.
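The symmetric loss above can be written out directly for a small batch. The following is a plain-Python sketch for illustration, not OpenAI's training code: the names `clip_loss`, `dot`, and `logsumexp` are hypothetical, the embeddings are given as lists of floats, and the temperature `T` is passed as a fixed positive number rather than learned.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def clip_loss(v, w, T):
    """Symmetric cross-entropy over similarity scores v_i . w_j / T.

    v: text embeddings, w: image embeddings; pair i in v matches pair i in w.
    """
    n = len(v)
    logits = [[dot(v[i], w[j]) / T for j in range(n)] for i in range(n)]
    loss = 0.0
    # First term: for each text i, pick out its image among all images.
    for i in range(n):
        loss -= (logits[i][i] - logsumexp(logits[i])) / n
    # Second term: for each image j, pick out its text among all texts.
    for j in range(n):
        column = [logits[i][j] for i in range(n)]
        loss -= (logits[j][j] - logsumexp(column)) / n
    return loss

texts = [[1.0, 0.0], [0.0, 1.0]]
images_matched = [[1.0, 0.0], [0.0, 1.0]]
images_swapped = [[0.0, 1.0], [1.0, 0.0]]
print(clip_loss(texts, images_matched, 1.0))
print(clip_loss(texts, images_swapped, 1.0))
```

The correctly paired batch yields the smaller loss, and lowering `T` sharpens the softmax so that well-separated matching pairs are penalized even less.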
The parameter $T > 0$ is the temperature, which is parameterized in the original CLIP model as $T = e^{-\tau}$ where $\tau \in \mathbb{R}$ is a learned parameter. Other loss functions are possible. For example, Sigmoid CLIP (SigLIP) proposes the following loss function: $L = \frac{1}{N}\sum_{i,j\in 1:N} f((2\delta_{i,j}-1)(e^{\tau} w_i\cdot v_j + b))$ where $f(x) = \ln(1+e^{-x})$ is the negative log sigmoid loss, and the Kronecker delta symbol $\delta_{i,j}$ is 1 if $i = j$ else 0. == CLIP models == While the original model was developed by OpenAI, subsequent models have been trained by other organizations as well. === Image model === The image encoding models used in CLIP are typically vision transformers (ViT). The naming convention for these models often reflects the specific ViT architecture used. For instance, "ViT-L/14" means a "vision transformer large" (compared to other models in the same series) with a patch size of 14, meaning that the image is divided into 14-by-14 pixel patches before being processed by the transformer. The size indicator ranges over B, L, H, G (base, large, huge, giant), in that order. Other than ViT, the image model is typically a convolutional neural network, such as ResNet (in the original series by OpenAI), or ConvNeXt (in the OpenCLIP model series by LAION). Since the output vectors of the image model and the text model must have exactly the same length, both the image model and the text model have fixed-length vector outputs, which in the original report is called "embedding dimension". For example, in the original OpenAI model, the ResNet models have embedding dimensions ranging from 512 to 1024, and for the ViTs, from 512 to 768.
Its implementation of ViT was the same as the original one, with one modification: after position embeddings are added to the initial patch embeddings, there is a LayerNorm. Its implementation of ResNet was the same as the original one, with 3 modifications: (1) At the start of the CNN (the "stem"), they used three stacked 3x3 convolutions instead of a single 7x7 convolution. (2) There is an average pooling of stride 2 at the start of each downsampling convolutional layer (they called it rect-2 blur pooling). This has the effect of blurring images before downsampling, for antialiasing. (3) The final convolutional layer is followed by a multiheaded attention pooling. ALIGN, a model with similar capabilities trained by researchers from Google, used EfficientNet, a kind of convolutional neural network. === Text model === The text encoding models used in CLIP are typically Transformers. In the original OpenAI report, they reported using a Transformer (63M-parameter, 12-layer, 512-wide, 8 attention heads) with lower-cased byte pair encoding (BPE) with 49152 vocabulary size. Context length was capped at 76 for efficiency. Like GPT, it was decoder-only, with only causally-masked self-attention. Its architecture is the same as GPT-2. Like BERT, the text sequence is bracketed by two special tokens [SOS] and [EOS] ("start of sequence" and "end of sequence"). The text encoding of the input sequence is obtained by taking the activations of the highest layer of the transformer on the [EOS] token, applying LayerNorm, then a final linear map. The final linear map has output dimension equal to the embedding dimension of whatever image encoder it is paired with. These models all had context length 77 and vocabulary size 49408. ALIGN used BERT of various sizes.
== Dataset == === WebImageText === The CLIP models released by OpenAI were trained on a dataset called "WebImageText" (WIT) containing 400 million pairs of images and their corresponding captions scraped from the internet. The total number of words in this dataset is similar in scale to the WebText dataset used for training GPT-2, which contains about 40 gigabytes of text data. The dataset contains 500,000 text-queries, with up to 20,000 (image, text) pairs per query. The text-queries were generated by starting with all words occurring at least 100 times in English Wikipedia, then extended by bigrams with high mutual information, names of all Wikipedia articles above a certain search volume, and WordNet synsets. The dataset is private and has not been released to the public, and there is no further information on it. ==== Data preprocessing ==== For the CLIP image models, the input images are preprocessed by first dividing each of the R, G, B values of an image by the maximum possible value, so that these values fall between 0 and 1, then subtracting by [0.48145466, 0.4578275, 0.40821073], and dividing by [0.26862954, 0.26130258, 0.27577711]. The rationale was that these are the mean and standard deviations of the images in the WebImageText dataset, so this preprocessing step roughly whitens the image tensor. These numbers slightly differ from the standard preprocessing for ImageNet, which uses [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225]. If the input image does not have the same resolution as the native resolution (224×224 for all except ViT-L/14@336px, which has 336×336 resolution), then the input image is first scaled by bicubic interpolation, so that its shorter side is the same as the native resolution, then the central square of the image is cropped out. === Others === ALIGN used over one billion image-text pairs, obtained by extracting images and their alt-tags from online crawling. 
The method was described as similar to how the Conceptual Captions dataset was constructed, but instead of complex filtering, they only applied a frequency-based filtering. Later models trained by other organizations used published datasets. For example, LAION trained OpenCLIP with published datasets LAION-400M, LAION-2B, and DataComp-1B. == Training == In the original OpenAI CLIP report, they reported training 5 ResNet and 3 ViT models (ViT-B/32, ViT-B/16, ViT-L/14). Each was trained for 32 epochs. The largest ResNet model took 18 days to train on 592 V100 GPUs. The largest ViT model took 12 days on 256 V100 GPUs. All ViT models were trained on 224×224 image resolution. The ViT-L/14 was then boosted to 336×336 resolution by FixRes, resulting in the ViT-L/14@336px model. They found this was the best-performing model. In the OpenCLIP series, the ViT-L/14 model was trained on 384 A100 GPUs on the LAION-2B dataset, for 160 epochs for a total of 32B samples seen. == Applications == === Cross-modal retrieval === CLIP's cross-modal retrieval enables the alignment of visual and textual data in a shared latent space, allowing users to retrieve images based on text descriptions and vice versa, without the need for explicit image annotations. In text-to-image retrieval, users input descriptive text, and CLIP retrieves images with matching embeddings. In image-to-text retrieval, images are used to find related text content. CLIP’s ability to connect vis

  • Comparison of machine translation applications

    Machine translation is an algorithm which attempts to translate text or speech from one natural language to another.

    == General information ==

    Basic general information for popular machine translation applications.

    == Languages features comparison ==

    The following table compares the number of languages that each machine translation program can translate between. (Moses and Moses for Mere Mortals allow you to train translation models for any language pair, though the user must provide collections of translated texts (a parallel corpus). The Moses site provides links to training corpora.) This is not an all-encompassing list. Some applications support many more language pairs than those listed below. This is a general comparison of key languages only. A full and accurate list of the language pairs supported by each product can be found on that product's website.

    === Multi-pair translations ===

    === Paired translations ===

  • How to Choose an AI Writing Assistant

    Comparing the best AI writing assistants? An AI writing assistant is software that uses machine learning to help you get more done: it lowers the barrier so anyone can produce professional output. Privacy matters too: check whether your data trains the model and whether a no-log or enterprise tier is available. Whether you are a beginner or a pro, the right AI writing assistant slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.

  • Vlado Keselj

    Vlado Keselj (Vlado Kešelj) is a Serbian-Canadian computer scientist known for his research in natural language processing and authorship attribution. He is a professor at Dalhousie University.

    == Education ==

    As a high school student in Yugoslavia, Keselj competed in the 1987 International Mathematical Olympiad, earning a bronze medal. He earned his Ph.D. in 2002 at the University of Waterloo, with the dissertation Modular Stochastic HPSGs for Question Answering supervised by Nick Cercone.

    == Awards ==

    Vlado Keselj is a recipient of the 2019 CAIAC Distinguished Service Award, awarded by the Canadian Artificial Intelligence Association (CAIAC).

    == Selected publications ==

    Kešelj, V., Peng, F., Cercone, N., & Thomas, C. (2003, August). N-gram-based author profiles for authorship attribution. In Proceedings of the Conference of the Pacific Association for Computational Linguistics, PACLING 2003 (Vol. 3, pp. 255–264).

  • Kuki AI

    Kuki is an embodied AI bot designed for use in the metaverse. Formerly known as Mitsuku, Kuki is a chatbot built with the Pandorabots framework. The bot has won the Loebner Prize five times.

    == Features ==

    Kuki claims to be an 18-year-old female chatbot from the Metaverse, and the developers have stated that she has been in development since 2005. Early work by one of the company's co-founders inspired the Spike Jonze movie Her. As of 2015, she conversed, on average, more than a quarter of a million times daily, and it was estimated that 5 million unique users had interacted with her between 2016 and 2020.

    == Virtual talent, model, and influencer ==

    Kuki has appeared as a virtual model in Vogue Business and at Crypto Fashion Week, where she modelled NFTs and spoke about the future of digital fashion. In 2021, Kuki modelled five digital looks from emerging Vogue Talents designers for Italian Vogue, which sold out as NFTs in under an hour. Kuki has also modelled for H&M on Instagram in a digital campaign that resulted in an "11x increase in ad recall", according to a case study by Meta.

    == Awards ==

    As of 2019, Kuki had been awarded the Loebner Prize five times, more than any other entrant. In 2020, Kuki competed against Facebook AI's Blenderbot in a 24/7 verbal sparring match called "Bot Battle", winning 79% of the audience vote.

  • Boris Katz

    Boris Gershevich Katz (Russian: Борис Гершевич Кац; born October 5, 1947) is an American principal research scientist (computer scientist) at the Computer Science and Artificial Intelligence Laboratory of the Massachusetts Institute of Technology (MIT) in Cambridge and head of the laboratory's InfoLab Group. His research interests include natural language processing and understanding, machine learning, and intelligent information access. His brother Victor Kac is a mathematician at MIT. He was able to leave the USSR with the help of U.S. Senator Ted Kennedy, before the end of the Cold War. Over the last several decades, Boris Katz has been developing the START natural language system, which allows the user to access various types of information using English.

    == Biography ==

    Boris Katz was born on October 5, 1947, in Chișinău to Hersh Katz (died 1976) and Hayki (Klara) Landman (born 1921 in Lipcani, Briceni District; died 2006 in Cambridge, Middlesex County), who had moved from Lipcani, a town in northern Bessarabia, to Chișinău before the war. He graduated from Moscow State University and defended his thesis for the degree of Candidate of Physical and Mathematical Sciences in 1975 under the supervision of Evgenii M. Landis. In November 1978, he left for the United States thanks to the personal intervention of Senator Edward M. Kennedy. He currently lives in Boston and heads the InfoLab research team at MIT's Computer Science and Artificial Intelligence Laboratory. Boris Katz is the creator of the START information processing system (available on the Internet since 1993) and the author of several works in the field of processing, generation and perception of natural languages, machine learning, and accelerated access to multimedia information.

    == Family ==

    Brothers: Victor Kac (Victor Gershevich Katz), American mathematician, professor at the Massachusetts Institute of Technology; Mikhail Gershevich Katz, Israeli mathematician, graduate of Harvard University and Columbia University (Ph.D., 1984), professor at Bar-Ilan University, author of the monograph "Systolic Geometry and Topology" (Mathematical Surveys and Monographs, vol. 137. American Mathematical Society: Providence, 2007).

    Daughter: Luba Katz, a bioinformatics scientist (her husband is Alan Jasanoff, a neuroimaging scientist and professor at MIT, the son of Harvard University professors Jay Jasanoff and Sheila Jasanoff).

    == Past works ==

    A Knowledge Entry System for Subject Matter Experts: the goal of the SHAKEN project is to enable subject matter experts, without any assistance from AI technologists, to assemble models of processes and mechanisms so that questions about them can be answered by declarative inference and simulation.
    Exploiting lexical regularities in designing natural language systems
    Word sense disambiguation for information retrieval
    HIKE (HPKB integrated knowledge environment): a query interface and integrated knowledge environment for HPKB
    Quantitative evaluation of passage retrieval algorithms for question answering
    Sticky notes for the semantic web
    Question answering from the web using knowledge annotation and knowledge mining techniques
    The role of context in question answering systems

  • Is an AI Copywriting Tool Worth It in 2026?

    Looking for the best AI copywriting tool? An AI copywriting tool is software that uses machine learning to help you get more done — it can save you hours every week by automating repetitive work. Most options offer a generous free tier, with paid plans unlocking higher limits, faster processing, and team features. Whether you are a beginner or a pro, the right AI copywriting tool slots into your workflow and pays for itself fast. Read on for hands-on impressions, pricing tiers, and the standout features that matter.
