Self-supervised learning

Self-supervised learning

Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on externally-provided labels. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples, where one sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects. During SSL, the model learns in two steps. First, the task is solved based on an auxiliary or pretext classification task using pseudo-labels, which help to initialize the model parameters. Next, the actual task is performed with supervised or unsupervised learning. Self-supervised learning has produced promising results in recent years, and has found practical application in fields such as audio processing, and is being used by Facebook and others for speech recognition. == Pseudo-labels == Pseudo-labels are automatically generated labels that a model assigns to unlabeled data based on its own predictions. They are widely used in self-supervised and semi-supervised learning, where ground-truth annotations are limited or unavailable. By treating predicted labels as surrogate ground truth, learning algorithms can make use of large quantities of unlabeled data in the training process. Pseudo-labeling also plays an important role in systems that must adapt to concept drift, where the statistical properties of the data change over time. In these scenarios, the model may detect that an incoming instance deviates from previously learned behavior. The system then generates a classification result for that instance, and this predicted class is used as a pseudo-label for updating or retraining model components that are becoming outdated. This approach enables continuous adaptation in dynamic environments without requiring manual annotation. In many adaptive learning pipelines, pseudo-labels are chosen when the classifier produces sufficiently confident predictions, reducing the risk of propagating errors. These pseudo-labeled instances are then incorporated into training to refresh or evolve the model's understanding of emerging data patterns, particularly when existing components show signs of “aging” due to drift or distributional shifts. This strategy reduces reliance on manual labeling while helping maintain long-term model performance. == Types == === Autoassociative self-supervised learning === Autoassociative self-supervised learning is a specific category of self-supervised learning where a neural network is trained to reproduce or reconstruct its own input data. In other words, the model is tasked with learning a representation of the data that captures its essential features or structure, allowing it to regenerate the original input. The term "autoassociative" comes from the fact that the model is essentially associating the input data with itself. This is often achieved using autoencoders, which are a type of neural network architecture used for representation learning. Autoencoders consist of an encoder network that maps the input data to a lower-dimensional representation (latent space), and a decoder network that reconstructs the input from this representation. The training process involves presenting the model with input data and requiring it to reconstruct the same data as closely as possible. The loss function used during training typically penalizes the difference between the original input and the reconstructed output (e.g. mean squared error). By minimizing this reconstruction error, the autoencoder learns a meaningful representation of the data in its latent space. === Contrastive self-supervised learning === For a binary classification task, training data can be divided into positive examples and negative examples. Positive examples are those that match the target. For example, if training a classifier to identify birds, the positive training data would include images that contain birds. Negative examples would be images that do not. Contrastive self-supervised learning uses both positive and negative examples. The loss function in contrastive learning is used to minimize the distance between positive sample pairs, while maximizing the distance between negative sample pairs. An early example uses a pair of 1-dimensional convolutional neural networks to process a pair of images and maximize their agreement. Contrastive Language-Image Pre-training (CLIP) allows joint pretraining of a text encoder and an image encoder, such that a matching image-text pair have image encoding vector and text encoding vector that span a small angle (having a large cosine similarity). InfoNCE (Noise-Contrastive Estimation) is a method to optimize two models jointly, based on Noise Contrastive Estimation (NCE). Given a set X = { x 1 , … x N } {\displaystyle X=\left\{x_{1},\ldots x_{N}\right\}} of N {\displaystyle N} random samples containing one positive sample from p ( x t + k ∣ c t ) {\displaystyle p\left(x_{t+k}\mid c_{t}\right)} and N − 1 {\displaystyle N-1} negative samples from the 'proposal' distribution p ( x t + k ) {\displaystyle p\left(x_{t+k}\right)} , it minimizes the following loss function: L N = − E X [ log ⁡ f k ( x t + k , c t ) ∑ x j ∈ X f k ( x j , c t ) ] {\displaystyle {\mathcal {L}}_{\mathrm {N} }=-\mathbb {E} _{X}\left[\log {\frac {f_{k}\left(x_{t+k},c_{t}\right)}{\sum _{x_{j}\in X}f_{k}\left(x_{j},c_{t}\right)}}\right]} === Non-contrastive self-supervised learning === Non-contrastive self-supervised learning (NCSSL) uses only positive examples. Counterintuitively, NCSSL converges on a useful local minimum rather than reaching a trivial solution, with zero loss. For the example of binary classification, it would trivially learn to classify each example as positive. Effective NCSSL requires an extra predictor on the online side that does not back-propagate on the target side. === Joint-Embedding and Predictive Architectures === A major class of self-supervised learning moves beyond contrastive pairs, instead maximizing the agreement between views while preventing collapse through statistical constraints. Rooted in Deep Canonical Correlation Analysis (Deep CCA), this approach includes Joint-Embedding Architectures (JEA) like Barlow Twins and VICReg, which enforce covariance constraints to learn invariant representations without negative sampling. Deep Latent Variable Path Modelling (DLVPM) generalizes this to multimodal systems, using path models to enforce correlation and orthogonality across diverse data types. In 2022 Yann LeCun introduced Joint-Embedding Predictive Architectures (JEPA) as a step towards decision making, reasoning, and autonomous human intelligence in machines, including self-improvement through autonomous learning. Founded in representation learning, LeCun included the concept of a “world model” in JEPA which aims to enable machines to replicate human intellect by providing machines with a concept for the world in which they exist. Unlike autoencoders, JEPAs operate entirely in latent space, avoiding pixel-level noise to focus on semantic structure. Rather than just learning invariance, JEPAs learn by predicting masked latent representations from visible context. JEPA has been applied to domains such as image analysis, audio processing, and motion in images and video. == Comparison with other forms of machine learning == SSL belongs to supervised learning methods insofar as the goal is to generate a classified output from the input. At the same time, however, it does not require the explicit use of labeled input-output pairs. Instead, correlations, metadata embedded in the data, or domain knowledge present in the input are implicitly and autonomously extracted from the data. These supervisory signals, extracted from the data, can then be used for training. SSL is similar to unsupervised learning in that it does not require labels in the sample data. Unlike unsupervised learning, however, learning is not done using inherent data structures. Semi-supervised learning combines supervised and unsupervised learning, requiring only a small portion of the learning data be labeled. In transfer learning, a model designed for one task is reused on a different task. Training an autoencoder intrinsically constitutes a self-supervised process, because the output pattern needs to become an optimal reconstruction of the input pattern itself. However, in current jargon, the term 'self-supervised' often refers to tasks based on a pretext-task training setup

Lessac Technologies

Lessac Technologies, Inc. (LTI) is an American firm which develops voice synthesis software, licenses technology and sells synthesized novels as MP3 files. The firm currently has seven patents granted and three more pending for its automated methods of converting digital text into human-sounding speech, more accurately recognizing human speech and outputting the text representing the words and phrases of said speech, along with recognizing the speaker's emotional state. The LTI technology is partly based on the work of the late Arthur Lessac, a Professor of Theater at the State University of New York and the creator of Lessac Kinesensic Training, and LTI has licensed exclusive rights to exploit Arthur Lessac's copyrighted works in the fields of speech synthesis and speech recognition. Based on the view that music is speech and speech is music, Lessac's work and books focused on body and speech energies and how they go together. Arthur Lessac's textual annotation system, which was originally developed to assist actors, singers, and orators in marking up scripts to prepare for performance, is adapted in LTI's speech synthesis system as the basic representation of the speech to be synthesized (Lessemes), in contrast to many other systems which use a phonetic representation. LTI's software has two major components: (1) a linguistic front-end that converts plain text to a sequence of prosodic and phonosensory graphic symbols (Lessemes) based on Arthur Lessac's annotation system, which specify the speech units to be synthesized; (2) a signal-processing back-end that takes the Lessemes as acoustic data and produces human-sounding synthesized speech as output, using unit selection and concatenation. LTI's text-to-speech system came in second in the world-wide Blizzard Challenge 2011 and 2012. The first-place team in 2011 also employed LTI's "front-end" technology, but with its own back-end. The Blizzard Challenge, conducted by the Language Technologies Institute of Carnegie Mellon University, was devised as a way to evaluate speech synthesis techniques by having different research groups build voices from the same voice-actor recordings, and comparing the results through listening tests. LTI was founded in 2000 by H. Donald Wilson (chairman), a lawyer, LexisNexis entrepreneur and business associate of Arthur Lessac; and Gary A. Marple (chief inventor), after Marple suggested that Arthur Lessac's kinesensic voice training might be applicable to computational linguistics. After Wilson's death in 2006, his nephew John Reichenbach became the firm's CEO.

Project Debater

Project Debater is an IBM artificial intelligence project, designed to participate in a full live debate with expert human debaters. It follows on from the Watson project which played Jeopardy! == Development == Project Debater was developed at IBM's lab in Haifa, Israel. The project was proposed by Noam Slonim in 2011 as the IBM Research next Grand Challenge, following Deep Blue and the victory of Watson in Jeopardy! It was exposed for the first time in a closed media event at June 18, 2018, in San Francisco, under the leadership of Ranit Aharonov and Slonim, both from the IBM Research lab in Haifa, Israel. The AI technology debated two human debaters, Noa Ovadia, who was the 2016 Israeli debate champion and Dan Zafrir. The two debated on the topics "We should subsidize space exploration" and "Should we increase the use of telemedicine." A demonstration of Project Debater also aired on the Discovery Channel in June 2018 debating the question of whether sports gambling should be legalized. == Live Debate == On February 11, 2019, Project Debater was revealed to the world in a live debate in San Francisco. Nonpartisan media group Intelligence Squared U.S. Debates hosted the debate which was moderated by journalist John Donvan. The debate took place between Project Debater and Harish Natarajan, who holds the world record in number of debate competition victories. The motion was “We should subsidize preschools.” == That's Debatable Television Show == Project Debater was featured in a television series called “That’s Debatable” presented by Intelligence Squared U.S. Debates and Bloomberg Media. For each episode of “That’s Debatable,” Project Debater provided insight into three distinct debate topics on the redistribution of wealth, modern monetary theory, and a US-China space race. More than 5,000 arguments were submitted online from around the world across the three topics, which were then analyzed and distilled into key points that were highlighted on the television show and discussed by human debaters. == Artificial Intelligence Capabilities == To develop Project Debater, the IBM Research team had to endow the system with the following AI capabilities: Data-driven speech writing and delivery: Project Debater is the first demonstration of a computer that can digest massive corpora, and given a short description of a controversial topic, write a well-structured speech, and deliver it with clarity and purpose, while even incorporating humor where appropriate. Listening comprehension: the ability to identify the key concepts and claims hidden within long continuous spoken language. Four minutes of persuasive speech: the guarantee of producing four minutes of persuasive speech. Modeling human dilemmas: modeling the world of human controversy and dilemmas in a unique knowledge representation, enabling the system to suggest principled arguments as needed. An article on the project was published in Nature in March 2021.

AI@50

AI@50, formally known as the "Dartmouth Artificial Intelligence Conference: The Next Fifty Years" (July 13–15, 2006), was a conference organized by James H. Moor, commemorating the 50th anniversary of the Dartmouth workshop which effectively inaugurated the history of artificial intelligence. Five of the original ten attendees were present: Marvin Minsky, Ray Solomonoff, Oliver Selfridge, Trenchard More, and John McCarthy. While sponsored by Dartmouth College, General Electric, and the Frederick Whittemore Foundation, a $200,000 grant from the Defense Advanced Research Projects Agency (DARPA) called for a report of the proceedings that would: Analyze progress on AI's original challenges during the first 50 years, and assess whether the challenges were "easier" or "harder" than originally thought and why Document what the AI@50 participants believe are the major research and development challenges facing this field over the next 50 years, and identify what breakthroughs will be needed to meet those challenges Relate those challenges and breakthroughs against developments and trends in other areas such as control theory, signal processing, information theory, statistics, and optimization theory. A summary report by the conference director, James H. Moor, was published in AI Magazine. == Conference Program and links to published papers == James H. Moor, conference Director, Introduction Carol Folt and Barry Scherr, Welcome Carey Heckman, Tonypandy and the Origins of Science === AI: Past, Present, Future === John McCarthy, What Was Expected, What We Did, and AI Today Marvin Minsky, The Emotion Machine === The Future Model of Thinking === Ron Brachman and Hector Levesque, A Large Part of Human Thought David Mumford, What is the Right Model for 'Thought'? Stuart Russell, The Approach of Modern AI === The Future of Network Models === Geoffrey Hinton & Simon Osindero, From Pandemonium to Graphical Models and Back Again Rick Granger, From Brain Circuits to Mind Manufacture === The Future of Learning & Search === Oliver Selfridge, Learning and Education for Software: New Approaches in Machine Learning Ray Solomonoff, Machine Learning — Past and Future Leslie Pack Kaelbling, Learning to be Intelligent Peter Norvig, Web Search as a Product of and Catalyst for AI === The Future of AI === Rod Brooks, Intelligence and Bodies Nils Nilsson, Routes to the Summit Eric Horvitz, In Pursuit of Artificial Intelligence: Reflections on Challenges and Trajectories === The Future of Vision === Eric Grimson, Intelligent Medical Image Analysis: Computer Assisted Surgery and Disease Monitoring Takeo Kanade, Artificial Intelligence Vision: Progress and Non-Progress Terry Sejnowski, A Critique of Pure Vision === The Future of Reasoning === Alan Bundy, Constructing, Selecting and Repairing Representations of Knowledge Edwina Rissland, The Exquisite Centrality of Examples Bart Selman, The Challenge and Promise of Automated Reasoning === The Future of Language and Cognition === Trenchard More The Birth of Array Theory and Nial Eugene Charniak, Why Natural Language Processing is Now Statistical Natural Language Processing Pat Langley, Intelligent Behavior in Humans and Machines === The Future of the Future === Ray Kurzweil, Why We Can Be Confident of Turing Test Capability Within a Quarter Century George Cybenko, The Future Trajectory of AI Charles J. Holland, DARPA's Perspective === AI and Games === Jonathan Schaeffer, Games as a Test-bed for Artificial Intelligence Research Danny Kopec, Chess and AI Shay Bushinsky, Principle Positions in Deep Junior's Development === Future Interactions with Intelligent Machines === Daniela Rus, Making Bodies Smart Sherry Turkle, From Building Intelligences to Nurturing Sensibilities === Selected Submitted Papers: Future Strategies for AI === J. Storrs Hall, Self-improving AI: An Analysis Selmer Bringsjord, The Logicist Manifesto Vincent C. Müller, Is There a Future for AI Without Representation? Kristinn R. Thórisson, Integrated A.I. Systems === Selected Submitted Papers: Future Possibilities for AI === Eric Steinhart, Survival as a Digital Ghost Colin T. A. Schmidt, Did You Leave That 'Contraption' Alone With Your Little Sister? Michael Anderson & Susan Leigh Anderson, The Status of Machine Ethics Marcello Guarini, Computation, Coherence, and Ethical Reasoning

Rumelhart Prize

The David E. Rumelhart Prize for Contributions to the Theoretical Foundations of Human Cognition was founded in 2001 in honor of the cognitive scientist David Rumelhart to introduce the equivalent of a Nobel Prize for cognitive science. It is awarded annually to "an individual or collaborative team making a significant contemporary contribution to the theoretical foundations of human cognition". The annual award is presented at the Cognitive Science Society meeting, where the recipient gives a lecture and receives a check for $100,000. At the conclusion of the ceremony, the next year's award winner is announced. The award is funded by the Robert J. Glushko and Pamela Samuelson Foundation. The Rumelhart Prize committee is independent of the Cognitive Science Society. However, the society provides a large and interested audience for the awards. == Selection Committee == As of 2022, the selection committee for the prize consisted of: Richard Cooper (chair) Dedre Gentner Robert J. Glushko Tania Lombrozo Steven T. Piantadosi Jesse Snedeker == Recipients ==

Sub-pixel resolution

In digital image processing, sub-pixel resolution can be obtained in images constructed from sources with information exceeding the nominal pixel resolution of said images. == Example == For example, if the image of a ship of length 50 metres (160 ft), viewed side-on, is 500 pixels long, the nominal resolution (pixel size) on the side of the ship facing the camera is 0.1 metres (3.9 in). Now sub-pixel resolution of well resolved features can measure ship movements which are an order of magnitude (10×) smaller. Movement is specifically mentioned here because measuring absolute positions requires an accurate lens model and known reference points within the image to achieve sub-pixel position accuracy. Small movements can however be measured (down to 1 cm) with simple calibration procedures. Specific fit functions often suffer specific bias with respect to image pixel boundaries. Users should therefore take care to avoid these "pixel locking" (or "peak locking") effects. == Determining feasibility == Whether features in a digital image are sharp enough to achieve sub-pixel resolution can be quantified by measuring the point spread function (PSF) of an isolated point in the image. If the image does not contain isolated points, similar methods can be applied to edges in the image. It is also important when attempting sub-pixel resolution to keep image noise to a minimum. This, in the case of a stationary scene, can be measured from a time series of images. Appropriate pixel averaging, through both time (for stationary images) and space (for uniform regions of the image) is often used to prepare the image for sub-pixel resolution measurements.

Midjourney

Midjourney is a generative artificial intelligence program and service created and hosted by the San Francisco–based "independent research lab" Midjourney, Inc. Midjourney generates images from natural language descriptions, called prompts, similar to OpenAI's DALL-E and Stability AI's Stable Diffusion. It is one of the technologies of the AI boom. The tool was launched into open beta on July 12, 2022. The Midjourney team is led by David Holz, who co-founded Leap Motion. Holz told The Register in August 2022 that the company was already profitable. Users generate images with Midjourney using Discord bot commands or the official website. == History == Midjourney, Inc. was founded in San Francisco, California, by David Holz, previously a co-founder of Leap Motion. The Midjourney image generation platform entered open beta on July 12, 2022. On March 14, 2022, the Midjourney Discord server launched with a request to post high-quality photographs to Twitter and Reddit for systems training. === Model versions === The company has been working on improving its algorithms, releasing new model versions every few months. Version 2 of their algorithm was launched in April 2022, and version 3 on July 25. On November 5, 2022, the alpha iteration of version 4 was released to users. Starting from the 4th version, MJ models were trained on Google TPUs. On March 15, 2023, the alpha iteration of version 5 was released. The 5.1 model is more opinionated than version 5, applying more of its own stylization to images, while the 5.1 RAW model adds improvements while working better with more literal prompts. The version 5.2 included a new "aesthetics system", and the ability to "zoom out" by generating surroundings to an existing image. On December 21, 2023, the alpha iteration of version 6 was released. The model was trained from scratch over a nine month period. Support was added for better text rendition and a more literal interpretation of prompts. == Functionality == Midjourney is accessible through a Discord bot or by accessing their website. Users can use Midjourney through Discord either through their official Discord server, by directly messaging the bot, or by inviting the bot to a third-party server. To generate images, users use the /imagine command and type in a prompt; the bot then returns a set of four images, which users are given the option to upscale. To generate images on the website, users initially needed to have generated at least 1,000 images through the bot; this limitation has since been removed. === Vary (Region) + remix feature === Midjourney released a Vary (Region) feature on September 5, 2023, as part of MidJourney V5.2. This feature allows users to select a specific area of an image and apply variations only to that region while keeping the rest of the image unchanged. === Midjourney web interface === Midjourney introduced its web interface to make its tools more accessible, moving beyond its initial reliance on Discord. This web-based platform was launched in August 2024 alongside the release of Midjourney version 6.1. The web editor consolidates tools such as image editing, panning, zooming, region variation, and inpainting into a single interface. The introduction of the web interface also syncs conversations between Midjourney's Discord channels and web rooms, further enhancing collaboration across both platforms. This shift was in response to growing competition from other AI image generation platforms like Adobe Firefly and Google’s Imagen, which had already launched as native web apps with integration into popular design tools. === Image Weight === This feature lets users control how much influence an uploaded image has on the final output. By adjusting the "image weight" parameter, users can prioritize either the content of the prompt or the characteristics of the image. For instance, setting a higher weight will ensure that the generated result closely follows the image's structure and details, while a lower weight allows the text prompt to have more influence over the final output. === Style Reference === With Style Reference, users can upload an image to use as a stylistic guide for their creation. This tool enables MidJourney to extract the style—whether it is the color palette, texture, or overall atmosphere—from the reference image and apply it to a newly generated image. The feature allows users to fine-tune the aesthetics of their creations by integrating specific artistic styles or moods. === Character Reference === The Character Reference feature allows for a more targeted approach in defining characters. Users can upload an image of a character, and the system uses that image as a reference to generate similar characters in the output. This feature is particularly useful in maintaining consistency in appearance for characters across different images. == Uses == Midjourney's founder, David Holz, told The Register that artists use Midjourney for rapid prototyping of artistic concepts to show to clients before starting work themselves. The advertising industry quickly adopted AI tools such as Midjourney, DALL-E, and Stable Diffusion to create original content and brainstorm ideas. Architects have described using the software to generate mood boards for the early stages of projects, as an alternative to searching Google Images. === Notable usage and controversy === The program was used by the British magazine The Economist to create the front cover for an issue in June 2022. In Italy, the leading newspaper Corriere della Sera published a comic created with Midjourney by writer Vanni Santoni in August 2022. Charlie Warzel used Midjourney to generate two images of Alex Jones for Warzel's newsletter in The Atlantic. The use of an AI-generated cover was criticised by people who felt it was taking jobs from artists. Warzel called his action a mistake in an article about his decision to use generated images. Last Week Tonight with John Oliver included a 10-minute segment on Midjourney in an episode broadcast in August 2022. A Midjourney image called Théâtre D'opéra Spatial won first place in the digital art competition at the 2022 Colorado State Fair. Jason Allen, who wrote the prompt that led Midjourney to generate the image, printed the image onto a canvas and entered it into the competition using the name Jason M. Allen via Midjourney. Other digital artists were upset by the news. Allen was unapologetic, insisting that he followed the competition's rules. The two category judges were unaware that Midjourney used AI to generate images, although they later said that had they known this, they would have awarded Allen the top prize anyway. In December 2022, Midjourney was used to generate the images for an AI-generated children's book that was created over a weekend. Titled Alice and Sparkle, the book features a young girl who builds a robot that becomes self-aware. The creator, Ammaar Reeshi, used Midjourney to generate a large number of images, from which he chose 13 for the book. Both the product and process drew criticism. One artist wrote that "the main problem... is that it was trained off of artists' work. It's our creations, our distinct styles that we created, that we did not consent to being used." In 2023, the realism of AI-based text-to-image generators, such as Midjourney, DALL-E, or Stable Diffusion, reached such a high level that it led to a significant wave of viral AI-generated photos. Widespread attention was gained by a Midjourney-generated photo of Pope Francis wearing a white puffer coat, the fictional arrest of Donald Trump, and a hoax of an attack on the Pentagon, as well as the usage in professional creative arts. Research has suggested that the images Midjourney generates can be biased. For example, even neutral prompts in one study returned unequal results on the aspects of gender, skin color, and location. A study by researchers at the nonprofit group Center for Countering Digital Hate found the tool to be easy to use to generate racist and conspiratorial images. In October 2023, Rest of World reported that Midjourney tends to generate images based on national stereotypes. In 2024, a Frontiers journal published a paper which contained gibberish figures generated with Midjourney, one of which was a diagram of a rat with large testicles and a large penis towering over himself. The paper was retracted a day after the images went viral on Twitter. ==== Content moderation and censorship in Midjourney ==== Prior to May 2023, Midjourney implemented a moderation mechanism predicated on a banned word system. This method prohibited the use of language associated with explicit content, such as sexual or pornographic themes, as well as extreme violence. Moreover, the system also banned certain individual words, including those of religious and political figures, such as Allah or General Secretary of the Chinese Communist Party Xi Jinping. This practice occasionally stirred controversy due to perceiv