Bookmarklet

A bookmarklet is a bookmark stored in a web browser that contains JavaScript commands that add new features to the browser. Bookmarklets are stored as the URL of a bookmark in a web browser or as a hyperlink on a web page. They are usually small snippets of JavaScript that are executed when a user clicks on them. When clicked, bookmarklets can perform a wide variety of operations, such as running a search query from selected text or extracting data from a table. Another name for bookmarklet is favelet or favlet, derived from favorites (a synonym of bookmark).

== History ==

Steve Kangas of bookmarklets.com coined the word bookmarklet when he started to create short scripts based on a suggestion in Netscape's JavaScript guide. Before that, Tantek Çelik called these scripts favelets and used that word as early as 6 September 2001 (personal email). Brendan Eich, who developed JavaScript at Netscape, gave this account of the origin of bookmarklets:

"They were a deliberate feature in this sense: I invented the javascript: URL along with JavaScript in 1995, and intended that javascript: URLs could be used as any other kind of URL, including being bookmark-able. In particular, I made it possible to generate a new document by loading, e.g. javascript:'hello, world', but also (key for bookmarklets) to run arbitrary script against the DOM of the current document, e.g. javascript:alert(document.links[0].href). The difference is that the latter kind of URL uses an expression that evaluates to the undefined type in JS. I added the void operator to JS before Netscape 2 shipped to make it easy to discard any non-undefined value in a javascript: URL."

The increased implementation of Content Security Policy (CSP) in websites has caused problems with bookmarklet execution and usage (2013–2015), with some suggesting that this heralds the end or death of bookmarklets.
William Donnelly created a workaround for this problem (in the specific case of loading, referencing and using JavaScript library code) in early 2015, using a Greasemonkey userscript (a Firefox / Pale Moon browser add-on) and a simple bookmarklet-userscript communication protocol. It allows library-based bookmarklets to be executed on any website, including those that use CSP and have an https:// URI scheme. However, if browsers support disallowing inline script execution using CSP, and websites begin to implement that feature, this fix will break.

== Concept ==

Web browsers use URIs for the href attribute of the <a> tag and for bookmarks. The URI scheme, such as http or ftp, which generally specifies the protocol, determines the format of the rest of the string. Browsers also implement javascript: URIs, which to a parser are just like any other URI. The browser recognizes the javascript scheme and treats the rest of the string as a JavaScript program, which is then executed. The expression result, if any, is treated as the HTML source code for a new page displayed in place of the original.

The executing script has access to the current page, which it may inspect and change. If the script returns an undefined type (rather than, for example, a string), the browser will not load a new page, with the result that the script simply runs against the current page content. This permits changes such as in-place font size and color changes without a page reload.

An immediately invoked function that returns no value, or an expression preceded by the void operator, will prevent the browser from attempting to parse the result of the evaluation as a snippet of HTML markup.

== Usage ==

Bookmarklets are saved and used as normal bookmarks. As such, they are simple "one-click" tools which add functionality to the browser.
For example, they can:

* Modify the appearance of a web page within the browser (e.g., change font size or background color)
* Extract data from a web page (e.g., hyperlinks, images or text)
* Remove redirects from search results (e.g., Google's), to show the actual target URL
* Submit the current page to a blogging service such as Posterous, a link-shortening service such as bit.ly, or a bookmarking service such as Delicious
* Query a search engine or online encyclopedia with highlighted text or with text entered in a dialog box
* Submit the current page to a link validation service or translation service
* Set commonly chosen configuration options when the page itself provides no way to do this
* Control HTML5 audio and video playback parameters such as speed, position, looping, and showing or hiding playback controls; speed can be adjusted beyond the typical range offered by HTML5 players

Installing a bookmarklet follows the same process as adding a normal bookmark; the only difference is that the URL destination field contains JavaScript code preceded by javascript: rather than a web address. Once created, bookmarklets can be run by clicking on them.
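
As a rough illustration of the two patterns described under Concept, the sketch below builds the string that would go into a bookmark's URL field. The helper names (`wrapInIife`, `toBookmarkletUrl`, `toVoidBookmarkletUrl`) are hypothetical and not part of any browser API, and the percent-encoding step is a precaution assumed here rather than something the text above requires.

```javascript
// Hypothetical helpers for illustration; not part of any browser API.

// Wrap statements in an immediately invoked function that returns no value,
// so the javascript: URL as a whole evaluates to undefined.
function wrapInIife(body) {
  return '(function(){' + body + '})();';
}

// Build the string to paste into the bookmark's URL field.
// Percent-encoding is an assumption made here to keep the URL well-formed.
function toBookmarkletUrl(body) {
  return 'javascript:' + encodeURIComponent(wrapInIife(body));
}

// Alternative for a single expression: discard its value with the void operator.
function toVoidBookmarkletUrl(expression) {
  return 'javascript:void(' + encodeURIComponent(expression) + ')';
}

// Example body: enlarge the page's base font size in place.
// The body only does something when the bookmarklet is run in a browser.
const enlargeText = "document.body.style.fontSize = '125%';";
const url = toBookmarkletUrl(enlargeText);
const voidUrl = toVoidBookmarkletUrl('alert(document.title)');

console.log(url);
console.log(voidUrl);
```

Without the wrapping function, the assignment in `enlargeText` would evaluate to the string '125%', which a browser following the behaviour described above could treat as the source of a new page.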

Autonomic networking

Autonomic networking follows the concept of Autonomic Computing, an initiative started by IBM in 2001. Its ultimate aim is to create self-managing networks to overcome the rapidly growing complexity of the Internet and other networks and to enable their further growth, far beyond their present size.

== Increasing size and complexity ==

The ever-growing management complexity of the Internet caused by its rapid growth is seen by some experts as a major problem that limits its usability in the future. What's more, increasingly popular smartphones, PDAs, networked audio and video equipment, and game consoles need to be interconnected. Pervasive Computing not only adds features, but also burdens existing networking infrastructure with more and more tasks that sooner or later will not be manageable by human intervention alone. Another important aspect is the cost of manually controlling the huge numbers of vitally important devices in current network infrastructures.

== Autonomic nervous system ==

The autonomic nervous system (ANS) is the part of complex biological nervous systems that is not consciously controlled. It regulates bodily functions and the activity of specific organs. As proposed by IBM, future communication systems might be designed in a similar way to the ANS.

== Components of autonomic networking ==

As autonomics conceptually derives from biological entities such as the human autonomic nervous system, each of the areas can be metaphorically related to functional and structural aspects of a living being. In the human body, the autonomic system facilitates and regulates a variety of functions including respiration, blood pressure and circulation, and emotive response. The autonomic nervous system is the interconnecting fabric that supports feedback loops between internal states and various sources by which internal and external conditions are monitored.
=== Autognostics ===

Autognostics includes a range of self-discovery, awareness, and analysis capabilities that provide the autonomic system with a view on high-level state. In metaphor, this represents the perceptual sub-systems that gather, analyze, and report on internal and external states and conditions – for example, this might be viewed as the eyes, visual cortex and perceptual organs of the system. Autognostics, or literally "self-knowledge", provides the autonomic system with a basis for response and validation.

A rich autognostic capability may include many different "perceptual senses". For example, the human body gathers information via the usual five senses, the so-called sixth sense of proprioception (sense of body position and orientation), and through emotive states that represent the gross wellness of the body. As conditions and states change, they are detected by the sensory monitors and provide the basis for adaptation of related systems. Implicit in such a system are embedded models of both internal and external environments such that relative value can be assigned to any perceived state: a perceived physical threat (e.g. a snake) can result in rapid shallow breathing related to the fight-flight response, a phylogenetically effective model of interaction with recognizable threats.

In the case of autonomic networking, the state of the network may be defined by inputs from:

* individual network elements such as switches and network interfaces, including
** specification and configuration
** historical records
** current state
* traffic flows
* end-hosts
* application performance data
* logical diagrams and design specifications

Most of these sources represent relatively raw and unprocessed views that have limited relevance. Post-processing and various forms of analysis must be applied to generate meaningful measurements and assessments against which current state can be derived.
The autognostic system interoperates with:

* configuration management - to control network elements and interfaces
* policy management - to define performance objectives and constraints
* autodefense - to identify attacks and accommodate the impact of defensive responses

=== Configuration management ===

Configuration management is responsible for the interaction with network elements and interfaces. It includes an accounting capability with historical perspective that provides for the tracking of configurations over time, with respect to various circumstances. In the biological metaphor, these are the hands and, to some degree, the memory of the autonomic system.

On a network, remediation and provisioning are applied via configuration settings of specific devices. Implementations affecting access and selective performance with respect to role and relationship are also applied. Almost all the "actions" that are currently taken by human engineers fall under this area. With only a few exceptions, interfaces are set by hand, or by extension of the hand, through automated scripts.

Implicit in the configuration process is the maintenance of a dynamic population of devices under management, a historical record of changes and the directives which invoked change. As is typical of many accounting functions, configuration management should be capable of operating on devices and then rolling back changes to recover previous configurations. Where change may lead to unrecoverable states, the sub-system should be able to qualify the consequences of changes prior to issuing them.

As directives for change must originate from other sub-systems, the shared language for such directives must be abstracted from the details of the devices involved. The configuration management sub-system must be able to translate unambiguously between directives and hard actions, or be able to signal the need for further detail on a directive.
An inferential capacity may be appropriate to support sufficient flexibility (e.g. where configuration would otherwise never take place because there is no unique one-to-one mapping between directive and configuration settings). Where standards are not sufficient, a learning capacity may also be required to acquire new knowledge of devices and their configuration.

Configuration management interoperates with all of the other sub-systems including:

* autognostics - receives direction for and validation of changes
* policy management - implements policy models through mapping to underlying resources
* security - applies access and authorization constraints for particular policy targets
* autodefense - receives direction for changes

=== Policy management ===

Policy management includes policy specification, deployment, reasoning over policies, updating and maintaining policies, and enforcement. Policy-based management is required for:

* constraining different kinds of behavior including security, privacy, resource access, and collaboration
* configuration management
* describing business processes and defining performance
* defining role and relationship, and establishing trust and reputation

It provides the models of environment and behavior that represent effective interaction according to specific goals. In the human nervous system metaphor, these models are implicit in the evolutionary "design" of biological entities and specific to the goals of survival and procreation. A definition of what constitutes a policy is necessary to consider what is involved in managing it. A relatively flexible and abstract framework of values, relationships, roles, interactions, resources, and other components of the network environment is required. This sub-system extends far beyond the physical network to the applications in use and the processes and end-users that employ the network to achieve specific goals.
It must express the relative values of various resources, outcomes, and processes and include a basis for assessing states and conditions. Unless embodied in some system outside the autonomic network or implicit to the specific policy implementation, the framework must also accommodate the definition of process, objectives and goals. Business process definitions and descriptions are then an integral part of the policy implementation. Further, as policy management represents the ultimate basis for the operation of the autonomic system, it must be able to report on its operation with respect to the details of its implementation.

The policy management sub-system interoperates (at least) indirectly with all other sub-systems but primarily interacts with:

* autognostics - providing the definition of performance and accepting reports on conditions
* configuration management - providing constraints on device configuration
* security - providing definitions of roles, access and permissions

=== Autodefense ===

Autodefense represents a dynamic and adaptive mechanism that responds to malicious and intentional attacks on the network infrastructure, or use of the network infrastructure to attack IT resources. As defensive measures tend to impede the operation of IT, it is optimally capable of balancing performance objectives with typically over-riding threat management actions. In the

One-class classification

In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, is an approach to the training of binary classifiers in which only examples of one of the two classes are used. Examples include the monitoring of helicopter gearboxes, motor failure prediction, or assessing the operational status of a nuclear plant as 'normal': in such scenarios, there are few, if any, examples of the catastrophic system states – rare outliers – that comprise the second class. Alternatively, the class that is being focused on may cover a small, coherent subset of the data and the training may rely on an information bottleneck approach. In practice, counter-examples from the second class may be used in later rounds of training to further refine the algorithm.

== Overview ==

The term one-class classification (OCC) was coined by Moya & Hush (1996) and many applications can be found in the scientific literature, for example outlier detection, anomaly detection, and novelty detection. A feature of OCC is that it uses only sample points from the assigned class, so that a representative sampling is not strictly required for non-target classes.

== Introduction ==

SVM-based one-class classification (OCC) relies on identifying the smallest hypersphere (with radius $r$ and center $c$) containing all the data points. This method is called Support Vector Data Description (SVDD). Formally, the problem can be defined in the following constrained optimization form:

$$\min_{r,c} r^{2} \text{ subject to } \|\Phi(x_{i}) - c\|^{2} \leq r^{2} \;\; \forall i = 1, 2, \ldots, n$$

However, the above formulation is highly restrictive, and is sensitive to the presence of outliers.
Therefore, a flexible formulation that allows for the presence of outliers is formulated as shown below:

$$\min_{r,c,\zeta} r^{2} + \frac{1}{\nu n} \sum_{i=1}^{n} \zeta_{i}$$

$$\text{subject to } \|\Phi(x_{i}) - c\|^{2} \leq r^{2} + \zeta_{i} \;\; \forall i = 1, 2, \ldots, n$$

From the Karush–Kuhn–Tucker conditions for optimality, we get

$$c = \sum_{i=1}^{n} \alpha_{i} \Phi(x_{i}),$$

where the $\alpha_{i}$'s are the solution to the following optimization problem:

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_{i} \kappa(x_{i}, x_{i}) - \sum_{i,j=1}^{n} \alpha_{i} \alpha_{j} \kappa(x_{i}, x_{j})$$

$$\text{subject to } \sum_{i=1}^{n} \alpha_{i} = 1 \text{ and } 0 \leq \alpha_{i} \leq \frac{1}{\nu n} \text{ for all } i = 1, 2, \ldots, n.$$

The introduction of a kernel function provides additional flexibility to the one-class SVM (OSVM) algorithm.

=== PU (Positive Unlabeled) learning ===

A similar problem is PU learning, in which a binary classifier is constructed by semi-supervised learning from only positive and unlabeled sample points. In PU learning, two sets of examples are assumed to be available for training: the positive set $P$ and a mixed set $U$, which is assumed to contain both positive and negative samples, but without these being labeled as such. This contrasts with other forms of semi-supervised learning, where it is assumed that a labeled set containing examples of both classes is available in addition to unlabeled samples. A variety of techniques exist to adapt supervised classifiers to the PU learning setting, including variants of the EM algorithm.
PU learning has been successfully applied to text, time series, bioinformatics tasks, and remote sensing data.

== Approaches ==

Several approaches have been proposed to solve one-class classification (OCC). The approaches fall into three main categories: density estimation, boundary methods, and reconstruction methods.

=== Density estimation methods ===

Density estimation methods rely on estimating the density of the data points and setting a threshold. These methods assume a distribution, such as a Gaussian or a Poisson distribution, after which discordancy tests can be used to test new objects. These methods are robust to scale variance.

The Gaussian model is one of the simplest methods to create one-class classifiers. By the central limit theorem (CLT), this model works best when a large number of samples is present and the samples are perturbed by small independent error values. The probability distribution for a d-dimensional object is given by:

$$p_{\mathcal{N}}(z; \mu; \Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}} |\Sigma|^{\frac{1}{2}}} \exp\left\{ -\frac{1}{2} (z - \mu)^{T} \Sigma^{-1} (z - \mu) \right\}$$

where $\mu$ is the mean and $\Sigma$ is the covariance matrix. Computing the inverse of the covariance matrix ($\Sigma^{-1}$) is the costliest operation. In cases where the data is not scaled properly, or the data has singular directions, the pseudo-inverse $\Sigma^{+}$ is used to approximate the inverse, and is calculated as $\Sigma^{T}(\Sigma\Sigma^{T})^{-1}$.

=== Boundary methods ===

Boundary methods focus on setting boundaries around a small set of points, called target points. These methods attempt to optimize the volume. Boundary methods rely on distances, and hence are not robust to scale variance.
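
The sensitivity of distance-based scoring to feature scale can be seen in a small sketch. It uses a squared distance to the nearest center as the score, with made-up data and hypothetical function names; it illustrates the scale-variance point only and is not an implementation of any particular boundary method.

```javascript
// Squared Euclidean distance between two equal-length numeric vectors.
function squaredDistance(a, b) {
  return a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0);
}

// Distance from z to the nearest center; smaller means closer to the target class.
function nearestCenterDistance(z, centers) {
  return Math.min(...centers.map((c) => squaredDistance(z, c)));
}

// Multiply each feature by a per-feature factor (e.g. a change of units).
function rescale(v, factors) {
  return v.map((x, i) => x * factors[i]);
}

// Made-up data: one center at the origin and two test objects.
const centers = [[0, 0]];
const a = [3, 0];
const b = [0, 2];

const before = {
  a: nearestCenterDistance(a, centers), // 9
  b: nearestCenterDistance(b, centers), // 4
};

// Express the second feature in units ten times smaller.
const factors = [1, 10];
const scaledCenters = centers.map((c) => rescale(c, factors));
const after = {
  a: nearestCenterDistance(rescale(a, factors), scaledCenters), // 9
  b: nearestCenterDistance(rescale(b, factors), scaledCenters), // 400
};

console.log(before, after);
```

Before rescaling, object b is closer to the center than object a; after rescaling, the order is reversed, so a fixed distance threshold would accept a different object. This is one reason features are often standardized before a distance-based method is applied.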
The K-centers method, NN-d, and SVDD are some of the key examples.

==== K-centers ====

In the K-center algorithm, $k$ small balls with equal radius are placed to minimize the maximum of all minimum distances between training objects and the centers. Formally, the following error is minimized:

$$\varepsilon_{k\text{-center}} = \max_{i} \left( \min_{k} \|x_{i} - \mu_{k}\|^{2} \right)$$

The algorithm uses a forward search method with random initialization, where the radius is determined by the maximum distance of the objects that any given ball should capture. After the centers are determined, for any given test object $z$ the distance can be calculated as

$$d_{k\text{-center}}(z) = \min_{k} \|z - \mu_{k}\|^{2}$$

=== Reconstruction methods ===

Reconstruction methods use prior knowledge and a generating process to build a generating model that best fits the data. New objects can be described in terms of a state of the generating model. Some examples of reconstruction methods for OCC are k-means clustering, learning vector quantization, and self-organizing maps.

== Applications ==

=== Document classification ===

The basic Support Vector Machine (SVM) paradigm is trained using both positive and negative examples; however, studies have shown there are many valid reasons for using only positive examples. When the SVM algorithm is modified to use only positive examples, the process is considered one-class classification. One situation where this type of classification might prove useful to the SVM paradigm is in trying to identify a web user's sites of interest based only on the user's browsing history.

=== Biomedical studies ===

One-class classification can be particularly useful in biomedical studies, where data from other classes can often be difficult or impossible to obtain.
In studying biomedical data it can be difficult and/or expensive to obtain the set of labeled data from the second class that would be necessary to perform a two-class classification. A study from The Scientific World Journal found that the typicality approach is the most useful in analysing biomedical data because it can be applied to any type of dataset (continuous, discrete, or nominal). The typicality approach is based on the clustering of data by examining data and placing it into new or existing clusters. To apply typicality to one-class classification for biomedical studies, each new observation, $y_{0}$, is compared to the target class, $C$, and identified as an outlier or a member of the target class.

=== Unsupervised Concept Drift Detection ===

One-class classification has similarities with unsupervised concept drift detection, where both aim to identify whether the unseen data share similar characteristics to the initial data. A concept is referred to as the fixed probability distribution which data is drawn from. In unsupervised concept drift detection, the goal is to detect if the data distribution changes without utilizing class labels. In one-class classification, the flow of data is not important. Unseen data is classified as typical or outlier depending on its characteristics, whether it is from the initi

FERET (facial recognition technology)

The Facial Recognition Technology (FERET) program was a government-sponsored project that aimed to create a large, automatic face-recognition system for intelligence, security, and law enforcement purposes. The program began in 1993 under the combined leadership of Dr. Harry Wechsler at George Mason University (GMU) and Dr. Jonathon Phillips at the Army Research Laboratory (ARL) in Adelphi, Maryland and resulted in the development of the Facial Recognition Technology (FERET) database. The goal of the FERET program was to advance the field of face recognition technology by establishing a common database of facial imagery for researchers to use and setting a performance baseline for face-recognition algorithms.

Potential areas where this face-recognition technology could be used include:

* Automated searching of mug books using surveillance photos
* Controlling access to restricted facilities or equipment
* Checking the credentials of personnel for background and security clearances
* Monitoring airports, border crossings, and secure manufacturing facilities for particular individuals
* Finding and logging multiple appearances of individuals over time in surveillance videos
* Verifying identities at ATMs
* Searching photo ID records for fraud detection

The FERET database has been used by more than 460 research groups and is currently managed by the National Institute of Standards and Technology (NIST). By 2017, the FERET database had been used to train artificial intelligence programs and computer vision algorithms to identify and sort faces.

== History ==

The origin of facial recognition technology is largely attributed to Woodrow Wilson Bledsoe and his work in the 1960s, when he developed a system to identify faces from a database of thousands of photographs. The FERET program first began as a way to unify a large body of face-recognition technology research under a standard database.
Before the program's inception, most researchers created their own facial imagery database that was attuned to their own specific area of study. These personal databases were small and usually consisted of images from fewer than 50 individuals. The only notable exceptions were the following:

* Alex Pentland’s database of around 7500 facial images at the Massachusetts Institute of Technology (MIT)
* Joseph Wilder's database of around 250 individuals at Rutgers University
* Christoph von der Malsburg’s database of around 100 facial images at the University of Southern California (USC)

The lack of a common database made it difficult to compare the results of face recognition studies in the scientific literature because each report involved different assumptions, scoring methods, and images. Most of the papers that were published did not use images from a common database nor follow a standard testing protocol. As a result, researchers were unable to make informed comparisons between the performances of different face-recognition algorithms.

In September 1993, the FERET program was spearheaded by Dr. Harry Wechsler and Dr. Jonathon Phillips under the sponsorship of the U.S. Department of Defense Counterdrug Technology Development Program through DARPA, with ARL serving as technical agent.

=== Phase I ===

The first facial images for the FERET database were collected from August 1993 to December 1994, a time period known as Phase I. The pictures were initially taken with a 35-mm camera at both GMU and ARL facilities, and the same physical setup was used in each photography session to keep the images consistent. For each individual, the pictures were taken in sets, including two frontal views, a right and left profile, a right and left quarter profile, a right and left half profile, and sometimes at five extra locations. Therefore, a set of images consisted of 5 to 11 images per person.
At the end of Phase I, the FERET database had collected 673 sets of images, resulting in over 5000 total images.

At the end of Phase I, five organizations were given the opportunity to test their face-recognition algorithms on the newly created FERET database in order to compare how they performed against each other. The five principal investigators were:

* MIT, led by Alex Pentland
* Rutgers University, led by Joseph Wilder
* The Analytic Science Company (TASC), led by Gale Gordon
* The University of Illinois at Chicago (UIC) and the University of Illinois at Urbana-Champaign, led by Lewis Sadler and Thomas Huang
* USC, led by Christoph von der Malsburg

During this evaluation, three different automatic tests were given to the principal investigators without human intervention:

* The large gallery test, which served to baseline how algorithms performed against a database when it has not been properly tuned.
* The false-alarm test, which tested how well the algorithm monitored an airport for suspected terrorists.
* The rotation test, which measured how well the algorithm performed when the images of an individual in the gallery had different poses compared to those in the probe set.

For most of the test trials, the algorithms developed by USC and MIT managed to outperform the other three algorithms for the Phase I evaluation.

=== Phase II ===

Phase II began after Phase I, and during this time, the FERET database acquired more sets of facial images. By the start of the Phase II evaluation in March 1995, the database contained 1109 sets of images for a total of 8525 images of 884 individuals. During the second evaluation, the same algorithms from the Phase I evaluation were given a single test. However, the database now contained significantly more duplicate images (463, compared to the previous 60), making the test more challenging.

=== Phase III ===

Afterwards, the FERET program entered Phase III, where another 456 sets of facial images were added to the database.
The Phase III evaluation, which took place in September 1996, aimed not only to gauge the progress of the algorithms since the Phase I assessment but also to identify the strengths and weaknesses of each algorithm and determine future objectives for research. By the end of 1996, the FERET database had accumulated a total of 14,126 facial images pertaining to 1199 different individuals as well as 365 duplicate sets of images. As a result of the FERET program, researchers were able to establish a common baseline for comparing different face-recognition algorithms and create a large standard database of facial images that is open for research.

In 2003, DARPA released a high-resolution, 24-bit color version of the images in the FERET database.

Common Voice

Common Voice is a crowdsourcing project started by Mozilla to create a free and open speech corpus. The project is supported by volunteers who record sample sentences with a microphone and review recordings of other users. The transcribed sentences are collected in a voice database available under the public domain license CC0. This license ensures that developers can use the database for voice-to-text and text-to-voice applications without restrictions or costs.

== Aims ==

Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents.

== Voice database ==

The first dataset was released in November 2017. More than 20,000 users worldwide had recorded 500 hours of English sentences. In February 2019, the first batch of languages was released for use. This included 18 languages such as English, French, German and Mandarin Chinese, but also less prevalent languages like Welsh and Kabyle. In total, this included almost 1,400 hours of recorded voice data from more than 42,000 contributors. By July 2020 the database had amassed 7,226 hours of voice recordings in 54 languages, 5,591 hours of which had been verified by volunteers.

In May 2021, following the work to add Kinyarwanda, the project received a grant to add Kiswahili. At the beginning of 2022, Bengali.AI partnered with Common Voice to launch the "Bangla Speech Recognition" project, which aims to make machines understand the Bangla language; 2,000 hours of voice data were collected. In September 2022, it was announced that the Twi language of Ghana was the 100th language to be added to the database.

As of December 2025, Mozilla Common Voice collects voice data for over 250 languages, with the most hours having been collected in English, Catalan, Kinyarwanda, Belarusian and Esperanto.

Artificial intelligence content detection

Artificial intelligence detection software aims to determine whether some content (text, image, video, or audio) was generated using artificial intelligence (AI). This software is often unreliable.

== Accuracy issues ==

Many AI detection tools have been shown to be unreliable in detecting AI-generated text. In a 2023 study conducted by Weber-Wulff et al., researchers evaluated 14 detection tools including Turnitin and GPTZero and found that "all scored below 80% of accuracy and only 5 over 70%." They also found that these tools tend to be biased towards classifying texts as human rather than as AI, and that the accuracy of these tools worsens upon paraphrasing.

=== False positives ===

In AI content detection, a false positive is when human-written work is incorrectly flagged as AI-written. Many AI detection platforms claim to have a minimal level of false positives, with Turnitin claiming a less than 1% false positive rate. However, later research by The Washington Post produced much higher rates of 50%, though they used a smaller sample size. False positives in an academic setting frequently lead to accusations of academic misconduct, which can have serious consequences for a student's academic record. Additionally, studies have shown evidence that many AI detection models are prone to give false positives to work written by people whose first language is not English, and also to work by neurodivergent people. In June 2023, Janelle Shane wrote that portions of her book You Look Like a Thing and I Love You were flagged as AI-generated.

=== False negatives ===

A false negative is a failure to identify documents with AI-written text. False negatives often happen as a result of a detection software's sensitivity level or because evasive techniques were used when generating the work to make it sound more human. False negatives are less of a concern academically, since they aren't likely to lead to accusations and ramifications.
Turnitin has stated that its detector has a 15% false negative rate. == Text detection == For text, detection is usually done to prevent alleged plagiarism, often by looking for telltale signs that a text was AI-generated, such as repetition of words or hallucinations. Detection systems may also rely on stylistic and structural regularities associated with LLM output, such as unusually consistent grammar, formulaic transitions, repeated discourse markers, and recurring rhetorical templates. Some tools are designed less to establish authorship provenance than to flag prose that resembles common LLM-generated style patterns. Detectors are often used by teachers marking their students' work, usually on an ad hoc basis. Following the release of ChatGPT and similar AI text generative software, many educational establishments have issued policies against the use of AI by students. AI text detection software is also used by those assessing job applicants, by online search engines, and in online moderation and publishing. Current detectors can be unreliable: they have incorrectly marked work by humans as originating from AI while failing to detect AI-generated work in other instances. MIT Technology Review said that the technology "struggled to pick up ChatGPT-generated text that had been slightly rearranged by humans and obfuscated by a paraphrasing tool". AI text detection software has also been shown to discriminate against non-native speakers of English. Two students from the University of California, Davis, were referred to the university's Office of Student Success and Judicial Affairs (OSSJA) after their professors scanned their essays with positive results; the first with an AI detector called GPTZero, and the second with an AI detector integration in Turnitin. However, following media coverage and a thorough investigation, the students were cleared of any wrongdoing.
In April 2023, Cambridge University and other members of the Russell Group of universities in the United Kingdom opted out of Turnitin's AI text detection tool, after expressing concerns it was unreliable. The University of Texas at Austin opted out of the system six months later. In May 2023, a professor at Texas A&M University–Commerce asked ChatGPT whether his students' content had been written by it, and ChatGPT said that it had. On that basis he threatened to fail the class, even though ChatGPT is not able to detect AI-generated writing. No students were prevented from graduating because of the issue, and all but one student (who admitted to using the software) were exonerated from accusations of having used ChatGPT in their content. In July 2023, a paper titled "GPT detectors are biased against non-native English writers" was released, reporting that GPT detectors discriminate against non-native English authors. The paper tested seven GPT detectors on essays from non-native English speakers and essays from United States students. The essays from non-native English speakers had an average false positive rate of 61.3%. An article by Thomas Germain, published on Gizmodo in June 2024, reported job losses among freelance writers and journalists due to AI text detection software mistakenly classifying their work as AI-generated. In September 2024, Common Sense Media reported that generative AI detectors had a 20% false positive rate for Black students, compared to 10% for Latino students and 7% for White students. To improve the reliability of AI text detection, researchers have explored digital watermarking techniques. A 2023 paper titled "A Watermark for Large Language Models" presents a method to embed imperceptible watermarks into text generated by large language models (LLMs). This watermarking approach allows content to be flagged as AI-generated with a high level of accuracy, even when text is slightly paraphrased or modified.
The technique is designed to be subtle and hard to detect for casual readers, thereby preserving readability, while providing a detectable signal for those employing specialized tools. However, while promising, watermarking faces challenges in remaining robust under adversarial transformations and ensuring compatibility across different LLMs. == Anti text detection == Software designed to bypass AI text detection is available. In practice, evasion may not require specialized bypass tools. Paraphrasing, style editing, and removal of repeated discourse markers can substantially reduce the effectiveness of detectors that rely on recognizable surface patterns. A study published in August 2023 analyzed 20 abstracts from papers published in the Eye Journal, which were then paraphrased using GPT-4. The AI-paraphrased abstracts were examined for plagiarism using QueText and for AI-generated content using Originality.AI. The texts were then re-processed through an adversarial software called Undetectable.ai in order to reduce the AI-detection scores. The study found that the AI detection tool, Originality.AI, identified text generated by GPT-4 with a mean accuracy of 91.3%. However, after reprocessing by Undetectable.ai, the mean detection accuracy of Originality.AI dropped to 27.8%. Some experts also believe that techniques like digital watermarking are ineffective because watermarks can be removed, or added to trigger false positives. The paper "A Watermark for Large Language Models" by Kirchenbauer et al. (2023) also addresses potential vulnerabilities of watermarking techniques. The authors outline a range of adversarial tactics, including text insertion, deletion, and substitution attacks, that could be used to bypass watermark detection. These attacks vary in complexity, from simple paraphrasing to more sophisticated approaches involving tokenization and homoglyph alterations.
The study highlights the challenge of maintaining watermark robustness against attackers who may employ automated paraphrasing tools or even specific language model replacements to alter text spans iteratively while retaining semantic similarity. Experimental results show that although such attacks can degrade watermark strength, they also come at the cost of text quality and increased computational resources. == Image, video, and audio detection == Several software tools purport to detect AI-generated images (for example, those originating from Midjourney or DALL-E). They are not completely reliable. Industry analyses have also noted that AI-driven image recognition systems often struggle in real-world environments, where inconsistent lighting, noise and variable visual inputs reduce detection reliability, a challenge highlighted in modern agricultural quality-control research. Other tools claim to identify video and audio deepfakes, but this technology is not yet fully reliable either. Despite debate around the efficacy of watermarking, Google DeepMind is actively developing a detection software called SynthID, which works by inserting a digital watermark that is invisible to the human eye into the pixels of an image.

Autoencoder

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning). An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms. Variants exist which aim to make the learned representations assume useful properties. Examples are regularized autoencoders (sparse, denoising and contractive autoencoders), which are effective in learning representations for subsequent classification tasks, and variational autoencoders, which can be used as generative models. Autoencoders are applied to many problems, including facial recognition, feature detection, anomaly detection, and learning the meaning of words. In terms of data synthesis, autoencoders can also be used to randomly generate new data that is similar to the input (training) data.
== Mathematical principles == === Definition === An autoencoder is defined by the following components: Two sets: the space of encoded messages $\mathcal{Z}$; the space of decoded messages $\mathcal{X}$. Typically $\mathcal{X}$ and $\mathcal{Z}$ are Euclidean spaces, that is, $\mathcal{X} = \mathbb{R}^m$, $\mathcal{Z} = \mathbb{R}^n$ with $m > n$. Two parametrized families of functions: the encoder family $E_\phi : \mathcal{X} \rightarrow \mathcal{Z}$, parametrized by $\phi$; the decoder family $D_\theta : \mathcal{Z} \rightarrow \mathcal{X}$, parametrized by $\theta$. For any $x \in \mathcal{X}$, we usually write $z = E_\phi(x)$, and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any $z \in \mathcal{Z}$, we usually write $x' = D_\theta(z)$, and refer to it as the (decoded) message. Usually, both the encoder and the decoder are defined as multilayer perceptrons (MLPs). For example, a one-layer-MLP encoder $E_\phi$ is: $E_\phi(x) = \sigma(Wx + b)$ where $\sigma$ is an element-wise activation function, $W$ is a "weight" matrix, and $b$ is a "bias" vector. === Training an autoencoder === An autoencoder, by itself, is simply a tuple of two functions. To judge its quality, we need a task. A task is defined by a reference probability distribution $\mu_{ref}$ over $\mathcal{X}$, and a "reconstruction quality" function $d : \mathcal{X} \times \mathcal{X} \to [0, \infty]$, such that $d(x, x')$ measures how much $x'$ differs from $x$.
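As an illustration of the one-layer encoder $E_\phi(x) = \sigma(Wx + b)$ given in the definition, the following is a minimal Python sketch. The choice of the logistic sigmoid for $\sigma$, the function names, and the particular weight values are assumptions made for the example; the definition itself leaves them open.

```python
import math


def sigmoid(t):
    """Logistic sigmoid, used here as one possible element-wise activation."""
    return 1.0 / (1.0 + math.exp(-t))


def encode(x, W, b):
    """One-layer encoder E_phi(x) = sigma(W x + b), with phi = (W, b).

    x is a list of m numbers, W is an n-by-m list of rows, b has n entries.
    """
    return [
        sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


# Hypothetical parameters: m = 3 input dimensions, n = 2 code dimensions (m > n).
W = [[0.5, -0.5, 0.0],
     [0.0, 1.0, -1.0]]
b = [0.0, 0.0]

z = encode([1.0, 1.0, 1.0], W, b)
print(z)  # [0.5, 0.5]
```

For this input both rows of $Wx + b$ evaluate to zero, so each entry of the code is $\sigma(0) = 0.5$. A decoder $D_\theta$ of the same form would map the two-dimensional code back to three dimensions.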
With the reference distribution $\mu_{ref}$ and the quality function $d$, we can define the loss function for the autoencoder as $L(\theta, \phi) := \mathbb{E}_{x \sim \mu_{ref}}[d(x, D_\theta(E_\phi(x)))]$. The optimal autoencoder for the given task $(\mu_{ref}, d)$ is then $\arg\min_{\theta, \phi} L(\theta, \phi)$. The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by gradient descent. This search process is referred to as "training the autoencoder". In most situations, the reference distribution is just the empirical distribution given by a dataset $\{x_1, ..., x_N\} \subset \mathcal{X}$, so that $\mu_{ref} = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}$, where $\delta_{x_i}$ is the Dirac measure, and the quality function is just the $L^2$ loss $d(x, x') = \|x - x'\|_2^2$, where $\|\cdot\|_2$ is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a least-squares optimization: $\min_{\theta, \phi} L(\theta, \phi)$, where $L(\theta, \phi) = \frac{1}{N} \sum_{i=1}^{N} \|x_i - D_\theta(E_\phi(x_i))\|_2^2$. === Interpretation === An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function $d$. The simplest way to perform the copying task perfectly would be to duplicate the signal.
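The least-squares training problem stated above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a standard implementation: the encoder and decoder are taken to be linear maps without activation or bias ($z = w \cdot x$ and $x' = z\,v$, with $\mathcal{X} = \mathbb{R}^2$ and $\mathcal{Z} = \mathbb{R}^1$), the gradients are derived by hand, and the toy dataset, initial values, learning rate and step count are made up for the example.

```python
def reconstruct(x, w, v):
    """Return (z, x') with encoder z = w . x and decoder x' = z * v."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # code in Z = R^1
    return z, [z * vi for vi in v]            # decoded message in X = R^2


def loss(data, w, v):
    """L(theta, phi) = (1/N) * sum_i ||x_i - D(E(x_i))||_2^2."""
    total = 0.0
    for x in data:
        _, x_rec = reconstruct(x, w, v)
        total += sum((xi - ri) ** 2 for xi, ri in zip(x, x_rec))
    return total / len(data)


def train(data, w, v, lr=0.05, steps=200):
    """Plain gradient descent on the least-squares loss."""
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_v = [0.0] * len(v)
        for x in data:
            z, x_rec = reconstruct(x, w, v)
            resid = [xi - ri for xi, ri in zip(x, x_rec)]
            r_dot_v = sum(ri * vi for ri, vi in zip(resid, v))
            for j in range(len(w)):
                # d/dw_j ||x - (w.x) v||^2 = -2 (resid . v) x_j
                grad_w[j] += -2.0 * r_dot_v * x[j] / n
            for j in range(len(v)):
                # d/dv_j ||x - z v||^2 = -2 z resid_j
                grad_v[j] += -2.0 * z * resid[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        v = [vj - lr * gj for vj, gj in zip(v, grad_v)]
    return w, v


# Toy dataset: points on the line x2 = 2 * x1, so a one-dimensional code suffices.
data = [[-1.0, -2.0], [-0.5, -1.0], [0.5, 1.0], [1.0, 2.0]]
w0, v0 = [0.1, 0.1], [0.1, 0.1]
w1, v1 = train(data, w0, v0)
print(loss(data, w0, v0), loss(data, w1, v1))  # loss before and after training
```

Because the toy data lie on a single line, a one-dimensional code is enough to reconstruct them, and the loss after training should be much smaller than before. With nonlinear MLP encoders and decoders the same loop applies, but the gradients are normally obtained by backpropagation rather than by hand.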
To keep the autoencoder from simply duplicating the signal, the code space $\mathcal{Z}$ usually has fewer dimensions than the message space $\mathcal{X}$. Such an autoencoder is called undercomplete. It can be interpreted as compressing the message, or reducing its dimensionality. At the limit of an ideal undercomplete autoencoder, every possible code $z$ in the code space is used to encode a message $x$ that really appears in the distribution $\mu_{ref}$, and the decoder is also perfect: $D_\theta(E_\phi(x)) = x$. This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder an arbitrary code $z$ and obtaining $D_\theta(z)$, which is a message that really appears in the distribution $\mu_{ref}$. If the code space $\mathcal{Z}$ has dimension larger than (overcomplete), or equal to, that of the message space $\mathcal{X}$, or the hidden units are given enough capacity, an autoencoder can learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features. In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below. == Variations == === Variational autoencoder (VAE) === Variational autoencoders (VAEs) belong to the families of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs are architected with different goals and have a different mathematical formulation. The latent space is, in this case, composed of a mixture of distributions instead of fixed vectors.
Given an input dataset $x$ characterized by an unknown probability function $P(x)$ and a multivariate latent encoding vector $z$, the objective is to model the data as a distribution $p_\theta(x)$, with $\theta$ defined as the set of the network parameters so that $p_\theta(x) = \int_z p_\theta(x, z)\,dz$. === Sparse autoencoder (SAE) === Inspired by the sparse coding hypothesis in neuroscience, sparse autoencoders (SAE) are variants of autoencoders, such that the codes $E_\phi(x)$ for messages tend to be sparse codes, that is, $E_\phi(x)$ is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time. Encouraging sparsity improves performance on classification tasks. There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the k-sparse autoencoder. The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder: $f_k(x_1, ..., x_n) = (x_1 b_1, ..., x_n b_n)$ where $b_i = 1$ if $|x_i|$ ranks in the top $k$, and $0$ otherwise. Backpropagating through $f_k$ is simple: set gradient to 0 for $b_i = 0$ entries, and keep gradient for $b_i = 1$ entries. This is essentially a generalized ReLU function. The other way is a relaxed version of the k-