AI Content Detection Tools

AI Content Detection Tools — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Screenless video

    Screenless video

    Screenless video is any system for transmitting visual information from a video source without the use of a screen. Screenless computing systems can be divided into three groups: Visual Image, Retinal Direct, and Synaptic Interface. == Visual image == Visual Image screenless display includes any image that the eye can perceive. The most common example of Visual Image screenless display is a hologram. In these cases, light is reflected off some intermediate object (hologram, LCD panel, or cockpit window) before it reaches the retina. In the case of LCD panels the light is refracted from the back of the panel, but is nonetheless a reflected source. Google has proposed a similar system to replace the screens of tablet computers and smartphones. == Retinal display == Virtual retinal display systems are a class of screenless displays in which images are projected directly onto the retina. They are distinguished from visual image systems because light is not reflected from some intermediate object onto the retina, it is instead projected directly onto the retina. Retinal Direct systems, once marketed, hold out the promise of extreme privacy when computing work is done in public places because most snooping relies on viewing the same light as the person who is legitimately viewing the screen, and retinal direct systems send light only into the pupils of their intended viewer. == Synaptic interface == Synaptic Interface screenless video does not use light at all. Visual information completely bypasses the eye and is transmitted directly to the brain. While such systems have only been implemented in humans in rudimentary form - for example, displaying single Braille characters to blind people – success has been achieved in sampling usable video signals from the biological eyes of a living horseshoe crab through their optic nerves, and in sending video signals from electronic cameras into the creatures' brains using the same method.

    Read more →
  • Subject indexing

    Subject indexing

    Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are about, to summarize their contents or to increase findability. In other words, the objective is to identify and describe the subject of documents. Indexes are constructed, separately, on three distinct levels: terms in a document, such as a book; objects in a collection, such as a library; and documents (such as books and articles) within a field of knowledge. Subject indexing is used in information retrieval especially to create bibliographic indexes to retrieve documents on a particular subject. Examples of academic indexing services are Zentralblatt MATH, Chemical Abstracts, and PubMed. The index terms were mostly assigned by experts but author keywords are also common. The process of indexing begins with any analysis of the subject of the document. The indexer must then identify terms that appropriately identify the subject, either by extracting words directly from the document or assigning words from a controlled vocabulary. The terms in the index are then presented in a systematic order. Indexers must decide how many terms to include and how specific the terms should be. Together this gives a depth of indexing. == Subject analysis == The first step in indexing is to decide on the subject matter of the document. In manual indexing, the indexer would consider the subject matter in terms of answer to a set of questions such as "Does the document deal with a specific product, condition or phenomenon?". As the analysis is influenced by the knowledge and experience of the indexer, it follows that two indexers may analyze the content differently and so come up with different index terms. This will impact on the success of retrieval. === Automatic vs. manual subject analysis === Automatic indexing follows set processes of analyzing frequencies of word patterns and comparing results to other documents in order to assign to subject categories. This requires no understanding of the material being indexed. This leads to more uniform indexing but at the expense of the true meaning being interpreted. A computer program will not understand the meaning of statements and may therefore fail to assign some relevant terms or assign incorrectly. Human indexers focus their attention on certain parts of the document such as the title, abstract, summary and conclusions, as analyzing the full text in depth is costly and time-consuming. An automated system takes away the time limit and allows the entire document to be analyzed, but also has the option to be directed to particular parts of the document. == Term selection == The second stage of indexing involves the translation of the subject analysis into a set of index terms. This can involve extracting from the document or assigning from a controlled vocabulary. With the ability to conduct a full text search widely available, many people have come to rely on their own expertise in conducting information searches and full text search has become very popular. Subject indexing and its experts, professional indexers, catalogers, and librarians, remains crucial to information organization and retrieval. These experts understand controlled vocabularies and are able to find information that cannot be located by full text search. The cost of expert analysis to create subject indexing is not easily compared to the cost of hardware, software and labor to manufacture a comparable set of full-text, fully searchable materials. With new web applications that allow every user to annotate documents, social tagging has gained popularity especially in the Web. One application of indexing, the book index, remains relatively unchanged despite the information revolution. === Extraction/Derived indexing === Extraction indexing involves taking words directly from the document. It uses natural language and lends itself well to automated techniques where word frequencies are calculated and those with a frequency over a pre-determined threshold are used as index terms. A stop-list containing common words (such as "the", "and") would be referred to and such stop words would be excluded as index terms. Automated extraction indexing may lead to loss of meaning of terms by indexing single words as opposed to phrases. Although it is possible to extract commonly occurring phrases, it becomes more difficult if key concepts are inconsistently worded in phrases. Automated extraction indexing also has the problem that, even with use of a stop-list to remove common words, some frequent words may not be useful for allowing discrimination between documents. For example, the term glucose is likely to occur frequently in any document related to diabetes. Therefore, use of this term would likely return most or all the documents in the database. Post-coordinated indexing where terms are combined at the time of searching would reduce this effect but the onus would be on the searcher to link appropriate terms as opposed to the information professional. In addition terms that occur infrequently may be highly significant for example a new drug may be mentioned infrequently but the novelty of the subject makes any reference significant. One method for allowing rarer terms to be included and common words to be excluded by automated techniques would be a relative frequency approach where frequency of a word in a document is compared to frequency in the database as a whole. Therefore, a term that occurs more often in a document than might be expected based on the rest of the database could then be used as an index term, and terms that occur equally frequently throughout will be excluded. Another problem with automated extraction is that it does not recognize when a concept is discussed but is not identified in the text by an indexable keyword. Since this process is based on simple string matching and involves no intellectual analysis, the resulting product is more appropriately known as a concordance than an index. === Assignment indexing === An alternative is assignment indexing where index terms are taken from a controlled vocabulary. This has the advantage of controlling for synonyms as the preferred term is indexed and synonyms or related terms direct the user to the preferred term. This means the user can find articles regardless of the specific term used by the author and saves the user from having to know and check all possible synonyms. It also removes any confusion caused by homographs by inclusion of a qualifying term. A third advantage is that it allows the linking of related terms whether they are linked by hierarchy or association, e.g. an index entry for an oral medication may list other oral medications as related terms on the same level of the hierarchy but would also link to broader terms such as treatment. Assignment indexing is used in manual indexing to improve inter-indexer consistency as different indexers will have a controlled set of terms to choose from. Controlled vocabularies do not completely remove inconsistencies as two indexers may still interpret the subject differently. == Index presentation == The final phase of indexing is to present the entries in a systematic order. This may involve linking entries. In a pre-coordinated index the indexer determines the order in which terms are linked in an entry by considering how a user may formulate their search. In a post-coordinated index, the entries are presented singly and the user can link the entries through searches, most commonly carried out by computer software. Post-coordination results in a loss of precision in comparison to pre-coordination. == Depth of indexing == Indexers must make decisions about what entries should be included and how many entries an index should incorporate. The depth of indexing describes the thoroughness of the indexing process with reference to exhaustivity and specificity. === Exhaustivity === An exhaustive index is one which lists all possible index terms. Greater exhaustivity gives a higher recall, or more likelihood of all the relevant articles being retrieved, however, this occurs at the expense of precision. This means that the user may retrieve a larger number of irrelevant documents or documents which only deal with the subject in little depth. In a manual system a greater level of exhaustivity brings with it a greater cost as more man-hours are required. The additional time taken in an automated system would be much less significant. At the other end of the scale, in a selective index only the most important aspects are covered. Recall is reduced in a selective index as if an indexer does not include enough terms, a highly relevant article may be overlooked. Therefore, indexers should strive for a balance and consider what the document may be used. They may also have to consider the implications of time and expense. === Specificity === The specificity describes how closel

    Read more →
  • Single-source publishing

    Single-source publishing

    Single-source publishing, also known as single-sourcing publishing, is a content management method which allows the same source content to be used across different forms of media and more than one time. The labor-intensive and expensive work of editing need only be carried out once, on only one document; that source document (the single source of truth) can then be stored in one place and reused. This reduces the potential for error, as corrections are only made one time in the source document. The benefits of single-source publishing primarily relate to the editor rather than the user. The user benefits from the consistency that single-sourcing brings to terminology and information. This assumes the content manager has applied an organized conceptualization to the underlying content (A poor conceptualization can make single-source publishing less useful). Single-source publishing is sometimes used synonymously with multi-channel publishing though whether or not the two terms are synonymous is a matter of discussion. == Definition == While there is a general definition of single-source publishing, there is no single official delineation between single-source publishing and multi-channel publishing, nor are there any official governing bodies to provide such a delineation. Single-source publishing is most often understood as the creation of one source document in an authoring tool and converting that document into different file formats or human languages (or both) multiple times with minimal effort. Multi-channel publishing can either be seen as synonymous with single-source publishing, or similar in that there is one source document but the process itself results in more than a mere reproduction of that source. == History == The origins of single-source publishing lie, indirectly, with the release of Windows 3.0 in 1990. With the eclipsing of MS-DOS by graphical user interfaces, help files went from being unreadable text along the bottom of the screen to hypertext systems such as WinHelp. On-screen help interfaces allowed software companies to cease the printing of large, expensive help manuals with their products, reducing costs for both producer and consumer. This system raised opportunities as well, and many developers fundamentally changed the way they thought about publishing. Writers of software documentation did not simply move from being writers of traditional bound books to writers of electronic publishing, but rather they became authors of central documents which could be reused multiple times across multiple formats. The first single-source publishing project was started in 1993 by Cornelia Hofmann at Schneider Electric in Seligenstadt, using software based on Interleaf to automatically create paper documentation in multiple languages based on a single original source file. XML, developed during the mid- to late-1990s, was also significant to the development of single-source publishing as a method. XML, a markup language, allows developers to separate their documentation into two layers: a shell-like layer based on presentation and a core-like layer based on the actual written content. This method allows developers to write the content only one time while switching it in and out of multiple different formats and delivery methods. In the mid-1990s, several firms began creating and using single-source content for technical documentation (Boeing Helicopter, Sikorsky Aviation and Pratt & Whitney Canada) and user manuals (Ford owners manuals) based on tagged SGML and XML content generated using the Arbortext Epic editor with add-on functions developed by a contractor. The concept behind this usage was that complex, hierarchical content that did not lend itself to discrete componentization could be used across a variety of requirements by tagging the differences within a single document using the capabilities built into SGML and XML. Ford, for example, was able to tag its single owner's manual files so that 12 model years could be generated via a resolution script running on the single completed file. Pratt & Whitney, likewise, was able to tag up to 20 subsets of its jet engine manuals in single-source files, calling out the desired version at publication time. World Book Encyclopedia also used the concept to tag its articles for American and British versions of English. Starting from the early 2000s, single-source publishing was used with an increasing frequency in the field of technical translation. It is still regarded as the most efficient method of publishing the same material in different languages. Once a printed manual was translated, for example, the online help for the software program which the manual accompanies could be automatically generated using the method. Metadata could be created for an entire manual and individual pages or files could then be translated from that metadata with only one step, removing the need to recreate information or even database structures. Although single-source publishing is now decades old, its importance has increased urgently as of the 2010s. As consumption of information products rises and the number of target audiences expands, so does the work of developers and content creators. Within the industry of software and its documentation, there is a perception that the choice is to embrace single-source publishing or render one's operations obsolete. == Criticism == Editors using single-source publishing have been criticized for below-standard work quality, leading some critics to describe single-source publishing as the "conveyor belt assembly" of content creation. While heavily used in technical translation, there are risks of error in regard to indexing. While two words might be synonyms in English, they may not be synonyms in another language. In a document produced via single-sourcing, the index will be translated automatically and the two words will be rendered as synonyms. This is because they are synonyms in the source language, while in the target language they are not.

    Read more →
  • Least-squares spectral analysis

    Least-squares spectral analysis

    Least-squares spectral analysis (LSSA) is a class of methods for estimating a frequency spectrum by fitting sinusoids to data using a least-squares fit. Unlike Fourier analysis, the most widely used spectral method in science, data need not be equally spaced to use LSSA. Furthermore, while Fourier analysis generally amplifies long-period noise in long or gapped records, LSSA mitigates such problems. The first strictly least-squares LSSA method was developed in 1969 and 1971, and is known as the Vaníček method or the Gauss–Vaniček method, after its inventor Petr Vaníček and Carl Friedrich Gauss, the inventor of the least-squares method for error minimization. A widely known LSSA variant is the Lomb method or the Lomb–Scargle periodogram, based on dated computational simplifications of the Vaníček method introduced in the 1970s and 1980s, first by Nicholas R. Lomb and later by Jeffrey D. Scargle. Other LSSA variants have been subsequently developed. == Historical background == The close connections between Fourier analysis, the periodogram, and the least-squares fitting of sinusoids have been known for a long time. However, most developments are restricted to complete data sets of equally spaced samples. In 1963, Freek J. M. Barning of Mathematisch Centrum, Amsterdam, handled unequally spaced data by similar techniques, including both a periodogram analysis equivalent to what nowadays is called the Lomb method and least-squares fitting of selected frequencies of sinusoids determined from such periodograms — and connected by a procedure known today as the matching pursuit with post-back fitting or the orthogonal matching pursuit. Petr Vaníček, a Canadian geophysicist and geodesist of the University of New Brunswick, proposed in 1969 also the matching-pursuit approach for equally and unequally spaced data, which he called "successive spectral analysis" and the result a "least-squares periodogram". He generalized this method to account for any systematic components beyond a simple mean, such as a "predicted linear (quadratic, exponential, ...) secular trend of unknown magnitude", and applied it to a variety of samples, in 1971. Vaníček's strictly least-squares method was then simplified in 1976 by Nicholas R. Lomb of the University of Sydney, who pointed out its close connection to periodogram analysis. Subsequently, the definition of a periodogram of unequally spaced data was modified and analyzed by Jeffrey D. Scargle of NASA Ames Research Center, who showed that, with minor changes, it becomes identical to Lomb's least-squares formula for fitting individual sinusoid frequencies. Scargle states that his paper "does not introduce a new detection technique, but instead studies the reliability and efficiency of detection with the most commonly used technique, the periodogram, in the case where the observation times are unevenly spaced," and further points out regarding least-squares fitting of sinusoids compared to periodogram analysis, that his paper "establishes, apparently for the first time, that (with the proposed modifications) these two methods are exactly equivalent." Press summarizes the development this way: A completely different method of spectral analysis for unevenly sampled data, one that mitigates these difficulties and has some other very desirable properties, was developed by Lomb, based in part on earlier work by Barning and Vanicek, and additionally elaborated by Scargle. In 1989, Michael J. Korenberg of Queen's University in Kingston, Ontario, developed the "fast orthogonal search" method of more quickly finding a near-optimal decomposition of spectra or other problems, similar to the technique that later became known as the orthogonal matching pursuit. == Development of LSSA and variants == === The Vaníček method === In the Vaníček method, a discrete data set is approximated by a weighted sum of sinusoids of progressively determined frequencies using a standard linear regression or least-squares fit. The frequencies are chosen using a method similar to Barning's, but going further in optimizing the choice of each successive new frequency by picking the frequency that minimizes the residual after least-squares fitting (equivalent to the fitting technique now known as matching pursuit with pre-backfitting). The number of sinusoids must be less than or equal to the number of data samples (counting sines and cosines of the same frequency as separate sinusoids). The relationship between the DFT and the approximation of trigonometric functions using the least-squares method is well explained in (Strutz, 2017). A data vector Φ is represented as a weighted sum of sinusoidal basis functions, tabulated in a matrix A by evaluating each function at the sample times, with weight vector x: ϕ ≈ A x , {\displaystyle \phi \approx {\textbf {A}}x,} where the weights vector x is chosen to minimize the sum of squared errors in approximating Φ. The solution for x is closed-form, using standard linear regression: x = ( A T A ) − 1 A T ϕ . {\displaystyle x=({\textbf {A}}^{\mathrm {T} }{\textbf {A}})^{-1}{\textbf {A}}^{\mathrm {T} }\phi .} Here the matrix A can be based on any set of functions mutually independent (not necessarily orthogonal) when evaluated at the sample times; functions used for spectral analysis are typically sines and cosines evenly distributed over the frequency range of interest. If we choose too many frequencies in a too-narrow frequency range, the functions will be insufficiently independent, the matrix ill-conditioned, and the resulting spectrum meaningless. When the basis functions in A are orthogonal (that is, not correlated, meaning the columns have zero pair-wise dot products), the matrix ATA is diagonal; when the columns all have the same power (sum of squares of elements), then that matrix is an identity matrix times a constant, so the inversion is trivial. The latter is the case when the sample times are equally spaced and sinusoids chosen as sines and cosines equally spaced in pairs on the frequency interval 0 to a half cycle per sample (spaced by 1/N cycles per sample, omitting the sine phases at 0 and maximum frequency where they are identically zero). This case is known as the discrete Fourier transform, slightly rewritten in terms of measurements and coefficients. x = A T ϕ {\displaystyle x={\textbf {A}}^{\mathrm {T} }\phi } — DFT case for N equally spaced samples and frequencies, within a scalar factor. === The Lomb method === Trying to lower the computational burden of the Vaníček method in 1976 (no longer an issue), Lomb proposed using the above simplification in general, except for pair-wise correlations between sine and cosine bases of the same frequency, since the correlations between pairs of sinusoids are often small, at least when they are not tightly spaced. This formulation is essentially that of the traditional periodogram but adapted for use with unevenly spaced samples. The vector x is a reasonably good estimate of an underlying spectrum, but since we ignore any correlations, Ax is no longer a good approximation to the signal, and the method is no longer a least-squares method — yet in the literature continues to be referred to as such. Rather than just taking dot products of the data with sine and cosine waveforms directly, Scargle modified the standard periodogram formula so to find a time delay τ {\displaystyle \tau } first, such that this pair of sinusoids would be mutually orthogonal at sample times t j {\displaystyle t_{j}} and also adjusted for the potentially unequal powers of these two basis functions, to obtain a better estimate of the power at a frequency. This procedure made his modified periodogram method exactly equivalent to Lomb's method. Time delay τ {\displaystyle \tau } by definition equals to tan ⁡ 2 ω τ = ∑ j sin ⁡ 2 ω t j ∑ j cos ⁡ 2 ω t j . {\displaystyle \tan {2\omega \tau }={\frac {\sum _{j}\sin 2\omega t_{j}}{\sum _{j}\cos 2\omega t_{j}}}.} Then the periodogram at frequency ω {\displaystyle \omega } is estimated as: P x ( ω ) = 1 2 [ [ ∑ j X j cos ⁡ ω ( t j − τ ) ] 2 ∑ j cos 2 ⁡ ω ( t j − τ ) + [ ∑ j X j sin ⁡ ω ( t j − τ ) ] 2 ∑ j sin 2 ⁡ ω ( t j − τ ) ] , {\displaystyle P_{x}(\omega )={\frac {1}{2}}\left[{\frac {\left[\sum _{j}X_{j}\cos \omega (t_{j}-\tau )\right]^{2}}{\sum _{j}\cos ^{2}\omega (t_{j}-\tau )}}+{\frac {\left[\sum _{j}X_{j}\sin \omega (t_{j}-\tau )\right]^{2}}{\sum _{j}\sin ^{2}\omega (t_{j}-\tau )}}\right],} which, as Scargle reports, has the same statistical distribution as the periodogram in the evenly sampled case. At any individual frequency ω {\displaystyle \omega } , this method gives the same power as does a least-squares fit to sinusoids of that frequency and of the form: ϕ ( t ) = A sin ⁡ ω t + B cos ⁡ ω t . {\displaystyle \phi (t)=A\sin \omega t+B\cos \omega t.} In practice, it is always difficult to judge if a given Lomb peak is significant or not, especially when the nature of the noise is unknown, so for example a false-alarm spectr

    Read more →
  • Vx-underground

    Vx-underground

    vx-underground, also known as VXUG, is an educational website about malware and cybersecurity. It claims to have the largest online repository of malware. The site was launched in May, 2019 and has grown to host over 35 million pieces of malware samples. On their account on Twitter, VXUG reports on and verifies cybersecurity breaches. == Reception == Kim Crawley compared the site to VirusTotal and states that vx-underground is more susceptible to suspicion for law enforcement. == Data breach reports == In May 2024, the International Baccalaureate organizations faced allegations over supposed breaches in their IT infrastructure after an incident of examination leaks. Upon inspecting leaked data, VXUG were the first to report that the breach seemed legitimate on the morning of May 6.

    Read more →
  • Operational database

    Operational database

    Operational database management systems (also referred to as OLTP databases or online transaction processing databases), are used to update data in real-time. These types of databases allow users to do more than simply view archived data. Operational databases allow you to modify that data (add, change or delete data), doing it in real-time. OLTP databases provide transactions as main abstraction to guarantee data consistency that guarantee the so-called ACID properties. Basically, the consistency of the data is guaranteed in the case of failures and/or concurrent access to the data. == History == Since the early 1990s, the operational database software market has been largely taken over by SQL engines. In 2014, the operational DBMS market (formerly OLTP) was evolving dramatically, with new, innovative entrants and incumbents supporting the growing use of unstructured data and NoSQL DBMS engines, as well as XML databases and NewSQL databases. NoSQL databases typically have focused on scalability and have renounced to data consistency by not providing transactions as OLTP system do. Operational databases are increasingly supporting distributed database architecture that can leverage distribution to provide high availability and fault tolerance through replication and scale out ability. The growing role of operational databases in the IT industry is moving fast from legacy databases to real-time operational databases capable to handle distributed web and mobile demand and to address Big data challenges. Recognizing this, Gartner started to publish the Magic Quadrant for Operational Database Management Systems in October 2013. == List of operational databases == Notable operational databases include: == Use in business == Operational databases are used to store, manage and track real-time business information. For example, a company might have an operational database used to track warehouse/stock quantities. As customers order products from an online web store, an operational database can be used to keep track of how many items have been sold and when the company will need to reorder stock. An operational database stores information about the activities of an organization, for example customer relationship management transactions or financial operations, in a computer database. Operational databases allow a business to enter, gather, and retrieve large quantities of specific information, such as company legal data, financial data, call data records, personal employee information, sales data, customer data, data on assets and many other information. An important feature of storing information in an operational database is the ability to share information across the company and over the Internet. Operational databases can be used to manage mission-critical business data, to monitor activities, to audit suspicious transactions, or to review the history of dealings with a particular customer. They can also be part of the actual process of making and fulfilling a purchase, for example in e-commerce. == Data warehouse terminology == In data warehousing, the term is even more specific: the operational database is the one which is accessed by an operational system (for example a customer-facing website or the application used by the customer service department) to carry out regular operations of an organization. Operational databases usually use an online transaction processing database which is optimized for faster transaction processing (create, read, update and delete operations). An operational database is the source for a data warehouse. Data from an operational database can be loaded into an operational data store at a data warehouse before the data is processed into the data warehouse.

    Read more →
  • Least-squares spectral analysis

    Least-squares spectral analysis

    Least-squares spectral analysis (LSSA) is a class of methods for estimating a frequency spectrum by fitting sinusoids to data using a least-squares fit. Unlike Fourier analysis, the most widely used spectral method in science, data need not be equally spaced to use LSSA. Furthermore, while Fourier analysis generally amplifies long-period noise in long or gapped records, LSSA mitigates such problems. The first strictly least-squares LSSA method was developed in 1969 and 1971, and is known as the Vaníček method or the Gauss–Vaniček method, after its inventor Petr Vaníček and Carl Friedrich Gauss, the inventor of the least-squares method for error minimization. A widely known LSSA variant is the Lomb method or the Lomb–Scargle periodogram, based on dated computational simplifications of the Vaníček method introduced in the 1970s and 1980s, first by Nicholas R. Lomb and later by Jeffrey D. Scargle. Other LSSA variants have been subsequently developed. == Historical background == The close connections between Fourier analysis, the periodogram, and the least-squares fitting of sinusoids have been known for a long time. However, most developments are restricted to complete data sets of equally spaced samples. In 1963, Freek J. M. Barning of Mathematisch Centrum, Amsterdam, handled unequally spaced data by similar techniques, including both a periodogram analysis equivalent to what nowadays is called the Lomb method and least-squares fitting of selected frequencies of sinusoids determined from such periodograms — and connected by a procedure known today as the matching pursuit with post-back fitting or the orthogonal matching pursuit. Petr Vaníček, a Canadian geophysicist and geodesist of the University of New Brunswick, proposed in 1969 also the matching-pursuit approach for equally and unequally spaced data, which he called "successive spectral analysis" and the result a "least-squares periodogram". He generalized this method to account for any systematic components beyond a simple mean, such as a "predicted linear (quadratic, exponential, ...) secular trend of unknown magnitude", and applied it to a variety of samples, in 1971. Vaníček's strictly least-squares method was then simplified in 1976 by Nicholas R. Lomb of the University of Sydney, who pointed out its close connection to periodogram analysis. Subsequently, the definition of a periodogram of unequally spaced data was modified and analyzed by Jeffrey D. Scargle of NASA Ames Research Center, who showed that, with minor changes, it becomes identical to Lomb's least-squares formula for fitting individual sinusoid frequencies. Scargle states that his paper "does not introduce a new detection technique, but instead studies the reliability and efficiency of detection with the most commonly used technique, the periodogram, in the case where the observation times are unevenly spaced," and further points out regarding least-squares fitting of sinusoids compared to periodogram analysis, that his paper "establishes, apparently for the first time, that (with the proposed modifications) these two methods are exactly equivalent." Press summarizes the development this way: A completely different method of spectral analysis for unevenly sampled data, one that mitigates these difficulties and has some other very desirable properties, was developed by Lomb, based in part on earlier work by Barning and Vanicek, and additionally elaborated by Scargle. In 1989, Michael J. Korenberg of Queen's University in Kingston, Ontario, developed the "fast orthogonal search" method of more quickly finding a near-optimal decomposition of spectra or other problems, similar to the technique that later became known as the orthogonal matching pursuit. == Development of LSSA and variants == === The Vaníček method === In the Vaníček method, a discrete data set is approximated by a weighted sum of sinusoids of progressively determined frequencies using a standard linear regression or least-squares fit. The frequencies are chosen using a method similar to Barning's, but going further in optimizing the choice of each successive new frequency by picking the frequency that minimizes the residual after least-squares fitting (equivalent to the fitting technique now known as matching pursuit with pre-backfitting). The number of sinusoids must be less than or equal to the number of data samples (counting sines and cosines of the same frequency as separate sinusoids). The relationship between the DFT and the approximation of trigonometric functions using the least-squares method is well explained in (Strutz, 2017). A data vector Φ is represented as a weighted sum of sinusoidal basis functions, tabulated in a matrix A by evaluating each function at the sample times, with weight vector x: ϕ ≈ A x , {\displaystyle \phi \approx {\textbf {A}}x,} where the weights vector x is chosen to minimize the sum of squared errors in approximating Φ. The solution for x is closed-form, using standard linear regression: x = ( A T A ) − 1 A T ϕ . {\displaystyle x=({\textbf {A}}^{\mathrm {T} }{\textbf {A}})^{-1}{\textbf {A}}^{\mathrm {T} }\phi .} Here the matrix A can be based on any set of functions mutually independent (not necessarily orthogonal) when evaluated at the sample times; functions used for spectral analysis are typically sines and cosines evenly distributed over the frequency range of interest. If we choose too many frequencies in a too-narrow frequency range, the functions will be insufficiently independent, the matrix ill-conditioned, and the resulting spectrum meaningless. When the basis functions in A are orthogonal (that is, not correlated, meaning the columns have zero pair-wise dot products), the matrix ATA is diagonal; when the columns all have the same power (sum of squares of elements), then that matrix is an identity matrix times a constant, so the inversion is trivial. The latter is the case when the sample times are equally spaced and sinusoids chosen as sines and cosines equally spaced in pairs on the frequency interval 0 to a half cycle per sample (spaced by 1/N cycles per sample, omitting the sine phases at 0 and maximum frequency where they are identically zero). This case is known as the discrete Fourier transform, slightly rewritten in terms of measurements and coefficients. x = A T ϕ {\displaystyle x={\textbf {A}}^{\mathrm {T} }\phi } — DFT case for N equally spaced samples and frequencies, within a scalar factor. === The Lomb method === Trying to lower the computational burden of the Vaníček method in 1976 (no longer an issue), Lomb proposed using the above simplification in general, except for pair-wise correlations between sine and cosine bases of the same frequency, since the correlations between pairs of sinusoids are often small, at least when they are not tightly spaced. This formulation is essentially that of the traditional periodogram but adapted for use with unevenly spaced samples. The vector x is a reasonably good estimate of an underlying spectrum, but since we ignore any correlations, Ax is no longer a good approximation to the signal, and the method is no longer a least-squares method — yet in the literature continues to be referred to as such. Rather than just taking dot products of the data with sine and cosine waveforms directly, Scargle modified the standard periodogram formula so to find a time delay τ {\displaystyle \tau } first, such that this pair of sinusoids would be mutually orthogonal at sample times t j {\displaystyle t_{j}} and also adjusted for the potentially unequal powers of these two basis functions, to obtain a better estimate of the power at a frequency. This procedure made his modified periodogram method exactly equivalent to Lomb's method. Time delay τ {\displaystyle \tau } by definition equals to tan ⁡ 2 ω τ = ∑ j sin ⁡ 2 ω t j ∑ j cos ⁡ 2 ω t j . {\displaystyle \tan {2\omega \tau }={\frac {\sum _{j}\sin 2\omega t_{j}}{\sum _{j}\cos 2\omega t_{j}}}.} Then the periodogram at frequency ω {\displaystyle \omega } is estimated as: P x ( ω ) = 1 2 [ [ ∑ j X j cos ⁡ ω ( t j − τ ) ] 2 ∑ j cos 2 ⁡ ω ( t j − τ ) + [ ∑ j X j sin ⁡ ω ( t j − τ ) ] 2 ∑ j sin 2 ⁡ ω ( t j − τ ) ] , {\displaystyle P_{x}(\omega )={\frac {1}{2}}\left[{\frac {\left[\sum _{j}X_{j}\cos \omega (t_{j}-\tau )\right]^{2}}{\sum _{j}\cos ^{2}\omega (t_{j}-\tau )}}+{\frac {\left[\sum _{j}X_{j}\sin \omega (t_{j}-\tau )\right]^{2}}{\sum _{j}\sin ^{2}\omega (t_{j}-\tau )}}\right],} which, as Scargle reports, has the same statistical distribution as the periodogram in the evenly sampled case. At any individual frequency ω {\displaystyle \omega } , this method gives the same power as does a least-squares fit to sinusoids of that frequency and of the form: ϕ ( t ) = A sin ⁡ ω t + B cos ⁡ ω t . {\displaystyle \phi (t)=A\sin \omega t+B\cos \omega t.} In practice, it is always difficult to judge if a given Lomb peak is significant or not, especially when the nature of the noise is unknown, so for example a false-alarm spectr

    Read more →
  • Artificial intelligence in India

    Artificial intelligence in India

    The artificial intelligence (AI) market in India is projected to reach $8 billion by 2025, growing at 40% CAGR from 2020 to 2025. This growth is part of the broader AI boom, a global period of rapid technological advancements with India being pioneer starting in the early 2010s with NLP based Chatbots from Haptik, Corover.ai, Niki.ai and then gaining prominence in the early 2020s based on reinforcement learning, marked by breakthroughs such as generative AI models from Krutrim, Sarvam, CoRover, OpenAI and Alphafold by Google DeepMind. In India, the development of AI has been similarly transformative, with applications in healthcare, finance, and education, bolstered by government initiatives like NITI Aayog's 2018 National Strategy for Artificial Intelligence. Institutions such as the Indian Statistical Institute and the Indian Institute of Science published breakthrough AI research papers and patents. India's transformation to AI is primarily being driven by startups and government initiatives & policies like Digital India. By fostering technological trust through digital public infrastructure, India is tackling socioeconomic issues by taking a bottom-up approach to AI. NASSCOM and Boston Consulting Group estimate that by 2027, India's AI services might be valued at $17 billion. According to 2025 Technology and Innovation Report, by UN Trade and Development, India ranks 10th globally for private sector investments in AI. According to Mary Meeker, India has emerged as a key market for AI platforms, accounting for the largest share of ChatGPT's mobile app users and having the third-largest user base for DeepSeek in 2025. While AI presents significant opportunities for economic growth and social development in India, challenges such as data privacy concerns, skill shortages, and ethical considerations need to be addressed for responsible AI deployment. The growth of AI in India has also led to an increase in the number of cyberattacks that use AI to target organizations. == History == === Early days (1960s-1980s) === The TIFRAC (Tata Institute of Fundamental Research Automatic Calculator) was designed and developed by a team led by Rangaswamy Narasimhan between 1954 and 1960. He worked on pattern recognition from 1961 to 1964 at the University of Illinois Urbana-Champaign's Digital Computer Laboratory. In order to conduct research on database technology, computer networking, computer graphics, and systems software, he and M. G. K. Menon founded the National Centre for Software Development and Computing Techniques. In 1965, he established the Computer Society of India and supervised the initial research work on AI at Tata Institute of Fundamental Research. Jagdish Lal launched the first computer science program in 1976 at Motilal Nehru Regional Engineering College. H. K. Kesavan from the University of Waterloo and Vaidyeswaran Rajaraman from the University of Wisconsin–Madison joined the IIT Kanpur Electrical Engineering Department in 1963–1964 as Assistant Professor and Head of Department, respectively. H.N. Mahabala, who was employed at Bendix Corporation's Computer Division, joined the department in 1965. He previously worked with Marvin Minsky. The IIT Kanpur Computer Center was led by H. K. Kesavan, with Vaidyeswaran Rajaraman serving as his deputy. Kesavan informally permitted Rajaraman and Mahabala to introduce artificial intelligence into computer science classes. The computer science program was approved by IIT Kanpur in 1971 and split out from the electrical engineering department. In 1973, an IBM System/370 Model 155 was installed at IIT Madras. John McCarthy, head of the Artificial Intelligence Laboratory at Stanford University visited IIT Kanpur in 1971. He donated PDP-1 with a time-sharing operating system. During the 1970s, the balance of payments deficit in India restricted import of computers. The Department of Computer Science and Automation at the Indian Institute of Science established in 1969, played an important role in nurturing the development of data science and artificial intelligence in India. First course on AI was introduced in the 1970s by G. Krishna. B. L. Deekshatulu introduced the first course on pattern recognition in the early 1970s. === Foundation phase === ==== 1980s ==== In the 1980s, the Indian Statistical Institute's Optical Character Recognition Project was one of the country's first attempts at studying artificial intelligence and machine learning. OCR technology has benefited greatly from the work of ISI's Computer Vision and Pattern Recognition Unit, which is headed by Bidyut Baran Chaudhuri. He also contributed in the development of computer vision and digital image processing. As part of the Indian Fifth Generation Computer Systems Research Programme, the Department of Electronics, with support from the United Nations Development Programme, initiated the Knowledge Based Computer Systems Project in 1986, marking the beginning of India's first major AI research program. Prime Minister Rajiv Gandhi requested that the Department of Electronics and IISc to initiate the Parallel Processing Project in 1986–1987. The Center for Development of Advanced Computing eventually joined those efforts. IIT Madras was selected to develop system diagnosis, ISI for image processing, National Centre for Software Technology for natural language processing and TIFR for speech processing. In 1987, the proposal of N. Seshagiri, Director General of the National Informatics Centre for the prototype development of supercomputer was cleared. Negotiations for a Cray supercomputer were underway between the Reagan administration and the Rajiv Gandhi government. US Defense Secretaries Frank Carlucci and Caspar Weinberger visited New Delhi after the US approved the transfer in 1988. The sale of a lower-end XMP-14 supercomputer was permitted in lieu of the Cray XMP-24 supercomputer due to security concerns. The Center for Development of Advanced Computing was formally established in March 1988 by the Ministry of Communications and Information Technology (previously the Ministry of IT) within the Department of Information Technology (formerly the Department of Electronics) in response to a recommendation made to the Prime Minister by the Scientific Advisory Council. The National Initiative in Supercomputing, which produced the PARAM series, was led by Vijay P. Bhatkar. For the first ten years, supercomputing and Indian language computing were the two main focus areas. C-DAC has expanded its operations in order to meet the needs in a number of domains, including network and internet software, real-time systems, artificial intelligence, and NLP. Under the direction of Professor KV Ramakrishnamacharyulu from National Sanskrit University and Professor Rajeev Sangal from the International Institute of Information Technology, Hyderabad, the Akshar Bharati Research Group was established in 1984 with support from IIT Kanpur and the University of Hyderabad for computational processing of Indian languages. They focused on computational linguistics, NLP with ontological database systems, and Indian language/translation theories with linguistic tradition. ==== 1990s ==== From IIT Kanpur, Mohan Tambe joined C-DAC in the 1990s to work on Graphics and Intelligence based Script Technology (GIST), which addressed the challenge of adapting personal computer software based on Latin script to Devanagiri and a number of other Indian language scripts. He was previously working on the Machine Translation for Indian languages Project. Within C-DAC, he established the GIST group. The technology was expanded to encompass NLP, artificial intelligence-based machine-aided language learning and translation, multimedia and multilingual computing solutions, and more. GIST resulted in the creation of G-CLASS (GIST cross language search plug-ins suite), a cross-language search engine. The Applied Artificial Intelligence Group at C-DAC has developed some basic and novel applications in the field of NLP, including machine translation, information extraction/retrieval, automatic summarization, speech recognition, text-to-speech synthesis, intelligent language teaching, and natural language-based document management with Decision Support Systems. These applications are the result of the foundation laid by previous language technology activities. Software firms in the Indian private sector began looking into AI applications, mostly in the area of business process automation. In order to allow machines to read, comprehend, and interpret human languages, the Language Technologies Research Center was founded in October 1999 at the International Institute of Information Technology, Hyderabad. It focused on the advancements in semantic parsing, information extraction, natural language generation, sentiment analysis, and dialogue systems. Some of the early AI research in India was driven by societal needs. For example; Eklavya, a knowledge-based program created by I

    Read more →
  • Admissible heuristic

    Admissible heuristic

    In computer science, specifically in algorithms related to pathfinding, a heuristic function is said to be admissible if it never overestimates the cost of reaching the goal, i.e. the cost it estimates to reach the goal is not higher than the lowest possible cost from the current point in the path. In other words, it should act as a lower bound. It is related to the concept of consistent heuristics. While all consistent heuristics are admissible, not all admissible heuristics are consistent. == Search algorithms == An admissible heuristic is used to estimate the cost of reaching the goal state in an informed search algorithm. In order for a heuristic to be admissible to the search problem, the estimated cost must always be lower than or equal to the actual cost of reaching the goal state. The search algorithm uses the admissible heuristic to find an estimated optimal path to the goal state from the current node. For example, in A search the evaluation function (where n {\displaystyle n} is the current node) is: f ( n ) = g ( n ) + h ( n ) {\displaystyle f(n)=g(n)+h(n)} where f ( n ) {\displaystyle f(n)} = the evaluation function. g ( n ) {\displaystyle g(n)} = the cost from the start node to the current node h ( n ) {\displaystyle h(n)} = estimated cost from current node to goal. h ( n ) {\displaystyle h(n)} is calculated using the heuristic function. With a non-admissible heuristic, the A algorithm could overlook the optimal solution to a search problem due to an overestimation in f ( n ) {\displaystyle f(n)} . == Formulation == n {\displaystyle n} is a node h {\displaystyle h} is a heuristic h ( n ) {\displaystyle h(n)} is cost indicated by h {\displaystyle h} to reach a goal from n {\displaystyle n} h ∗ ( n ) {\displaystyle h^{}(n)} is the optimal cost to reach a goal from n {\displaystyle n} h ( n ) {\displaystyle h(n)} is admissible if, ∀ n {\displaystyle \forall n} h ( n ) ≤ h ∗ ( n ) {\displaystyle h(n)\leq h^{}(n)} == Construction == An admissible heuristic can be derived from a relaxed version of the problem, or by information from pattern databases that store exact solutions to subproblems of the problem, or by using inductive learning methods. == Examples == Two different examples of admissible heuristics apply to the fifteen puzzle problem: Hamming distance Manhattan distance The Hamming distance is the total number of misplaced tiles. It is clear that this heuristic is admissible since the total number of moves to order the tiles correctly is at least the number of misplaced tiles (each tile not in place must be moved at least once). The cost (number of moves) to the goal (an ordered puzzle) is at least the Hamming distance of the puzzle. The Manhattan distance of a puzzle is defined as: h ( n ) = ∑ all tiles d i s t a n c e ( tile, correct position ) {\displaystyle h(n)=\sum _{\text{all tiles}}{\mathit {distance}}({\text{tile, correct position}})} Consider the puzzle below in which the player wishes to move each tile such that the numbers are ordered. The Manhattan distance is an admissible heuristic in this case because every tile will have to be moved at least the number of spots in between itself and its correct position. The subscripts show the Manhattan distance for each tile. The total Manhattan distance for the shown puzzle is: h ( n ) = 3 + 1 + 0 + 1 + 2 + 3 + 3 + 4 + 3 + 2 + 4 + 4 + 4 + 1 + 1 = 36 {\displaystyle h(n)=3+1+0+1+2+3+3+4+3+2+4+4+4+1+1=36} == Optimality proof == If an admissible heuristic is used in an algorithm that, per iteration, progresses only the path of lowest evaluation (current cost + heuristic) of several candidate paths, terminates the moment its exploration reaches the goal and, crucially, closes all optimal paths before terminating (something that's possible with A search algorithm if special care isn't taken), then this algorithm can only terminate on an optimal path. To see why, consider the following proof by contradiction: Assume such an algorithm managed to terminate on a path T with a true cost Ttrue greater than the optimal path S with true cost Strue. This means that before terminating, the evaluated cost of T was less than or equal to the evaluated cost of S (or else S would have been picked). Denote these evaluated costs Teval and Seval respectively. The above can be summarized as follows, Strue < Ttrue Teval ≤ Seval If our heuristic is admissible it follows that at this penultimate step Teval = Ttrue because any increase on the true cost by the heuristic on T would be inadmissible and the heuristic cannot be negative. On the other hand, an admissible heuristic would require that Seval ≤ Strue which combined with the above inequalities gives us Teval < Ttrue and more specifically Teval ≠ Ttrue. As Teval and Ttrue cannot be both equal and unequal our assumption must have been false and so it must be impossible to terminate on a more costly than optimal path. As an example, let us say we have costs as follows:(the cost above/below a node is the heuristic, the cost at an edge is the actual cost) 0 10 0 100 0 START ---- O ----- GOAL | | 0| |100 | | O ------- O ------ O 100 1 100 1 100 So clearly we would start off visiting the top middle node, since the expected total cost, i.e. f ( n ) {\displaystyle f(n)} , is 10 + 0 = 10 {\displaystyle 10+0=10} . Then the goal would be a candidate, with f ( n ) {\displaystyle f(n)} equal to 10 + 100 + 0 = 110 {\displaystyle 10+100+0=110} . Then we would clearly pick the bottom nodes one after the other, followed by the updated goal, since they all have f ( n ) {\displaystyle f(n)} lower than the f ( n ) {\displaystyle f(n)} of the current goal, i.e. their f ( n ) {\displaystyle f(n)} is 100 , 101 , 102 , 102 {\displaystyle 100,101,102,102} . So even though the goal was a candidate, we could not pick it because there were still better paths out there. This way, an admissible heuristic can ensure optimality. However, note that although an admissible heuristic can guarantee final optimality, it is not necessarily efficient.

    Read more →
  • Flajolet–Martin algorithm

    Flajolet–Martin algorithm

    The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements in a stream with a single pass and space-consumption logarithmic in the maximal number of possible distinct elements in the stream (the count-distinct problem). The algorithm was introduced by Philippe Flajolet and G. Nigel Martin in their 1984 article "Probabilistic Counting Algorithms for Data Base Applications". Later it has been refined in "LogLog counting of large cardinalities" by Marianne Durand and Philippe Flajolet, and "HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm" by Philippe Flajolet et al. In their 2010 article "An optimal algorithm for the distinct elements problem", Daniel M. Kane, Jelani Nelson and David P. Woodruff give an improved algorithm, which uses nearly optimal space and has optimal O(1) update and reporting times. == The algorithm == Assume that we are given a hash function h a s h ( x ) {\displaystyle \mathrm {hash} (x)} that maps input x {\displaystyle x} to integers in the range [ 0 ; 2 L − 1 ] {\displaystyle [0;2^{L}-1]} , and where the outputs are sufficiently uniformly distributed. Note that the set of integers from 0 to 2 L − 1 {\displaystyle 2^{L}-1} corresponds to the set of binary strings of length L {\displaystyle L} . For any non-negative integer y {\displaystyle y} , define b i t ( y , k ) {\displaystyle \mathrm {bit} (y,k)} to be the k {\displaystyle k} -th bit in the binary representation of y {\displaystyle y} , such that: y = ∑ k ≥ 0 b i t ( y , k ) 2 k . {\displaystyle y=\sum _{k\geq 0}\mathrm {bit} (y,k)2^{k}.} We then define a function ρ ( y ) {\displaystyle \rho (y)} that outputs the position of the least-significant set bit in the binary representation of y {\displaystyle y} , and L {\displaystyle L} if no such set bit can be found as all bits are zero: ρ ( y ) = { min { k ≥ 0 ∣ b i t ( y , k ) ≠ 0 } y > 0 L y = 0 {\displaystyle \rho (y)={\begin{cases}\min\{k\geq 0\mid \mathrm {bit} (y,k)\neq 0\}&y>0\\L&y=0\end{cases}}} Note that with the above definition we are using 0-indexing for the positions, starting from the least significant bit. For example, ρ ( 13 ) = ρ ( 1101 2 ) = 0 {\displaystyle \rho (13)=\rho (1101_{2})=0} , since the least significant bit is a 1 (0th position), and ρ ( 8 ) = ρ ( 1000 2 ) = 3 {\displaystyle \rho (8)=\rho (1000_{2})=3} , since the least significant set bit is at the 3rd position. At this point, note that under the assumption that the output of our hash function is uniformly distributed, then the probability of observing a hash output ending with 2 k {\displaystyle 2^{k}} (a one, followed by k {\displaystyle k} zeroes) is 2 − ( k + 1 ) {\displaystyle 2^{-(k+1)}} , since this corresponds to flipping k {\displaystyle k} heads and then a tail with a fair coin. Now the Flajolet–Martin algorithm for estimating the cardinality of a multiset M {\displaystyle M} is as follows: Initialize a bit-vector BITMAP to be of length L {\displaystyle L} and contain all 0s. For each element x {\displaystyle x} in M {\displaystyle M} : Calculate the index i = ρ ( h a s h ( x ) ) {\displaystyle i=\rho (\mathrm {hash} (x))} . Set B I T M A P [ i ] = 1 {\displaystyle \mathrm {BITMAP} [i]=1} . Let R {\displaystyle R} denote the smallest index i {\displaystyle i} such that B I T M A P [ i ] = 0 {\displaystyle \mathrm {BITMAP} [i]=0} . Estimate the cardinality of M {\displaystyle M} as 2 R / ϕ {\displaystyle 2^{R}/\phi } , where ϕ ≈ 0.77351 {\displaystyle \phi \approx 0.77351} . The idea is that if n {\displaystyle n} is the number of distinct elements in the multiset M {\displaystyle M} , then B I T M A P [ 0 ] {\displaystyle \mathrm {BITMAP} [0]} is accessed approximately n / 2 {\displaystyle n/2} times, B I T M A P [ 1 ] {\displaystyle \mathrm {BITMAP} [1]} is accessed approximately n / 4 {\displaystyle n/4} times and so on. Consequently, if i ≫ log 2 ⁡ n {\displaystyle i\gg \log _{2}n} , then B I T M A P [ i ] {\displaystyle \mathrm {BITMAP} [i]} is almost certainly 0, and if i ≪ log 2 ⁡ n {\displaystyle i\ll \log _{2}n} , then B I T M A P [ i ] {\displaystyle \mathrm {BITMAP} [i]} is almost certainly 1. If i ≈ log 2 ⁡ n {\displaystyle i\approx \log _{2}n} , then B I T M A P [ i ] {\displaystyle \mathrm {BITMAP} [i]} can be expected to be either 1 or 0. The correction factor ϕ ≈ 0.77351 {\displaystyle \phi \approx 0.77351} (OEIS: A244256) is found by calculations, which can be found in the original article. == Improving accuracy == A problem with the Flajolet–Martin algorithm in the above form is that the results vary significantly. A common solution has been to run the algorithm multiple times with k {\displaystyle k} different hash functions and combine the results from the different runs. One idea is to take the mean of the k {\displaystyle k} results together from each hash function, obtaining a single estimate of the cardinality. The problem with this is that averaging is very susceptible to outliers (which are likely here). A different idea is to use the median, which is less prone to be influences by outliers. The problem with this is that the results can only take form 2 R / ϕ {\displaystyle 2^{R}/\phi } , where R {\displaystyle R} is integer. A common solution is to combine both the mean and the median: Create k ⋅ l {\displaystyle k\cdot l} hash functions and split them into k {\displaystyle k} distinct groups (each of size l {\displaystyle l} ). Within each group use the mean for aggregating together the l {\displaystyle l} results, and finally take the median of the k {\displaystyle k} group estimates as the final estimate. The 2007 HyperLogLog algorithm splits the multiset into subsets and estimates their cardinalities, then it uses the harmonic mean to combine them into an estimate for the original cardinality.

    Read more →
  • Species distribution modelling

    Species distribution modelling

    Species distribution modelling (SDM), also known as environmental (or ecological) niche modelling (ENM), habitat suitability modelling, predictive habitat distribution modelling, and range mapping uses ecological models to predict the distribution of a species across geographic space and time using environmental data. The environmental data are most often climate data (e.g. temperature, precipitation), but can include other variables such as soil type, water depth, and land cover. SDMs are used in several research areas in conservation biology, ecology and evolution. These models can be used to understand how environmental conditions influence the occurrence or abundance of a species, and for predictive purposes (ecological forecasting). Predictions from an SDM may be of a species' future distribution under climate change, a species' past distribution in order to assess evolutionary relationships, or the potential future distribution of an invasive species. Predictions of current and/or future habitat suitability can be useful for management applications (e.g. reintroduction or translocation of vulnerable species, reserve placement in anticipation of climate change). There are two main types of SDMs. Correlative SDMs, also known as climate envelope models, bioclimatic models, or resource selection function models, model the observed distribution of a species as a function of environmental conditions. Mechanistic SDMs, also known as process-based models or biophysical models, use independently derived information about a species' physiology to develop a model of the environmental conditions under which the species can exist. The extent to which such modelled data reflect real-world species distributions will depend on a number of factors, including the nature, complexity, and accuracy of the models used and the quality of the available environmental data layers; the availability of sufficient and reliable species distribution data as model input; and the influence of various factors such as barriers to dispersal, geologic history, or biotic interactions, that increase the difference between the realized niche and the fundamental niche. Environmental niche modelling may be considered a part of the discipline of biodiversity informatics. == History == A. F. W. Schimper used geographical and environmental factors to explain plant distributions in his 1898 Pflanzengeographie auf physiologischer Grundlage (Plant Geography Upon a Physiological Basis) and his 1908 work of the same name. Andrew Murray used the environment to explain the distribution of mammals in his 1866 The Geographical Distribution of Mammals. Robert Whittaker's work with plants and Robert MacArthur's work with birds strongly established the role the environment plays in species distributions. Elgene O. Box constructed environmental envelope models to predict the range of tree species. His computer simulations were among the earliest uses of species distribution modelling. The adoption of more sophisticated generalised linear models (GLMs) made it possible to create more sophisticated and realistic species distribution models. The expansion of remote sensing and the development of GIS-based environmental modelling increase the amount of environmental information available for model-building and made it easier to use. == Correlative vs mechanistic models == === Correlative SDMs === SDMs originated as correlative models. Correlative SDMs model the observed distribution of a species as a function of geographically referenced climatic predictor variables using multiple regression approaches. Given a set of geographically referred observed presences of a species and a set of climate maps, a model defines the most likely environmental ranges within which a species lives. Correlative SDMs assume that species are at equilibrium with their environment and that the relevant environmental variables have been adequately sampled. The models allow for interpolation between a limited number of species occurrences. For these models to be effective, it is required to gather observations not only of species presences, but also of absences, that is, where the species does not live. Records of species absences are typically not as common as records of presences, thus often "random background" or "pseudo-absence" data are used to fit these models. If there are incomplete records of species occurrences, pseudo-absences can introduce bias. Since correlative SDMs are models of a species' observed distribution, they are models of the realized niche (the environments where a species is found), as opposed to the fundamental niche (the environments where a species can be found, or where the abiotic environment is appropriate for the survival). For a given species, the realized and fundamental niches might be the same, but if a species is geographically confined due to dispersal limitation or species interactions, the realized niche will be smaller than the fundamental niche. Correlative SDMs are easier and faster to implement than mechanistic SDMs, and can make ready use of available data. Since they are correlative however, they do not provide much information about causal mechanisms and are not good for extrapolation. They will also be inaccurate if the observed species range is not at equilibrium (e.g. if a species has been recently introduced and is actively expanding its range). In standard SDMs, the distribution of a single species is often modeled, with unique parameters describing how environmental (abiotic) factors influence its occurrence probability. This allows for differentiated responses to environmental drivers among species, but can be problematic for data-deficient species. In contrast, similarities in environmental responses can be accounted for in multi-species SDMs, which model several species jointly using shared or hierarchically related parameters. However, neither approach explicitly accounts for community-level biotic interactions, which can be important in explaining species diversity patterns. Joint species distribution models (joint SDMs or J-SDMs) address this by modeling species co-occurrence patterns directly. The occurrence probability of a given species is thus influenced not only by abiotic drivers but also by inferred biotic associations with other species. This can improve accuracy for rarer taxa and provide insights into community ecology. Both standard SDMs and J-SDMs can be used to generate community-level metrics, such as species richness, by aggregating outputs across multiple species. These can be important for decision-making such as conservation planning. === Mechanistic SDMs === Mechanistic SDMs are more recently developed. In contrast to correlative models, mechanistic SDMs use physiological information about a species (taken from controlled field or laboratory studies) to determine the range of environmental conditions within which the species can persist. These models aim to directly characterize the fundamental niche, and to project it onto the landscape. A simple model may simply identify threshold values outside of which a species can't survive. A more complex model may consist of several sub-models, e.g. micro-climate conditions given macro-climate conditions, body temperature given micro-climate conditions, fitness or other biological rates (e.g. survival, fecundity) given body temperature (thermal performance curves), resource or energy requirements, and population dynamics. Geographically referenced environmental data are used as model inputs. Because the species distribution predictions are independent of the species' known range, these models are especially useful for species whose range is actively shifting and not at equilibrium, such as invasive species. Mechanistic SDMs incorporate causal mechanisms and are better for extrapolation and non-equilibrium situations. However, they are more labor-intensive to create than correlational models and require the collection and validation of a lot of physiological data, which may not be readily available. The models require many assumptions and parameter estimates, and they can become very complicated. Dispersal, biotic interactions, and evolutionary processes present challenges, as they aren't usually incorporated into either correlative or mechanistic models. Correlational and mechanistic models can be used in combination to gain additional insights. For example, a mechanistic model could be used to identify areas that are clearly outside the species' fundamental niche, and these areas can be marked as absences or excluded from analysis. See for a comparison between mechanistic and correlative models. == Niche models (correlative) == There are a variety of mathematical methods that can be used for fitting, selecting, and evaluating correlative SDMs. Models include "profile" methods, which are simple statistical techniques that use e.g. environmental distance to known sites of occurrence such as

    Read more →
  • Artificial empathy

    Artificial empathy

    Artificial empathy or computational empathy is the development of AI systems—such as companion robots or virtual agents—that can detect emotions and respond to them in an empathic way. Although such technology can be perceived as scary or threatening, it could also have a significant advantage over humans for roles in which emotional expression can be important, such as in the health care sector. An October 2025 review and meta-analysis in the British Medical Bulletin found that AI chatbots were rated as showing more empathy than human healthcare professionals in 13 of 15 studies that compared them. Care-givers who perform emotional labor above and beyond the requirements of paid labor can experience chronic stress or burnout, and can become desensitized to patients. Artificial empathy could also help the socialization of care-givers, or serve as role model for emotional detachment. A broader definition of artificial empathy is "the ability of nonhuman models to predict a person's internal state (e.g., cognitive, affective, physical) given the signals (s)he emits (e.g., facial expression, voice, gesture) or to predict a person's reaction (including, but not limited to internal states) when he or she is exposed to a given set of stimuli (e.g., facial expression, voice, gesture, graphics, music, etc.)". A 2025 study reported that some multimodal large language models can recognize basic facial expressions with human-level accuracy on a commonly used research dataset of posed facial expressions. == Areas of research == There are a variety of philosophical, theoretical, and applicative questions related to artificial empathy. For example: Which conditions would have to be met for a robot to respond competently to a human emotion? What models of empathy can or should be applied to Social and Assistive Robotics? Must the interaction of humans with robots imitate affective interaction between humans? Can a robot help science learn about affective development of humans? Would robots create unforeseen categories of inauthentic relations? What relations with robots can be considered authentic? How can we assess artificial empathy in AI systems? == Examples of artificial empathy research and practice == People often communicate and make decisions based on inferences about each other's internal states (e.g., emotional, cognitive, and physical states) that are in turn based on signals emitted by the person such as facial expression, body gesture, voice, and words. Broadly speaking, artificial empathy focuses on developing non-human models that achieve similar objectives using similar data. === Streams of artificial empathy research === Artificial empathy has been applied in various research disciplines, including artificial intelligence and business. Two main streams of research in this domain are: the use of nonhuman models to predict a person's internal state (e.g., cognitive, affective, physical) given the signals he or she emits (e.g., facial expression, voice, gesture) the use of nonhuman models to predict a person's reaction when he or she is exposed to a given set of stimuli (e.g., facial expression, voice, gesture, graphics, music, etc.). Research on affective computing, such as emotional speech recognition and facial expression detection, falls within the first stream of artificial empathy. Contexts that have been studied include oral interviews, call centers, human-computer interaction, sales pitches, and financial reporting. The second stream of artificial empathy has been researched more in marketing contexts, such as advertising, branding, customer reviews, in-store recommendation systems, movies, and online dating. === Artificial empathy applications in practice === With the increasing volume of visual, audio, and text data in commerce, many business applications for artificial empathy have followed. For example, Affectiva analyses viewers' facial expressions from video recordings while they are watching video advertisements in order to optimize the content design of video ads. Software like HireVue, BarRaiser, a hiring intelligence firm, helps firms make recruitment decisions by analyzing audio and video information from candidates' video interviews. Lapetus Solutions develops a model to estimate an individual's longevity, health status, and disease susceptibility from a face photo. Their technology has been applied in the insurance industry. == Artificial empathy and human services == Although artificial intelligence cannot yet replace social workers themselves, the technology has been deployed in that field. Florida State University published a study about Artificial Intelligence being used in the human services field. The research used computer algorithms to analyze health records for combinations of risk factors that could predict a future suicide attempt. The article reports, "machine learning—a future frontier for artificial intelligence—can predict with 80% to 90% accuracy whether someone will attempt suicide as far off as two years into the future. The algorithms become even more accurate as a person's suicide attempt gets closer. For example, the accuracy climbs to 92% one week before a suicide attempt when artificial intelligence focuses on general hospital patients". Such algorithmic machines can help social workers. Social work operates on a cycle of engagement, assessment, intervention, and evaluation with clients. Earlier assessment for risk of suicide can lead to earlier interventions and prevention, therefore saving lives. The system would learn, analyze, and detect risk factors, alerting the clinician of a patient's suicide risk score (analogous to a patient's cardiovascular risk score). Then, social workers could step in for further assessment and preventive intervention.

    Read more →
  • Comparison of raster graphics editors

    Comparison of raster graphics editors

    Raster graphics editors can be compared by many variables, including availability. == List == == General information == Basic general information about the editor: creator, company, license, etc. == Operating system support == The operating systems on which the editors can run natively, that is, without emulation, virtual machines or compatibility layers. In other words, the software must be specifically coded for the operation system; for example, Adobe Photoshop for Windows running on Linux with Wine does not fit. == Features == == Color spaces == == File support ==

    Read more →
  • Exploratory search

    Exploratory search

    Exploratory search is a specialization of information exploration which represents the activities carried out by searchers who are: unfamiliar with the domain of their goal (i.e. need to learn about the topic in order to understand how to achieve their goal) or unsure about the ways to achieve their goals (either the technology or the process) or unsure about their goals in the first place. Exploratory search is distinguished from known-item search, for which the searcher has a particular target in mind. Consequently, exploratory search covers a broader class of activities than typical information retrieval, such as investigating, evaluating, comparing, and synthesizing, where new information is sought in a defined conceptual area; exploratory data analysis is another example of an information exploration activity. Typically, therefore, such users generally combine querying and browsing strategies to foster learning and investigation. == History == Exploratory search is a topic that has grown from the fields of information retrieval and information seeking but has become more concerned with alternatives to the kind of search that has received the majority of focus (returning the most relevant documents to a Google-like keyword search). The research is motivated by questions like "What if the user doesn't know which keywords to use?" or "What if the user isn't looking for a single answer?" Consequently, research has begun to focus on defining the broader set of information behaviors in order to learn about the situations when a user is, or feels, limited by only having the ability to perform a keyword search. In the last few years, a series of workshops has been held at various related and key events. In 2005, the Exploratory Search Interfaces workshop focused on beginning to define some of the key challenges in the field. Since then a series of other workshops has been held at related conferences: Evaluating Exploratory Search at SIGIR06 and Exploratory Search and HCI at CHI07 (in order to meet with the experts in human–computer interaction). In March 2008, an Information Processing and Management special issue focused particularly on the challenges of evaluating exploratory search, given the reduced assumptions that can be made about scenarios of use. In June 2008, the National Science Foundation sponsored an invitational workshop to identify a research agenda for exploratory search and similar fields for the coming years. == Research challenges == === Important scenarios === With the majority of research in the information retrieval community focusing on typical keyword search scenarios, one challenge for exploratory search is to further understand the scenarios of use for when keyword search is not sufficient. An example scenario, often used to motivate the research by mSpace, states: if a user does not know much about classical music, how should they even begin to find a piece that they might like. Similarly, for patients or their carers, if they don't know the right keywords for their health problems, how can they effectively find useful health information for themselves? === Designing new interfaces === With one of the motivations being to support users when keyword search is not enough, some research has focused on identifying alternative user interfaces and interaction models that support the user in different ways. An example is faceted search which presents diverse category-style options to the users, so that they can choose from a list instead of guess a possible keyword query. Many of the interactive forms of search, including faceted browsers, are being considered for their support of exploratory search conditions. Computational cognitive models of exploratory search have been developed to capture the cognitive complexities involved in exploratory search. Model-based dynamic presentation of information cues are proposed to facilitate exploratory search performance. === Evaluating interfaces === As the tasks and goals involved with exploratory search are largely undefined or unpredictable, it is very hard to evaluate systems with the measures often used in information retrieval. Accuracy was typically used to show that a user had found a correct answer, but when the user is trying to summarize a domain of information, the correct answer is near impossible to identify, if not entirely subjective (for example: possible hotels to stay in Paris). In exploration, it is also arguable that spending more time (where time efficiency is typically desirable) researching a topic shows that a system provides increased support for investigation. Finally, and perhaps most importantly, giving study participants a well specified task could immediately prevent them from exhibiting exploratory behavior. === Models of exploratory search behavior === There have been recent attempts to develop a process model of exploratory search behavior, especially in social information system (e.g., see models of collaborative tagging. The process model assumes that user-generated information cues, such as social tags, can act as navigational cues that facilitate exploration of information that others have found and shared with other users on a social information system (such as social bookmarking system). These models provided extension to existing process model of information search that characterizes information-seeking behavior in traditional fact-retrievals using search engines. Recent development in exploratory search is often concentrated in predicting users' search intents in interaction with the user. Such predictive user modeling, also referred as intent modeling, can help users to get accustomed to a body of domain knowledge and help users to make sense of the potential directions to be explored around their initial, often vague, expression of information needs. == Major figures == Key figures, including experts from both information seeking and human–computer interaction, are: Marcia Bates Nicholas Belkin Gary Marchionini m.c. schraefel Ryen W. White

    Read more →
  • The Citation Project

    The Citation Project

    The Citation Project is a series of studies that measure and analyze first-year college writing students' source use and their ability to understand and implement sources within their own writing. The Citation Project reveals students' source-use habits and the issues that can be seen based on their lack of proper citation skills, such as the prevalence of plagiarism, institution policies, and the results of current writing pedagogy. The Citation Project's central findings were first presented at the Conference on College Composition and Communication in 2012. Although The Citation Project originally referred to this single 2012 study, the feedback received led to the conception of the Project as a broader initiative and as a place to gather and publish studies and data relating to student writing habits for the usage of other researches. == Method == The Citation Project's data comes from the work of 20 researchers analyzing 174 first-year composition students' research papers. The student papers studied originated from 16 institutions across the United States of America, including community colleges, public and private universities, denominational colleges, and Ivy Leagues. Researchers used bibliographic coding to aggregate data regarding the type, length, reading level, and usage of students' sources. == Findings == === Student source assessment and use === This study found that students were capable of identifying, locating, and accessing librarian-approved academic sources, most commonly accessing them with the internet. Despite students demonstrating their ability to find appropriate sources, they tend to exclusively cite the first few pages of their sources. Students' use and analysis of their citations are often limited, frequently resorting to patchwriting, directly restating their source's points, and omitting their own interpretations of their reference's ideas. The Citation Project also highlights students' struggle to accurately determine, address, and value their sources' bias, authority, and credibility. According to the Project's researchers' analysis, these habits demonstrate that first-year college writing students minimally engage with their sources and the academic conversations between them. One researcher from the Citation Project, Rebecca Moore Howard, believes these findings do not point towards students being lazy, but is rather a result of a writing pedagogy that prioritizes efficient, product-focused writing. Another interpretation offered by Sandra Jamieson, another researcher from the Citation Project explains their findings as a result of a lack of adherence to Information Learning (IL) Standards. === Pedagogy === A significant focus of The Citation Project is the development of pedagogical practices intended to equip students with writing and research techniques that will set them up for future success. Writers associated with The Citation Project, such as Tricia Serviss, believe that the practices of teachers surrounding academic integrity and writing practices are what form the foundation of how students think about writing and how to engage with assignments throughout their academic career. They also stress the importance of teaching students to effectively engage with sources rather than simply how to correctly cite them. The Citation Project asserts that endowing students with the ability to read, understand, and synthesize a variety of sources in their writing is a skill that will benefit them throughout their academic careers, and that the surface level typographical focus that many writing programs utilize is inadequate. == Plagiarism == One of the areas that The Citation Project also looks at is how students commit plagiarism throughout their writing. Plagiarism tends to be a checkpoint that gives instructors a sense where students' citation skills stand. Findings from The Citation Project reveal that the most common type of plagiarism is patchwriting which is the act of using the same sentence with only changing a couple of words. These types of issues can be seen as a learning curve due to lack of proper training. Student's that commit plagiarism are often unaware. === Policies === Another issue found is that academic plagiarism policies may not benefit a student's growth but may instead obstruct it. Policies against plagiarism tend to be harsh on the student that committed of offense. Even though student plagiarism is often unintentional academic institutions see this behavior as intentional. Student may then face harsh consequences as a result from their lack of citation knowledge. Additionally, higher level institutions assume that new students already have the skill set to avoid plagiarism which may be an unrealistic expectation. == Legacy == === Inspired studies === ==== Parrott and Napier ==== In one study, "Critical Reading and Student Self-Selected Texts: Results of a Collaborative, Explicit Curricular Approach," Jill Parrot and Trenia Napier quoted the Citation Project's findings as evidence that current collegiate writing curriculums are an ineffective means of teaching students how to properly write academic research papers. The researchers accredited current writing pedagogy's lack of emphasis on teaching critical reading skills. Parrott and Napier tested their thesis by seeing if students would produce more academic writing if they partook in a writing course that taught critical reading. Their results mostly went against this hypothesis, finding students who received additional critical reading training only significantly improved in how they integrated their sources. ==== Kocatepe ==== In May Mehtap Kocatep's study, "Reconceptualising the notion of finding information: How undergraduate students construct information as they read-to-write in an academic writing class," Kocatep expresses that she believes current conversations around writing pedagogy, including the Citation Project, operate with the underlying misconception that information is an easily discoverable static entity and its retrieval is an objective, unbiased decision. Kocatepe instead offers the analysis of what students view as valuable information and if it is worth using is influenced by the socially constructed meanings available to writers at the moment. To further examine students' source engagement, Kocatepe did a study on how female university students from the United Arab Emirates find, retrieve, use, and value sources. Kocatepe's results mainly noted students' almost exclusive reliance on using Google to find sources, as well as how students' navigated mainly English-speaking academic conversations as non-native English speakers. ==== Dahlen, Nordstrom-Sanchez, and Graff ==== Dahlen, Nordstrom-Sanchez, and Graff built their study off The Citation Project research in order to explore the attitudes and practices of students in an undergraduate writing course. As the researchers acknowledge, data collected by the Citation Project was the subject of the bulk of their analysis. This study sought to examine undergraduate writing practices tied to source-usage and elucidate any relevant trends. Dahlen, Nordstrom-Sanchez and Graff found that undergraduate writing students were not engaging with outside sources properly. Key issues discussed include lack of engagement with broad source ideas (in favor of picking out quotes), lack of paraphrasing, and inability to link information between multiple sources. ==== Davis ==== Phillip M. Davis based much of the analysis in his study on data gathered by the Citation Project. This study aimed to examine the particular effects web-based research and study had on undergraduate's papers and the replicability of their bibliographies. Davis sought to see how the shift from physical in-person library based research to online, often at-home research changed the function and usability of the bibliography as a form of documenting source usage in a given work. The primary method of analysis involved examining students' bibliographies to see where they were finding information online and how these sources were accessed. A main issue Davis found was "persistency" of URLs used for online citations. He found that only 18% of URL-based citations continued to function (the others either no longer pointing to the correct document or ceasing to exist altogether) within 3 years of their usage by students, and more than half of claimed online citations could not be found in any form. He suggests that this result brings up questions about how web-based citations should be dealt with in a university context.

    Read more →