AI Data Trainer

AI Data Trainer — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Natural language understanding

    Natural language understanding

    Natural language understanding (NLU) or natural language interpretation (NLI) is a subset of natural language processing in artificial intelligence that deals with machine reading comprehension. NLU has been considered an AI-hard problem. There is considerable commercial interest in the field because of its application to automated reasoning, machine translation, question answering, news-gathering, text categorization, voice-activation, archiving, and large-scale content analysis. == History == The program STUDENT, written in 1964 by Daniel Bobrow for his PhD dissertation at MIT, is one of the earliest known attempts at NLU by a computer. Eight years after John McCarthy coined the term artificial intelligence, Bobrow's dissertation (titled Natural Language Input for a Computer Problem Solving System) showed how a computer could understand simple natural language input to solve algebra word problems. A year later, in 1965, Joseph Weizenbaum at MIT wrote ELIZA, an interactive program that carried on a dialogue in English on any topic, the most popular being psychotherapy. ELIZA worked by simple parsing and substitution of key words into canned phrases and Weizenbaum sidestepped the problem of giving the program a database of real-world knowledge or a rich lexicon. Yet ELIZA gained surprising popularity as a toy project and can be seen as a very early precursor to current commercial systems such as those used by Ask.com. In 1969, Roger Schank at Stanford University introduced the conceptual dependency theory for NLU. This model, partially influenced by the work of Sydney Lamb, was extensively used by Schank's students at Yale University, such as Robert Wilensky, Wendy Lehnert, and Janet Kolodner. In 1970, William A. Woods introduced the augmented transition network (ATN) to represent natural language input. Instead of phrase structure rules ATNs used an equivalent set of finite-state automata that were called recursively. ATNs and their more general format called "generalized ATNs" continued to be used for a number of years. In 1971, Terry Winograd finished writing SHRDLU for his PhD thesis at MIT. SHRDLU could understand simple English sentences in a restricted world of children's blocks to direct a robotic arm to move items. The successful demonstration of SHRDLU provided significant momentum for continued research in the field. Winograd continued to be a major influence in the field with the publication of his book Language as a Cognitive Process. At Stanford, Winograd would later advise Larry Page, who co-founded Google. In the 1970s and 1980s, the natural language processing group at SRI International continued research and development in the field. A number of commercial efforts based on the research were undertaken, e.g., in 1982 Gary Hendrix formed Symantec Corporation originally as a company for developing a natural language interface for database queries on personal computers. However, with the advent of mouse-driven graphical user interfaces, Symantec changed direction. A number of other commercial efforts were started around the same time, e.g., Larry R. Harris at the Artificial Intelligence Corporation and Roger Schank and his students at Cognitive Systems Corp. In 1983, Michael Dyer developed the BORIS system at Yale which bore similarities to the work of Roger Schank and W. G. Lehnert. The third millennium saw the introduction of systems using machine learning for text classification, such as the IBM Watson. However, experts debate how much "understanding" such systems demonstrate: e.g., according to John Searle, Watson did not even understand the questions. John Ball, cognitive scientist and inventor of the Patom Theory, supports this assessment. Natural language processing has made inroads for applications to support human productivity in service and e-commerce, but this has largely been made possible by narrowing the scope of the application. There are thousands of ways to request something in a human language that still defies conventional natural language processing. According to Wibe Wagemans, "To have a meaningful conversation with machines is only possible when we match every word to the correct meaning based on the meanings of the other words in the sentence – just like a 3-year-old does without guesswork." == Scope and context == The umbrella term "natural language understanding" can be applied to a diverse set of computer applications, ranging from small, relatively simple tasks such as short commands issued to robots, to highly complex endeavors such as the full comprehension of newspaper articles or poetry passages. Many real-world applications fall between the two extremes, for instance text classification for the automatic analysis of emails and their routing to a suitable department in a corporation does not require an in-depth understanding of the text, but needs to deal with a much larger vocabulary and more diverse syntax than the management of simple queries to database tables with fixed schemata. Throughout the years various attempts at processing natural language or English-like sentences presented to computers have taken place at varying degrees of complexity. Some attempts have not resulted in systems with deep understanding, but have helped overall system usability. For example, Wayne Ratliff originally developed the Vulcan program with an English-like syntax to mimic the English speaking computer in Star Trek. Vulcan later became the dBase system whose easy-to-use syntax effectively launched the personal computer database industry. Systems with an easy-to-use or English-like syntax are, however, quite distinct from systems that use a rich lexicon and include an internal representation (often as first order logic) of the semantics of natural language sentences. Hence the breadth and depth of "understanding" aimed at by a system determine both the complexity of the system (and the implied challenges) and the types of applications it can deal with. The "breadth" of a system is measured by the sizes of its vocabulary and grammar. The "depth" is measured by the degree to which its understanding approximates that of a fluent native speaker. At the narrowest and shallowest, English-like command interpreters require minimal complexity, but have a small range of applications. Narrow but deep systems explore and model mechanisms of understanding, but they still have limited application. Systems that attempt to understand the contents of a document such as a news release beyond simple keyword matching and to judge its suitability for a user are broader and require significant complexity, but they are still somewhat shallow. Systems that are both very broad and very deep are beyond the current state of the art. == Components and architecture == Regardless of the approach used, most NLU systems share some common components. The system needs a lexicon of the language and a parser and grammar rules to break sentences into an internal representation. The construction of a rich lexicon with a suitable ontology requires significant effort, e.g., the Wordnet lexicon required many person-years of effort. The system also needs theory from semantics to guide the comprehension. The interpretation capabilities of a language-understanding system depend on the semantic theory it uses. Competing semantic theories of language have specific trade-offs in their suitability as the basis of computer-automated semantic interpretation. These range from naive semantics or stochastic semantic analysis to the use of pragmatics to derive meaning from context. Semantic parsers convert natural-language texts into formal meaning representations. Advanced applications of NLU also attempt to incorporate logical inference within their framework. This is generally achieved by mapping the derived meaning into a set of assertions in predicate logic, then using logical deduction to arrive at conclusions. Therefore, systems based on functional languages such as Lisp need to include a subsystem to represent logical assertions, while logic-oriented systems such as those using the language Prolog generally rely on an extension of the built-in logical representation framework. The management of context in NLU can present special challenges. A large variety of examples and counter examples have resulted in multiple approaches to the formal modeling of context, each with specific strengths and weaknesses.

    Read more →
  • Heng Ji

    Heng Ji

    Heng Ji is a computer scientist who works on information extraction and natural language processing. She is well known for her work on joined named entity recognition and relation extraction, as well as for her work on cross-document event extraction. She has been coordinating the popular NIST TAC Knowledge Base Population task since 2010. She has been recognised as one of AI's 10 to watch by IEEE Intelligent Systems in 2013, and has won multiple awards, including a NSF Career Award in 2009, Google Research awards in 2009 and 2014, and an IBM Watson Faculty Award in 2012. == Education == Heng Ji obtained a Bachelor's and master's degree in Computational Linguistics from Tsinghua University. She subsequently obtained a MSc, then PhD in Computer Science from New York University in 2008 under the supervision of Ralph Grishman. Her PhD thesis was on the topic of information extraction, with a particular focus on joint training of multiple components in the information extraction pipeline, as well as cross-lingual learning. == Career == Upon graduating with a PhD from New York University, Ji took up a position as assistant professor at Queens College, City University of New York, where she founded the BLENDER Lab, which focuses on research on cross-lingual, cross-documents, cross-media information extraction and fusion. In 2013, she joined Rensselaer Polytechnic Institute as an Edward P. Hamilton Development Chair and Tenured associate professor in Computer Science. Since 2019, she has been a full professor at the University of Illinois at Urbana–Champaign, as well as an Amazon Scholar. == Research == Heng Ji works in the area of natural language processing, machine learning and information extraction. She has published over 300 peer-reviewed research papers. Her work is published in the proceedings of computer science conferences, including the Annual Meeting of the Association for Computational Linguistics, The Web Conference, and the ACM Conference on Knowledge Discovery and Data Mining (KDD). Ji is a leading researcher in information extraction, having coordinated the popular NIST TAC Knowledge Base Population shared task since 2010. She is most recognised for her work on modelling interactions between subtasks in information extraction, which was also the topic of her PhD thesis, and for her work on event detection using cross-document signals. == Selected honors and distinctions == 2009 NSF Career Award 2009 Google Research Award 2012 IBM Watson Faculty Award 2013 IEEE AI's 10 to Watch 2014 Google Research Award 2016 World Economic Forum, 'Young Scientist' 2017 World Economic Forum, 'Young Scientist' 2020 Annual Meeting of the Association for Computational Linguistics, best demonstration paper

    Read more →
  • AI Background Removers: Free vs Paid (2026)

    AI Background Removers: Free vs Paid (2026)

    Looking for the best AI background remover? An AI background remover is software that uses machine learning to help you get more done — it can save you hours every week by automating repetitive work. Most options offer a generous free tier, with paid plans unlocking higher limits, faster processing, and team features. Whether you are a beginner or a pro, the right AI background remover slots into your workflow and pays for itself fast. Read on for hands-on impressions, pricing tiers, and the standout features that matter.

    Read more →
  • AI Subtitle Generators Reviews: What Actually Works in 2026

    AI Subtitle Generators Reviews: What Actually Works in 2026

    Trying to pick the best AI subtitle generator? An AI subtitle generator is software that uses machine learning to help you get more done — it scales effortlessly from a single task to thousands. The best picks balance beginner-friendly simplicity with the depth power users need, and they ship updates often. Whether you are a beginner or a pro, the right AI subtitle generator slots into your workflow and pays for itself fast. Read on for hands-on impressions, pricing tiers, and the standout features that matter.

    Read more →
  • VK Video

    VK Video

    VK Video is an internet video hosting service launched by VK (formerly known as Mail.ru Group) in 2021. It is positioned as a Russian alternative to the international platform YouTube. == History == The "VK Video" service began operations on October 15, 2021, following the merger of video platforms belonging to the social networks "VKontakte" and "Odnoklassniki". The launch of "VK Video" was managed by a team of executives led by VKontakte CEO Marina Krasnova, who worked at the company until 2023. Its launch was intended as an alternative to the international platform YouTube, which Russian authorities sought to replace with "domestic analogs. Key differences of the Russian service became the presence of pirated materials. Videos from the American video hosting site were uploaded en masse to "VK Video," which even caused the service to be temporarily blocked by YouTube. From 2022, to attract users, VKontakte's management bet on working with famous bloggers, specifically purchasing the shows "What Happened Next?" (ChBD) and "Vnutri Lapenko". Among the bloggers recruited to promote the service was the popular video blogger Vlad A4. An additional advantage for creators was the availability of monetization, which had been unavailable on YouTube for users from the Russian Federation since 2022. In September 2023, a separate "VK Video" mobile app appeared. In total, by the end of 2023, the monthly audience of "VK Video" reached 67.9 million users (which is almost 30 million less than YouTube). In the summer of 2024, following the blocking of YouTube in Russia, the service's traffic grew sharply: in August, its audience increased by more than two times compared to July. In the same month, "VK Video" took second place in downloads among free apps in the App Store and third in Google Play. In December 2024, the service received its own domain: vkvideo.ru. For the first time, "VK Video" managed to surpass YouTube in monthly audience in Russia in July 2025: the Russian service attracted 76.4 million viewers, whereas YouTube's reach amounted to 74.9 million people. == Platform features == On "VK Video," a view is recorded from the first second, whereas on YouTube it is only from the thirtieth. At the same time, a significant portion of comments are left by bots. For videos from the platform's most popular bloggers, the engagement level (likes to views) does not reach 4%. The "Trends" section most often features videos from large channels where the ratio of likes to views does not exceed 2%. == Management == In April 2025, the post of General Director of "VK Video" was taken by Marianna Maksimovskaya. From June 2022 to July 2024, the development of the platform was led by Fyodor Yezhov, who was primarily responsible for its technical direction. == Awards == In 2023, VK Video was awarded the Runet Prize in the "Science, Technology and Innovation" category.

    Read more →
  • Is an AI Subtitle Generator Worth It in 2026?

    Is an AI Subtitle Generator Worth It in 2026?

    Comparing the best AI subtitle generator? An AI subtitle generator is software that uses machine learning to help you get more done — it lowers the barrier so anyone can produce professional output. Privacy matters too: check whether your data trains the model and whether a no-log or enterprise tier is available. Whether you are a beginner or a pro, the right AI subtitle generator slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.

    Read more →
  • Nando de Freitas

    Nando de Freitas

    Nando de Freitas is a researcher in the field of machine learning, and in particular in the subfields of neural networks, Bayesian inference and Bayesian optimization, and deep learning. == Biography == De Freitas was born in Zimbabwe. He did his undergraduate studies (1991–94) and MSc (1994–96) at the University of the Witwatersrand, and his PhD at Trinity College, Cambridge (1996-2000). From 2001, he was a professor at the University of British Columbia, before joining the Department of Computer Science at the University of Oxford from 2013 to 2017. In 2014, he joined Google's DeepMind when the company acquired Oxford spinoff Dark Blue Labs. He was in charge of the team that worked on creating tools for generating audio and images at DeepMind. In September 2024, de Freitas joined Microsoft AI as VP of AI. == Awards and recognition == De Freitas has been recognised for his contributions to machine learning through the following awards: Best Paper Award at the International Conference on Machine Learning (2016) Best Paper Award at the International Conference on Learning Representations (2016) Google Faculty Research Award (2014) Distinguished Paper Award at the International Joint Conference on Artificial Intelligence (2013) Charles A. McDowell Award for Excellence in Research (2012) Mathematics of Information Technology and Complex Systems Young Researcher Award (2010)

    Read more →
  • NovelAI

    NovelAI

    NovelAI is an online cloud-based, SaaS model, and a paid subscription service for AI-assisted storywriting and text-to-image synthesis, originally launched in beta on June 15, 2021, with the image generation feature being implemented later on October 3, 2022. NovelAI is owned and operated by Anlatan, which is headquartered in Wilmington, Delaware. == Features == NovelAI uses GPT-based large language models (LLMs) to generate storywriting and prose. It has several models, such as Calliope, Sigurd, Euterpe, Krake, and Genji, with Genji being a Japanese-language model. The service also offers encrypted servers and customizable editors. For AI art generation, which generates images from text prompts, NovelAI uses a custom version of the source-available Stable Diffusion text-to-image diffusion model called NovelAI Diffusion, which is trained on a Danbooru-based dataset. NovelAI is also capable of generating a new image based on an existing image. The NovelAI terms of service states that all generated content belongs to the user, regardless if the user is an individual or a corporation. Anlatan states that generated images are not stored locally on their servers. == History == On April 28, 2021, Anlatan officially launched NovelAI. On June 15, 2021, Anlatan released their finetuned GPT-Neo-2.7B model from EleutherAI named Calliope, after the Greek Muses. A day later, they released their Opus-exclusive GPT-J-6B finetuned model named Sigurd, after the Norse/Germanic hero. On March 21, 2023, Nvidia and CoreWeave announced Anlatan being one of the first CoreWeave customers to deploy NVIDIA's H100 Tensor Core GPUs for new LLM model inferencing and training. On April 1, 2023, Anlatan added ControlNet features to their text-to-image NovelAI Diffusion model. On May 16, 2023, Anlatan announced that they named their H100 cluster Shoggy, a reference to H.P. Lovecraft's Shoggoths, which was used to pre-train an undisclosed 8192 token context LLM in-house model. == Reception and controversy == Following the implementation of image generation, NovelAI became a widely-discussed topic in Japan, with some online commentators noting that its image synthesis features are very adept at producing close impressions of anime characters, including lolicon and shotacon imagery, while others have expressed concern that it is a paid service reliant on a diffusion model, while the original machine learning training data consists of images used without the consent of the original artists. Attorney Kosuke Terauchi notes that, since a revision of the law in 2018, it is no longer illegal in Japan for machine learning models to scrape copyrighted content from the internet to use as training data; meanwhile, in the United States where NovelAI is based, there is no specific legal framework which regulates machine learning, and thus the fair use doctrine of US copyright law applies instead. Danbooru has posted an official statement in regards to NovelAI's use of the site's content for AI training, expressing that Danbooru is not affiliated with NovelAI, and does not endorse nor condone NovelAI's use of artists' artworks for machine learning. FayerWayer described NovelAI as a service capable of generating hentai. Manga artist Izumi Ū commented that while the manga style art generated by NovelAI is highly accurate, there are still imperfections in the output, although he views these as human-like in a favourable light nonetheless. In response to the topic of NovelAI, Narugami, founder of the Japanese freelance artist commissioning website Skeb, stated on October 5, 2022 that the use of AI image generation is prohibited on the platform since 2018. Illustrations using NovelAI have been posted on social media and illustration posting sites, and by October 13, 2,111 works tagged with #NovelAI were posted on Pixiv. Pixiv has stated that it is not considering a complete elimination of creations that use AI, though it requires AI-generated posts to be marked as such and allows users to filter them out. == Incidents == On October 6, 2022, NovelAI experienced a data breach where its software's source code was leaked.

    Read more →
  • Automated essay scoring

    Automated essay scoring

    Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a form of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades, for example, the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification. Several factors have contributed to a growing interest in AES. Among them are cost, accountability, standards, and technology. Rising education costs have led to pressure to hold the educational system accountable for results by imposing standards. The advance of information technology promises to measure educational achievement at reduced cost. The use of AES for high-stakes testing in education has generated significant backlash, with opponents pointing to research that computers cannot yet grade writing accurately and arguing that their use for such purposes promotes teaching writing in reductive ways (i.e. teaching to the test). == History == Most historical summaries of AES trace the origins of the field to the work of Ellis Batten Page. In 1966, he argued for the possibility of scoring essays by computer, and in 1968 he published his successful work with a program called Project Essay Grade (PEG). Using the technology of that time, computerized essay scoring would not have been cost-effective, so Page abated his efforts for about two decades. Eventually, Page sold PEG to Measurement Incorporated. By 1990, desktop computers had become so powerful and so widespread that AES was a practical possibility. As early as 1982, a UNIX program called Writer's Workbench was able to offer punctuation, spelling and grammar advice. In collaboration with several companies (notably Educational Testing Service), Page updated PEG and ran some successful trials in the early 1990s. Peter Foltz and Thomas Landauer developed a system using a scoring engine called the Intelligent Essay Assessor (IEA). IEA was first used to score essays in 1997 for their undergraduate courses. It is now a product from Pearson Educational Technologies and used for scoring within a number of commercial products and state and national exams. IntelliMetric is Vantage Learning's AES engine. Its development began in 1996. It was first used commercially to score essays in 1998. Educational Testing Service offers "e-rater", an automated essay scoring program. It was first used commercially in February 1999. Jill Burstein was the team leader in its development. ETS's Criterion Online Writing Evaluation Service uses the e-rater engine to provide both scores and targeted feedback. Lawrence Rudner has done some work with Bayesian scoring, and developed a system called BETSY (Bayesian Essay Test Scoring sYstem). Some of his results have been published in print or online, but no commercial system incorporates BETSY as yet. Under the leadership of Howard Mitzel and Sue Lottridge, Pacific Metrics developed a constructed response automated scoring engine, CRASE. Currently utilized by several state departments of education and in a U.S. Department of Education-funded Enhanced Assessment Grant, Pacific Metrics’ technology has been used in large-scale formative and summative assessment environments since 2007. Measurement Inc. acquired the rights to PEG in 2002 and has continued to develop it. In 2012, the Hewlett Foundation sponsored a competition on Kaggle called the Automated Student Assessment Prize (ASAP). 201 challenge participants attempted to predict, using AES, the scores that human raters would give to thousands of essays written to eight different prompts. The intent was to demonstrate that AES can be as reliable as human raters, or more so. The competition also hosted a separate demonstration among nine AES vendors on a subset of the ASAP data. Although the investigators reported that the automated essay scoring was as reliable as human scoring, this claim was not substantiated by any statistical tests because some of the vendors required that no such tests be performed as a precondition for their participation. Moreover, the claim that the Hewlett Study demonstrated that AES can be as reliable as human raters has since been strongly contested, including by Randy E. Bennett, the Norman O. Frederiksen Chair in Assessment Innovation at the Educational Testing Service. Some of the major criticisms of the study have been that five of the eight datasets consisted of paragraphs rather than essays, four of the eight data sets were graded by human readers for content only rather than for writing ability, and that rather than measuring human readers and the AES machines against the "true score", the average of the two readers' scores, the study employed an artificial construct, the "resolved score", which in four datasets consisted of the higher of the two human scores if there was a disagreement. This last practice, in particular, gave the machines an unfair advantage by allowing them to round up for these datasets. In 1966, Page hypothesized that, in the future, the computer-based judge will be better correlated with each human judge than the other human judges are. Despite criticizing the applicability of this approach to essay marking in general, this hypothesis was supported for marking free text answers to short questions, such as those typical of the British GCSE system. Results of supervised learning demonstrate that the automatic systems perform well when marking by different human teachers is in good agreement. Unsupervised clustering of answers showed that excellent papers and weak papers formed well-defined clusters, and the automated marking rule for these clusters worked well, whereas marks given by human teachers for the third cluster ('mixed') can be controversial, and the reliability of any assessment of works from the 'mixed' cluster can often be questioned (both human and computer-based). == Different dimensions of essay quality == According to a recent survey, modern AES systems try to score different dimensions of an essay's quality in order to provide feedback to users. These dimensions include the following items: Grammaticality: following grammar rules Usage: using of prepositions, word usage Mechanics: following rules for spelling, punctuation, capitalization Style: word choice, sentence structure variety Relevance: how relevant of the content to the prompt Organization: how well the essay is structured Development: development of ideas with examples Cohesion: appropriate use of transition phrases Coherence: appropriate transitions between ideas Thesis Clarity: clarity of the thesis Persuasiveness: convincingness of the major argument == Procedure == From the beginning, the basic procedure for AES has been to start with a training set of essays that have been carefully hand-scored. The program evaluates surface features of the text of each essay, such as the total number of words, the number of subordinate clauses, or the ratio of uppercase to lowercase letters—quantities that can be measured without any human insight. It then constructs a mathematical model that relates these quantities to the scores that the essays received. The same model is then applied to calculate scores of new essays. Recently, one such mathematical model was created by Isaac Persing and Vincent Ng. which not only evaluates essays on the above features, but also on their argument strength. It evaluates various features of the essay, such as the agreement level of the author and reasons for the same, adherence to the prompt's topic, locations of argument components (major claim, claim, premise), errors in the arguments, cohesion in the arguments among various other features. In contrast to the other models mentioned above, this model is closer in duplicating human insight while grading essays. Due to the growing popularity of deep neural networks, deep learning approaches have been adopted for automated essay scoring, generally obtaining superior results, often surpassing inter-human agreement levels. The various AES programs differ in what specific surface features they measure, how many essays are required in the training set, and most significantly in the mathematical modeling technique. Early attempts used linear regression. Modern systems may use linear regression or other machine learning techniques often in combination with other statistical techniques such as latent semantic analysis and Bayesian inference. The automated essay scoring task has also been studied in the cross-domain setting using machine learning models, where the models are trained on essays written for one prompt (topic) and tested on essays written for another prompt. Successful approaches in the cross-domain scenario are based on deep neural networks or models that combine deep and shallow features. == Criteria for success == Any method of a

    Read more →
  • Corpus linguistics

    Corpus linguistics

    Corpus linguistics is an empirical method for the study of language by text corpus (plural corpora). Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference. Large collections of text, though corpora may also be small in terms of running words, allow linguists to run quantitative analyses on linguistic concepts that may be difficult to test in a qualitative manner. The text-corpus method uses the body of texts in any natural language to derive the set of abstract rules which govern that language. Those results can be used to explore the relationships between that subject language and other languages which have undergone a similar analysis. The first such corpora were manually derived from source texts, but now that work is automated. Corpora have not only been used for linguistics research, they have been increasingly used to compile dictionaries (starting with The American Heritage Dictionary of the English Language in 1969) and reference grammars, with A Comprehensive Grammar of the English Language, published in 1985, as a first. Experts in the field have differing views about the annotation of a corpus. These views range from John McHardy Sinclair, who advocates minimal annotation so texts speak for themselves, to the Survey of English Usage team (University College, London), who advocate annotation as allowing greater linguistic understanding through rigorous recording. == History == Some of the earliest efforts at grammatical description were based at least in part on corpora of particular religious or cultural significance. For example, Prātiśākhya literature described the sound patterns of Sanskrit as found in the Vedas, and Pāṇini's grammar of classical Sanskrit was based at least in part on analysis of that same corpus. Similarly, the early Arabic grammarians paid particular attention to the language of the Quran. In the Western European tradition, scholars prepared concordances to allow detailed study of the language of the Bible and other canonical texts. === English corpora === A landmark in modern corpus linguistics was the publication of Computational Analysis of Present-Day American English in 1967. Written by Henry Kučera and W. Nelson Francis, the work was based on an analysis of the Brown Corpus, which is a structured and balanced corpus of one million words of American English from the year 1961. The corpus comprises 2000 text samples, from a variety of genres. The Brown Corpus was the first computerized corpus designed for linguistic research. Kučera and Francis subjected the Brown Corpus to a variety of computational analyses and then combined elements of linguistics, language teaching, psychology, statistics, and sociology to create a rich and variegated opus. A further key publication was Randolph Quirk's "Towards a description of English Usage" in 1960 in which he introduced the Survey of English Usage. Quirk's corpus was the first modern corpus to be built with the purpose of representing the whole language. Shortly thereafter, Boston publisher Houghton-Mifflin approached Kučera to supply a million-word, three-line citation base for its new American Heritage Dictionary, the first dictionary compiled using corpus linguistics. The AHD took the innovative step of combining prescriptive elements (how language should be used) with descriptive information (how it actually is used). Other publishers followed suit. The British publisher Collins' COBUILD monolingual learner's dictionary, designed for users learning English as a foreign language, was compiled using the Bank of English. The Survey of English Usage Corpus was used in the development of one of the most important Corpus-based Grammars, which was written by Quirk et al. and published in 1985 as A Comprehensive Grammar of the English Language. The Brown Corpus has also spawned a number of similarly structured corpora: the LOB Corpus (1960s British English), Kolhapur (Indian English), Wellington (New Zealand English), Australian Corpus of English (Australian English), the Frown Corpus (early 1990s American English), and the FLOB Corpus (1990s British English). Other corpora represent many languages, varieties and modes, and include the International Corpus of English, and the British National Corpus, a 100 million word collection of a range of spoken and written texts, created in the 1990s by a consortium of publishers, universities (Oxford and Lancaster) and the British Library. For contemporary American English, work has stalled on the American National Corpus, but the 400+ million word Corpus of Contemporary American English (1990–present) is now available through a web interface. The first computerized corpus of transcribed spoken language was constructed in 1971 by the Montreal French Project, containing one million words, which inspired Shana Poplack's much larger corpus of spoken French in the Ottawa-Hull area. === Multilingual corpora === In the 1990s, many of the notable early successes on statistical methods in natural-language programming (NLP) occurred in the field of machine translation, due especially to work at IBM Research. These systems were able to take advantage of existing multilingual textual corpora that had been produced by the Parliament of Canada and the European Union as a result of laws calling for the translation of all governmental proceedings into all official languages of the corresponding systems of government. There are corpora in non-European languages as well. For example, the National Institute for Japanese Language and Linguistics in Japan has built a number of corpora of spoken and written Japanese. Sign language corpora have also been created using video data. === Ancient languages corpora === Besides these corpora of living languages, computerized corpora have also been made of collections of texts in ancient languages. An example is the Andersen-Forbes database of the Hebrew Bible, developed since the 1970s, in which every clause is parsed using graphs representing up to seven levels of syntax, and every segment tagged with seven fields of information. The Quranic Arabic Corpus is an annotated corpus for the Classical Arabic language of the Quran. This is a recent project with multiple layers of annotation including morphological segmentation, part-of-speech tagging, and syntactic analysis using dependency grammar. The Digital Corpus of Sanskrit (DCS) is a "Sandhi-split corpus of Sanskrit texts with full morphological and lexical analysis... designed for text-historical research in Sanskrit linguistics and philology." === Corpora from specific fields === Besides pure linguistic inquiry, researchers had begun to apply corpus linguistics to other academic and professional fields, such as the emerging sub-discipline of Law and Corpus Linguistics, which seeks to understand legal texts using corpus data and tools. The DBLP Discovery Dataset concentrates on computer science, containing relevant computer science publications with sentient metadata such as author affiliations, citations, or study fields. A more focused dataset was introduced by NLP Scholar, a combination of papers of the ACL Anthology and Google Scholar metadata. Corpora can also aid in translation efforts or in teaching foreign languages. == Methods == Corpus linguistics has generated a number of research methods, which attempt to trace a path from data to theory. Wallis and Nelson (2001) first introduced what they called the 3A perspective: Annotation, Abstraction and Analysis. Annotation consists of the application of a scheme to texts. Annotations may include structural markup, part-of-speech tagging, parsing, and numerous other representations. Abstraction consists of the translation (mapping) of terms in the scheme to terms in a theoretically motivated model or dataset. Abstraction typically includes linguist-directed search but may include e.g., rule-learning for parsers. Analysis consists of statistically probing, manipulating and generalising from the dataset. Analysis might include statistical evaluations, optimisation of rule-bases or knowledge discovery methods. Most lexical corpora today are part-of-speech-tagged (POS-tagged). However even corpus linguists who work with 'unannotated plain text' inevitably apply some method to isolate salient terms. In such situations annotation and abstraction are combined in a lexical search. The advantage of publishing an annotated corpus is that other users can then perform experiments on the corpus (through corpus managers). Linguists with other interests and differing perspectives than the originators' can exploit this work. By sharing data

    Read more →
  • Wolfgang Ketter

    Wolfgang Ketter

    Wolfgang Ketter (born Traben-Trarbach, Germany, 1972) is Chaired Professor of Information Systems for a Sustainable Society at the University of Cologne. and a prominent scientist in the application of artificial intelligence, machine learning and intelligent agents in the design of smart markets, including demand response mechanisms and in particular automated auctions. He is a co-founder of the open energy system platform Power TAC, an automated retail electricity trading platform that simulates the performance of retail markets in an increasingly prosumer- and renewable-energy-influenced electricity landscape. == Career == === Advisory roles === Ketter is an advisor on the energy transition to the German government, in particular, the energy-intensive German state of North Rhine-Westphalia. He is also a fellow of the World Economic Forum and member of the WEF Global Council on Future Mobility and the Global New Mobility Coalition, contributing on the use of AI and machine learning to address issues arising from growth in electrification of energy such as the use of batteries as virtual power plants, the management of electric vehicle charging to prevent grid congestion, or the potential for peer-to-peer electricity trading. Ketter has also been an advisor for over a decade to the Port of Rotterdam on the design of energy cooperatives and energy trading platforms as well as one of the largest auction companies in the world, Royal FloraHolland, where his initial research led to a redesign of auction mechanisms and decision support systems. The cumulative research project team received the Association for Information Systems Impact Award in 2020 === Research === Ketter’s research is multidisciplinary, addressing the overlap of AI and ML in the economics of retail energy and mobility markets. The industry and policy applications of his research interconnect in large-scale projects such as the EU Smart city development project Ruggedised, for which the Erasmus University-based team's publication on the optimization of the City of Rotterdam's electric transit bus network was recognized with the Institute for Operations Research and the Management Sciences Daniel H. Wagner runner-up award. His research focuses on the use of competitive benchmarking and intelligent agents in virtual world simulations of retail energy markets as part of a smart grid. A small-scale version of the Power TAC project led to a publication on demand side management, 'A simulation of household behavior under variable prices' that has several hundred citations in publications representing a variety of scientific disciplines. Two of his publications in the Management Information Systems Quarterly journal and one in Energy Economics form the foundation for the current Power TAC platform. In 2016 and 2019 he was Chair of the Workshop on Information Technologies and Systems. Ketter is Coordinator of the Key Research Initiative Sustainable Smart Energy & Mobility at the University of Cologne, where he is a chaired Professor of Information Systems for a Sustainable Society. At the Rotterdam School of Management, Erasmus University, he is Professor of Next Generation Information Systems as well as Director of the Erasmus Centre for Future Energy Business and Academic Director of Smart Cities and Smart Energy at the Erasmus Centre of Data Analytics. He has been a visiting professor at the Haas School of Business and Berkeley Institute of Data Science, University of California at Berkeley in 2016 to 2017.

    Read more →
  • Top 10 AI Avatar Generators Compared (2026)

    Top 10 AI Avatar Generators Compared (2026)

    In search of the best AI avatar generator? An AI avatar generator is software that uses machine learning to help you get more done — it turns a rough idea into a polished result in seconds. When choosing one, weigh output quality, pricing, export formats, and how well it fits the tools you already use. Whether you are a beginner or a pro, the right AI avatar generator slots into your workflow and pays for itself fast. Below we compare features, pricing, and real output so you can choose with confidence.

    Read more →
  • Nice (app)

    Nice (app)

    Nice is a photo-sharing mobile app developed by Nice App Mobile Technology Co., Ltd. (Chinese: 北京极赞科技有限公司) in China. The app allows users to tag specific locations on images, enabling detailed labeling of items such as clothing and accessories. The company received a $36 million investment in C-round funding in 2014. Nice had 30 million registered users and 12 million active users as of late 2015. As of January 2024, it remained a popular app, the 6th most-downloaded in the iOS App Store for China. == Official website == Official website

    Read more →
  • StarDict

    StarDict

    StarDict, developed by Hu Zheng (胡正), is a free GUI released under the GPL-3.0-or-later license for accessing StarDict dictionary files (a dictionary shell). It is the successor of StarDic, developed by Ma Su'an (馬蘇安), continuing its version numbers. According to StarDict's earlier homepage on SourceForge, the project has been removed from SourceForge due to copyright infringement reports. It moved to Google Code and then back to SourceForge, while development is now seemingly continued on GitHub. == Supported platforms == StarDict runs under Linux, Windows, FreeBSD, Maemo and Solaris. Dictionaries of the user's choice are installed separately. Dictionary files can be created by converting dict files. Several programs compatible with the StarDict dictionary format are available for different platforms. For the iPhone, iPod Touch and iPad, applications available in the App Store include GuruDic, TouchDict, weDict, Dictionary Universal, Alpus and others, as well as the free iStarDict, which is available for the Cydia Store. == Dictionaries available == One can find here the partial list of FreeDict dictionaries which can be converted to the StarDict format. These include, in particular, some older versions of Webster's dictionary and many dictionaries for various languages. == Features == While StarDict is in scan mode, results are displayed in a tooltip, allowing easy dictionary lookup. When combined with Freedict, StarDict will quickly provide rough translations of foreign language websites. On September 25, 2006, an online version of Stardict began operation. This online version includes access to all the major dictionaries of StarDict, as well as Wikipedia in Chinese. Previous versions of StarDict were very similar to the PowerWord dictionary program, which is developed by a Chinese company, KingSoft. Since version 2.4.2, however, StarDict has diverged from the design of PowerWord by increasing its search capabilities and adding lexicons in a variety of languages. This was assisted by the collaboration of many developers with the author. == sdcv == Evgeniy A. Dushistov produced a command line version of StarDict called sdcv. It employed all the dictionary files that belong to StarDict. It is written in C++ and licensed under the terms of the GNU General Public License. sdcv runs under Linux, FreeBSD, and Solaris. As in StarDict, dictionaries of the user's choice have to be installed separately. At the end of 2006, software developer Hu Zheng cited personal financial problems as an excuse to charge users for downloading dictionary files from his website, which temporarily aroused strong doubts and dissatisfaction in the Linux community. In the end, under the pressure of public opinion, the charging plan was forced to be canceled and ended hastily.

    Read more →
  • AI Background Removers: Free vs Paid (2026)

    AI Background Removers: Free vs Paid (2026)

    Looking for the best AI background remover? An AI background remover is software that uses machine learning to help you get more done — it can save you hours every week by automating repetitive work. Most options offer a generous free tier, with paid plans unlocking higher limits, faster processing, and team features. Whether you are a beginner or a pro, the right AI background remover slots into your workflow and pays for itself fast. Read on for hands-on impressions, pricing tiers, and the standout features that matter.

    Read more →