Digital heritage

Digital heritage

The Charter on the Preservation of Digital Heritage of UNESCO defines digital heritage as embracing "cultural, educational, scientific and administrative resources, as well as technical, legal, medical and other kinds of information created digitally, or converted into digital form from existing analogue resources". Digital heritage also includes the use of digital media in the service of understanding and preserving cultural or natural heritage. The digitization of both cultural heritage and Natural heritage serves to enable the permanent access of current and future generations to culturally important objects ranging from literature and paintings to flora, fauna, or habitats. It is also used in the preservation and access of objects with enduring or significant historical, scientific, or cultural value including buildings, archeological sites, and natural phenomena. The main idea is the transformation of a material object into a virtual copy. It should not be confused with digital humanities, which uses digitizing technology to specifically help with research. There have been several debates concerning the efficiency of the process of digitizing heritage. Some of the drawbacks refer to the deterioration and technological obsolescence due to the lack of funding for archival materials and underdeveloped policies that would regulate such a process. Another main social debate has taken place around the restricted accessibility due to the digital divide that exists around the world. Nevertheless, new technologies enable easy, instant and cross boarder access to the digitized work. Many of these technologies include spatial and surveying technology to gain aerial or 3D images. Digital heritage is also used to monitor cultural heritage sites over years to help with preservation, maintenance, and sustainable tourism. It aims to observe any changes, diseases, or deterioration that may occur on objects. == Cultural and natural heritage == Digital Heritage that is not born-digital can be divided into two separate groups—digital cultural heritage and digital natural heritage. Digital cultural heritage is the maintenance or preservation of cultural objects through digitization. These are objects, in some cases entire cities, that are considered of cultural importance. These objects are sometimes able to be digitized or physically represented in minute detail. Digital cultural heritage also includes intangible heritage. These are things such as "oral traditions, customs, value systems, skills, traditional dances, diets, performances" and other unique features of a culture. Intangible heritage is particularly vulnerable to destruction due to urbanization. There are several projects and programs which concentrate on digital cultural heritage. One such project is Mapping Gothic France, which aims to document and preserve cathedrals across France using images, VR tours, laser scans, and panoramas. This allows for scientific and historical study and preservation of the cathedrals and also provides detailed access to the sites for anyone in the world. The aim of projects like these is to help with the preservation and restoration of cultural objects. After the fire at Notre-Dame de Paris in 2019, digital scans are a major component in the ongoing restoration. Digital natural heritage pertains to objects of natural heritage that are considered of cultural, scientific, or aesthetic importance. Digital heritage in this instance is used not only to grant access to these objects, but to monitor any changes over time, such as with plant or animal habitats. Geographic information systems are a form of technology that is used primarily in the study of natural heritage. Western Australia has one such digital heritage project where they have created a digital repository of native plants important to both the region and the Aboriginal people. This is in order to protect and preserve the important biological heritage of Western Australia. == Educational impact == The digitization of these heritage objects has impacts around the world and across many disciplines. The increase of digital items means that people, especially the youth, are able to learn about new objects and cultures online through various media. They provide viewers with a more in-depth experience with an item or place, instead of just an image. The media is also able to be curated to age- or educational-level appropriateness, making learning easier. Some of the technology used in education, especially in museums, includes mobile apps, virtual reality, social media, and video games. Cultural heritage institutions are using this technology to try to expand access, increase appreciation for these items, and to gain new viewpoints on their collections. Digital heritage also helps scientists, archeologists, or other historians and specialists collect data on these objects, providing more information on the objects and the past. Digital Heritage is still currently being studied and improved by several sectors invested in cultural and intellectual preservation. It is particularly of interest to museums, governments, and academic institutions. Research by these groups are creating new concepts, methodologies, and techniques for the implementation of digital heritage to protect this type of cultural and natural heritage. As new technologies are created, museums and other heritage institutions are provided with more ways of disseminating their information and engaging with the public. A lack of resources within certain groups may still hinder everyone from accessing digital heritage. == Technologies used == The digitization of cultural heritage is attained through several means. Some of the main technology used is spatial and surveying technology. Space archaeological technology - Observations from space satellites are non-intrusive and can be integrated with other technologies on the ground. It is used to photograph vast areas of earth and help with research. Remnants of ancient civilizations or other human objects are also able to be spotted via satellite imaging. Unmanned aerial vehicles - UAV, such as drones, are commonly used in digitization of cultural heritage objects. The Great Wall of China is one such site that has been digitized and analyzed through unmanned aerial vehicle investigation. The resulting images, 3-D scans, maps, and other data are used to evaluate and maintain the Great Wall. Laser Scanning - Laser scanning is used to scan an area and recreate spatially accurate depictions, such as a 3D model. Virtual and Augmented Reality - VR is used primarily for education but does have uses for reconstruction and research. It is used to provide users with an immersive experience, as though they are actually at the site. Geographic Information systems - GIS are used primarily to study objects and sites over time. It is also important in studying the socioeconomic status of the past. 3D Modeling - 3D modeling has become more widely used due to an increase in technology that works specifically with heritage sites. It is often used in tandem with GIS to reconstruct objects for restoration, documentation, preservation, and educational purposes. Data is collected using satellite or other aerial imaging and ground-based imaging. There is some concern about the accuracy and authenticity of these types of digital reconstructions and their effects on the sites themselves. A major barrier to digital heritage is the amount of resources it takes to undertake such projects, such as money, time, and technology. Money and the lack of qualified personnel are two that are considered the most obstructive. This is especially an issue in less developed areas or within underfunded groups such as minorities. == Virtual heritage == A particular branch of digital heritage, known as "virtual heritage", is formed by the use of information technology with the aim of recreating the experience of existing cultural heritage, as in (approximations of) virtual reality. It is hard to differentiate this branch from the core contribution of digital heritage which is storing the heritage data digitally. Parsinejad et al. developed two techniques for Digital Twinning of the architectural assets and representation of the physical assets virtually in the museum context. Two techniques are hand recording and digital recording and both have challenges in adoption and implementation of Digital Twin as a revolutionary concept. == Digital heritage stewardship == Digital heritage stewardship is a form of digital curation which is modeled after collaborative curation. Digital heritage stewardship means stepping away from typical curatorial practices (e.g. discovering, arranging, and sharing information, material, and/or content) in favor of practices which allow its stakeholders the opportunity to contribute historical, political, and social context and culture. The collaborative practice encourages the creation, engagement, and maintena

Rabbit r1

The Rabbit r1 is an artificial intelligence personal assistant device developed by the American technology startup Rabbit Inc and co-designed by Teenage Engineering. It was announced at the 2024 Consumer Electronics Show as a handheld device intended to perform digital tasks through voice commands, touch interaction, and web-based AI agents. The r1 was marketed around Rabbit's concept of a "large action model" (LAM), which the company described as software able to operate websites and services on behalf of users. The device runs rabbitOS, an operating system based on the Android Open Source Project. Its services have included AI search, image recognition, voice interaction, music playback, rideshare and food-ordering integrations, and later experimental web-agent features such as LAM Playground and teach mode. Initial reviews were largely negative, with reviewers criticizing the device's limited functionality, bugs, and unclear advantages over a smartphone. Critics also questioned Rabbit's claims after the r1 software was shown to run on an Android phone. Rabbit continued to issue software updates after launch, including rabbitOS 2 in September 2025, which introduced a redesigned card-based interface, gesture navigation, and a "creations" feature for generating small software tools and experiences on the device. Rabbit Inc was founded by Jesse Lyu Cheng. == Hardware == Display: A 2.88-inch touchscreen for interactive user input. Input: push-to-talk button to activate voice commands; scroll wheel; Gyroscope; Magnetometer; Accelerometer; GPS. Camera: 8 MP single camera, with a resolution of 3264x2448, allowing for the connected external AI to use computer vision. Audio: Equipped with a speaker and dual microphones for audio interaction. Connectivity: Supports Wi-Fi and cellular connections via a SIM card slot to access internet services. Processor: Runs on a 2.3GHz MediaTek Helio P35 processor. Memory: Contains 4GB of RAM for operational tasks. Storage: Offers 128GB of internal storage for data. Ports: Utilizes a USB-C port for charging and data connections. == Software == The Rabbit r1 runs rabbitOS, which is based on the Android Open Source Project (AOSP), specifically Android 13. Rabbit founder Jesse Lyu described rabbitOS as a "very bespoke AOSP" after reports that the r1's software could be run on a conventional Android phone. Rabbit described the r1 as using a large action model (LAM), a type of AI agent intended to perform tasks across software interfaces rather than only answer questions. At launch, the device supported a limited set of services, including AI search, vision features, music playback, and some third-party integrations. Perplexity.ai was one of the AI services used to answer user queries. In 2024, Rabbit released several software updates that added features and attempted to address early criticism of the device. In July 2024, the company launched "beta rabbit", an advanced search and conversation mode for more complex queries. In October 2024, it released LAM Playground, a web-based agent feature intended to let the r1 operate websites on behalf of users. Reviewers found the feature experimental; Android Authority reported that it could perform some navigation tasks but struggled with CAPTCHAs, loops, and unintended behavior. In November 2024, Rabbit introduced a beta "teach mode", which allowed users to demonstrate web-based tasks in the Rabbithole web portal and later ask the r1 to repeat them. The company described teach mode as experimental, and The Verge noted that Rabbit warned users that results could be unpredictable and that CAPTCHA-protected sites could cause problems. Rabbit released rabbitOS 2 in September 2025. The update redesigned the interface around a card-based layout, added additional touchscreen gestures, and introduced "creations", a feature that lets users generate simple software tools, games, and interfaces through natural-language prompts. Coverage of the update described it as a major software overhaul rather than new hardware. == Reception == === Funding === Rabbit raised $20 million in funding from Khosla Ventures, Synergis Capital and Kakao Investment in October 2023. The company announced an additional $10 million in funding in December 2023. === Sales === Following its announcement at the 2024 Consumer Electronics Show, 130,000 units were sold. On August 13, 2024, Rabbit announced that sales of r1 had expanded to the entire European Union (except Malta) and United Kingdom. On August 21, 2024, sales of r1 expanded to Singapore. === Reviews === The r1 was met with strong criticism immediately after Rabbit began shipping the device. Some reviews questioned what the device was able to do that a smartphone could not, while comparing it to the similar Humane Ai Pin. YouTuber Marques Brownlee called the device "barely reviewable". Android Authority's Mishaal Rahman managed to install Rabbit r1's software on a Pixel 6a smartphone, after a tipster shared an APK file. The Verge echoed the claims made by Rahman. In response, Lyu published statements confirming its use of Android, but denying that the r1 is an Android app. Mashable called its Vision features impressive, but said that "these praise-worthy features are overshadowed by buggy performance". Ars Technica wrote a blog post claiming "the company is blocking access from bootleg APKs". TechCrunch gave a slightly more positive review, calling the device a "fun peep at a possible future", but could not "advise anyone to buy one now." Shortly after the launch of r1, Rabbit began a weekly cadence of software updates to address much of the criticism from the early reviews, including "battery and GPS performance, time zone selection, and more". Digital Trends said the Magic Camera feature "takes the most mundane, ordinary, and badly composed photos and makes something fun and eye-catching from them." Mashable said the "beta rabbit" feature "makes Rabbit R1 more conversational and intelligent". Later coverage noted that Rabbit continued to update the r1 after its poorly received launch. The Verge reported in September 2024 that about 5,000 of roughly 100,000 purchasers were using the device at any given moment, citing Lyu, and described the product as having launched before it was ready. In 2025, coverage of rabbitOS 2 described the update as an attempt to reset the device's software experience after the criticism of its original release. == Controversies == === GAMA project === Rabbit Inc has garnered attention due to allegations surrounding its funding and the company's past projects. The company came under scrutiny when Stephen Findeisen, known as Coffeezilla on YouTube, published a video in May 2024, alleging that Rabbit Incorporation was "built on a scam". Rabbit Incorporation, initially named Cyber Manufacturing Co, rebranded just two months before launching the Rabbit R1. The company, under its former name, raised $6 million in November 2021 for a project called GAMA, described as a "Next Generation NFT Project." Jesse Lyu, the CEO of Rabbit Incorporation, referred to GAMA as a "fun little project." Coffeezilla, who investigates influencer scams, highlighted old Clubhouse recordings of Jesse Lyu discussing the GAMA project. In these recordings, Lyu emphasized the substantial funding behind GAMA and its potential to be a revolutionary, carbon-negative cryptocurrency. Coffeezilla questioned the whereabouts of the funds raised for GAMA, estimating that approximately $1 million in refunds to investors remained unresolved. He suggested that the rebranding to Rabbit Incorporation and the shift to developing the Rabbit R1 were attempts to divert from the GAMA project's issues. In response to Coffeezilla's inquiries, Rabbit Incorporation stated that the $6 million raised was used for the GAMA project. The company said that NFTs cannot be refunded unless the owner agrees to "burn" them on the blockchain. Rabbit Incorporation also said that the GAMA project was open-sourced and returned to the community, aligning with community feedback. They also mentioned that efforts to buy back NFTs were made to counteract malicious trading and maintain market stability. === Security === In June 2024, Engadget reported that the Rabbitude team, a community reverse engineering project, had gained access to the r1's codebase revealing that r1's software contained several hardcoded API keys in its code for ElevenLabs, Microsoft Azure, Yelp, and Google Maps, potentially allowing unauthorized access to r1 responses, including those containing the users' personal information. For a short time, Rabbit immediately began revoking and rotating those secrets and confirmed that the code was leaked by an employee who had "been terminated and remains under investigation". In July 2024, the company revealed that all user chats and device pairing data were logged on the r1 with no ability to delete them. This meant that lost or stolen devices could be used to extract user

WYSIWYM (interaction technique)

What you see is what you meant (WYSIWYM) is a text editing interaction technique that emerged from two projects at University of Brighton. It allows users to create abstract knowledge representations such as those required by the Semantic Web using a natural language interface. Natural language understanding (NLU) technology is not employed. Instead, natural language generation (NLG) is used in a highly interactive manner. The text editor accepts repeated refinement of a selected span of text as it becomes progressively less vacuous of authored semantics. Using a mouse, a text property held in the evolving text can be further refined by a set of options derived by NLG from a built-in ontology. An invisible representation of the semantic knowledge is created which can be used for multilingual document generation, formal knowledge formation, or any other task that requires formally specified information. The two projects at Brighton worked in the field of Conceptual Authoring to lay a foundation for further research and development of a Semantic Web Authoring Tool (SWAT). This tool has been further explored as a means for developing a knowledge base by those without prior experience with Controlled Natural Language tools.

Common Crawl

The Common Crawl Foundation (Common Crawl) is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl was founded by Gil Elbaz. The data had mostly been primarily used by researchers and some startups until the 2020s, when AI companies started training large language models using the data. In November 2025, an investigation by The Atlantic revealed that Common Crawl misled publishers when it claimed it respected paywalls in its scraping and it was not honoring requests from publishers to have their content removed from its databases. == History == Common Crawl was founded in 2007 in San Francisco. It began publishing its crawls in 2011. By 2013, sites like TinEye were building their products off of Common Crawl. The crawl reduces the reliance of companies and researchers on Google, which has the biggest dataset. Common Crawl was designed to have more and fresher data that was more efficient to analyze and utilize than the Wayback Machine created by the Internet Archive. By 2015, 1.8 billion webpages were on the Common Crawl, which started by crawling a list of URLs donated by the search engine Blekko. They use Amazon Web Services, which provides some of its services for free, allowing computing costs to average $2-4000/month. The Common Crawl website listed 30 studies based on Common Crawl data. Before 2023, Common Crawl was not very well known outside of academic researchers who utilize the data. Common Crawl received its first requests to redact information in 2023 and increasingly started seeing its crawler, CCBot, blocked. In 2023, it began receiving significant financial support from AI companies, including Anthropic and OpenAI, each of which donated $250,000. It was also used to train Google DeepMind's large language model Gemini. By April 2023, Common Crawl was capturing 3.1 billion webpages, with an estimated 5% of pages before 2021 containing hate speech or slurs. As of 2024, Common Crawl had been cited in more than 10,000 academic studies. By 2024, The Pile and Common Crawl had been the two main training datasets being used to train AI models. In November 2025, an investigation by technology journalist Alex Reisner for The Atlantic revealed that Common Crawl misled publishers when it claimed it respected paywalls in its scraping and when it said that it was honoring requests from publishers to have their content removed from its databases. It included misleading results in the public search function on its website that showed no entries for websites that had requested their archives be removed, when in fact those sites were still included in its scrapes used by AI companies. As of 2025, Reisner found that CCBot was the most widely-blocked bot by the top 1000 websites. A 2026 article in LWN.net discussed an advantage to services like Common Crawl being that it can limit the scraping costs to websites by allowing companies and researchers to download the data from Common Crawl instead of scraping it themselves. In April 2026, Common Crawl experimentally began to distribute its data through Hugging Face Storage Bucket, in addition to its standard storage on Amazon S3. == Organization == Peter Norvig and Joi Ito have served on the advisory board. Rich Skrenta is the executive director. It has received funding almost exclusively from the Elbaz Family Foundation Trust until 2023 when it started receiving donations from the AI industry. == Refined versions == A number of organizations take raw Common Crawl data and refine it into datasets that exclude edgy content or are otherwise higher-quality for their purposes, such as FineWeb, DCLM and C4. === Colossal Clean Crawled Corpus === Google version of the Common Crawl is called the Colossal Clean Crawled Corpus, or C4 for short. It was constructed for the training of the T5 language model series in 2019. As of 2023, there were some concerns over copyrighted content in the C4 as well as racist content. A 2024 study found that 45% of content was explicitly restricted by websites' terms of service to be used for purposes like AI training by for-profit companies.

Fei-Fei Li

Fei-Fei Li (Chinese: 李飞飞; pinyin: Lǐ Fēifēi; born July 3, 1976) is a Chinese-born American computer scientist best known for establishing ImageNet, the dataset that enabled rapid advances in computer vision in the 2010s. She is a professor of computer science at Stanford University, with research expertise in artificial intelligence, machine learning, deep learning, computer vision, and cognitive neuroscience. Li is a co-director of the Stanford Institute for Human-Centered Artificial Intelligence and a co-director of the Stanford Vision and Learning Lab, and served as Chief Scientist of AI/ML at Google Cloud and the director of the Stanford Artificial Intelligence Laboratory from 2013 to 2018. In 2017, she co-founded AI4ALL, a nonprofit organization working to increase diversity in the field of artificial intelligence. In 2023, Li was named one of the Time 100 AI Most Influential People. Li received the Intel Lifetime Achievements Innovation Award in 2017 for her contributions to artificial intelligence, and was elected member of the National Academy of Engineering, the National Academy of Medicine in 2020 and the American Academy of Arts and Sciences in 2021. In 2025, she was named as one of the "Architects of AI" for Time's Person of the Year. On August 3, 2023, Li was appointed to the United Nations Scientific Advisory Board, established by Secretary-General Antonio Guterres. In 2024, Li was included on the Gold House's most influential Asian A100 list. In 2024, she raised $230 million for a startup called World Labs, which she and three colleagues founded to develop a "spatial intelligence" AI technology that can understand how the three-dimensional physical world works. In 2026, World Labs raised $1 Billion. == Early life and education == Li was born in Beijing, China, in 1976 and grew up in Chengdu, Sichuan. She studied at Sichuan Chengdu No.7 High School. When she was 12, her father immigrated to Parsippany, New Jersey. When she was 16, Li and her mother joined him in the United States. While attending Parsippany High School, Li worked weekends at her family's dry-cleaning shop. She graduated from Parsippany High School in 1995. She was inducted into the hall of fame at Parsippany High School in 2017. Li pursued undergraduate study at Princeton University, where she received a Bachelor of Arts with a major in physics in 1999. Li completed her senior thesis, "Auditory binaural correlogram difference: a new computational model for Huggins dichotic pitch", under the supervision of Bradley Dickinson, professor of electrical engineering. During her years at Princeton, Li returned home most weekends to help run her family's dry cleaning business and worked as a dishwasher to supplement the family income. Li pursued graduate study at the California Institute of Technology, where she received a Master of Science in electrical engineering in 2001 and a Doctor of Philosophy in electrical engineering in 2005. Li completed her dissertation, "Visual Recognition: Computational Models and Human Psychophysics", under the primary supervision of Pietro Perona and secondary supervision of Christof Koch. Her graduate studies were supported by the National Science Foundation Graduate Research Fellowship and The Paul & Daisy Soros Fellowships for New Americans. == Career and research == From 2005 to 2006, Li was an assistant professor in the Electrical and Computer Engineering Department at the University of Illinois Urbana-Champaign, and from 2007 to 2009, she was an assistant professor in the Computer Science Department at Princeton University. She joined Stanford in 2009 as an assistant professor, and was promoted to associate professor with tenure in 2012, and then full professor in 2018. At Stanford, Li served as the director of Stanford Artificial Intelligence Lab (SAIL) from 2013 to 2018. Her research has focused on computer vision, deep learning, and cognitive neuroscience, with over 300 peer-reviewed publications. She became the founding co-director of Stanford's University-level initiative - the Human-Centered AI Institute, along with co-director Dr. John Etchemendy, former provost of Stanford University. The institute aligns with Li's aims to advance AI research, education, policy, and practice to improve the human condition. While at Princeton in 2007, Li led the development of ImageNet, a massive visual database designed to advance object recognition in AI. The project involved labeling over 14 million images using Amazon Mechanical Turk and inspired the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which catalyzed progress in deep learning and led to dramatic improvements in image classification performance. The database addressed a key bottleneck in computer vision: the lack of large, annotated datasets for training machine learning models. Today, ImageNet is credited as a cornerstone innovation that underpins advancements in autonomous vehicles, facial recognition, and medical imaging. On her sabbatical from Stanford University from January 2017 to fall of 2018, Li joined Google Cloud as its Chief Scientist of AI/ML and Vice President. At Google, her team focused on democratizing AI technology and lowering the barrier for entrance to businesses and developers, including the developments of products like AutoML. In September 2017, Google secured a contract from the Department of Defense called Project Maven, which aimed to use AI techniques to interpret images captured by drone cameras. Google told employees who protested the company's work on Project Maven that their role was "specifically scoped to be for non-offensive purposes". In June 2018, Google told employees it would not seek renewal of the contract. In internal emails which were later leaked to reporters, Li expressed enthusiasm for the Google Cloud role in Project Maven, but warned against mentioning its AI component, saying that military AI is linked in the public mind with the danger of autonomous weapons. Asked about those leaked emails, Li told The New York Times, "I believe in human-centered AI to benefit people in positive and benevolent ways. It is deeply against my principles to work on any project that I think is to weaponize AI." In the fall of 2018, Li left Google and returned to Stanford University to continue her professorship. In 2023, Li co-led the launch of the RAISE-Health (Responsible AI for Safe and Equitable Health) initiative at Stanford University in collaboration with Stanford medicine. The initiative aims to develop frameworks for the responsible use of artificial intelligence in healthcare, including clinical care, biomedical research, and patient safety. According to her Stanford profile, she has been on partial academic leave from January 2024 through the end of 2025 to focus on entrepreneurial ventures. In 2024, Li said there was a disparity between private-sector investment in AI and support for academic and government research, and called for greater public funding for scientific uses of the technology and for studying its risks. Li is also known for her non-profit work as the co-founder and chairperson of nonprofit organization AI4ALL, whose mission is to educate the next generation of AI technologists, thinkers and leaders by promoting diversity and inclusion through human-centered AI principles. The program was created in collaboration with Melinda French Gates and Jensen Huang. Prior to establishing AI4ALL in 2017, Li and her former student Olga Russakovsky, currently an assistant professor in Princeton University, co-founded and co-directed the precursor program at Stanford called SAILORS (Stanford AI Lab OutReach Summers). SAILORS was an annual summer camp at Stanford dedicated to 9th grade high school girls in AI education and research, established in 2015 till it changed its name to AI4ALL @Stanford in 2017. In 2018, AI4ALL has successfully launched five more summer programs in addition to Stanford, including Princeton University, Carnegie Mellon University, Boston University, University of California Berkeley, and Canada's Simon Fraser University. We are at a turning point. AI's influence continues to grow, but representation and inclusion of a diversity of researchers in the field does not. It's critical that we seize this moment to create structures that will support long-term, positive changes. This won't happen via a single mechanism or quick fix. It starts with early education and extends to the existing structures of power within academia, work cultures among current AI researchers, and gatekeeping functions of research publishing, to name a few levers of change. Li has been described as a "researcher bringing humanity to AI". Li was elected as a member of the American Academy of Arts and Sciences in 2021, the National Academy of Engineering in 2020, and the National Academy of Medicine in 2020. In a November 2023 interview with The Guardian, Li said that while she would not refer to herself as the "godmother

Spell checker

In software, a spell checker (or spelling checker or spell check) is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic dictionary, or search engine. == Design == A basic spell checker carries out the following processes: It scans the text and extracts the words contained in it. It then compares each word with a known list of correctly spelled words (i.e. a dictionary). This might contain just a list of words, or it might also contain additional information, such as hyphenation points or lexical and grammatical attributes. An additional step is a language-dependent algorithm for handling morphology. Even for a lightly inflected language like English, the spell checker will need to consider different forms of the same word, such as plurals, verbal forms, contractions, and possessives. For many other languages, such as those featuring agglutination and more complex declension and conjugation, this part of the process is more complicated. It is unclear whether morphological analysis—allowing for many forms of a word depending on its grammatical role—provides a significant benefit for English, though its benefits for highly synthetic languages such as German, Hungarian, or Turkish are clear. As an adjunct to these components, the program's user interface allows users to approve or reject replacements and modify the program's operation. Spell checkers can use approximate string matching algorithms such as Levenshtein distance to find correct spellings of misspelled words. An alternative type of spell checker uses solely statistical information, such as n-grams, to recognize errors instead of correctly-spelled words. This approach usually requires a lot of effort to obtain sufficient statistical information. Key advantages include needing less runtime storage and the ability to correct errors in words that are not included in a dictionary. In some cases, spell checkers use a fixed list of misspellings and suggestions for those misspellings; this less flexible approach is often used in paper-based correction methods, such as the see also entries of encyclopedias. Clustering algorithms have also been used for spell checking combined with phonetic information. == History == === Pre-PC === In 1961, Les Earnest, who headed the research on this budding technology, saw it necessary to include the first spell checker that accessed a list of 10,000 acceptable words. Ralph Gorin, a graduate student under Earnest at the time, created the first true spelling checker program written as an applications program (rather than research) for general English text: SPELL for the DEC PDP-10 at Stanford University's Artificial Intelligence Laboratory, in February 1971. Gorin wrote SPELL in assembly language, for faster action; he made the first spelling corrector by searching the word list for plausible correct spellings that differ by a single letter or adjacent letter transpositions and presenting them to the user. Gorin made SPELL publicly accessible, as was done with most SAIL (Stanford Artificial Intelligence Laboratory) programs, and it soon spread around the world via the new ARPAnet, about ten years before personal computers came into general use. SPELL, its algorithms and data structures inspired the Unix ispell program. The first spell checkers were widely available on mainframe computers in the late 1970s. A group of six linguists from Georgetown University developed the first spell-check system for the IBM corporation. Henry Kučera invented one for the VAX machines of Digital Equipment Corp in 1981. === Unix === The International Ispell program commonly used in Unix is based on R. E. Gorin's SPELL. It was converted to C by Pace Willisson at MIT. The GNU project has its spell checker GNU Aspell. Aspell's main improvement is that it can more accurately suggest correct alternatives for misspelled English words. Due to the inability of traditional spell checkers to check words in complex inflected languages, Hungarian László Németh developed Hunspell, a spell checker that supports agglutinative languages and complex compound words. Hunspell also uses Unicode in its dictionaries. Hunspell replaced the previous MySpell in OpenOffice.org in version 2.0.2. Enchant is another general spell checker, derived from AbiWord. Its goal is to combine programs supporting different languages such as Aspell, Hunspell, Nuspell, Hspell (Hebrew), Voikko (Finnish), Zemberek (Turkish) and AppleSpell under one interface. === PCs === The first spell checkers for personal computers appeared in 1980, such as "WordCheck" for Commodore systems which was released in late 1980 in time for advertisements to go to print in January 1981. Developers such as Maria Mariani and Random House rushed OEM packages or end-user products into the rapidly expanding software market. On the pre-Windows PCs, these spell checkers were standalone programs, many of which could be run in terminate-and-stay-resident mode from within word-processing packages on PCs with sufficient memory. However, the market for standalone packages was short-lived, as by the mid-1980s developers of popular word-processing packages like WordStar and WordPerfect had incorporated spell checkers in their packages, mostly licensed from the above companies, who quickly expanded support from just English to many European and eventually even Asian languages. However, this required increasing sophistication in the morphology routines of the software, particularly with regard to heavily-agglutinative languages like Hungarian and Finnish. Although the size of the word-processing market in a country like Iceland might not have justified the investment of implementing a spell checker, companies like WordPerfect nonetheless strove to localize their software for as many national markets as possible as part of their global marketing strategy. When Apple developed "a system-wide spelling checker" for Mac OS X so that "the operating system took over spelling fixes," it was a first: one "didn't have to maintain a separate spelling checker for each" program. Mac OS X's spellcheck coverage includes virtually all bundled and third party applications. Visual Tools' VT Speller, introduced in 1994, was "designed for developers of applications that support Windows." It came with a dictionary but had the ability to build and incorporate use of secondary dictionaries. === Browsers === Web browsers such as Firefox and Google Chrome offer spell checking support, using Hunspell. Prior to using Hunspell, Firefox and Chrome used MySpell and GNU Aspell, respectively. === Specialties === Some spell checkers have separate support for medical dictionaries to help prevent medical errors. == Functionality == The first spell checkers were "verifiers" instead of "correctors." They offered no suggestions for incorrectly spelled words. This was helpful for typos but it was not so helpful for logical or phonetic errors. The challenge the developers faced was the difficulty in offering useful suggestions for misspelled words. This requires reducing words to a skeletal form and applying pattern-matching algorithms. It might seem logical that where spell-checking dictionaries are concerned, "the bigger, the better," so that correct words are not marked as incorrect. In practice, however, an optimal size for English appears to be around 90,000 entries. If there are more than this, incorrectly spelled words may be skipped because they are mistaken for others. For example, a linguist might determine on the basis of corpus linguistics that the word baht is more frequently a misspelling of bath or bat than a reference to the Thai currency. Hence, it would typically be more useful if a few people who write about Thai currency were slightly inconvenienced than if the spelling errors of the many more people who discuss baths were overlooked. The first MS-DOS spell checkers were mostly used in proofing mode from within word processing packages. After preparing a document, a user scanned the text looking for misspellings. Later, however, batch processing was offered in such packages as Oracle's short-lived CoAuthor and allowed a user to view the results after a document was processed and correct only the words that were known to be wrong. When memory and processing power became abundant, spell checking was performed in the background in an interactive way, such as has been the case with the Sector Software produced Spellbound program released in 1987 and Microsoft Word since Word 95. Spell checkers became increasingly sophisticated; now capable of recognizing grammatical errors. However, even at their best, they rarely catch all the errors in a text (such as homophone errors) and will flag neologisms and foreign words as misspellings. Nonetheless, spell checkers can be considered as a type of foreign language writing aid that non-native language lea

Hubert Dreyfus's views on artificial intelligence

Hubert Dreyfus was a critic of artificial intelligence research. In a series of papers and books, including Alchemy and AI (1965), What Computers Can't Do (1972; 1979; 1992) and Mind over Machine (1986), he presented a skeptical and cautious assessment of AI's progress and a critique of the philosophical foundations of the field. Dreyfus' objections are discussed in most introductions to the philosophy of artificial intelligence, including Russell & Norvig (2021), a standard AI textbook, and in Fearn (2007), a survey of contemporary philosophy. Dreyfus argued that human intelligence and expertise depend primarily on yet-to-be understood informal and unconscious processes rather than symbolic manipulation and that these essentially human skills cannot be fully captured in formal rules. His critique was based on the insights of modern continental philosophers such as Merleau-Ponty and Heidegger, and was directed at the first wave of AI research which tried to reduce intelligence to high level formal symbols. When Dreyfus' ideas were first introduced in the mid-1960s, they were met in the AI community with ridicule and outright hostility. By the 1980s, however, some of his perspectives were rediscovered by researchers working in robotics and the new field of connectionism—approaches that were called "sub-symbolic" at the time because they eschewed early AI research's emphasis on high level symbols. In the 21st century, "sub-symbolic" artificial neural networks and other statistics-based approaches to machine learning were highly successful. Historian and AI researcher Daniel Crevier wrote: "time has proven the accuracy and perceptiveness of some of Dreyfus's comments." Dreyfus said in 2007, "I figure I won and it's over—they've given up." == Dreyfus' critique == === The grandiose promises of artificial intelligence === In Alchemy and AI (1965) and What Computers Can't Do (1972), Dreyfus summarized the history of artificial intelligence and ridiculed the unbridled optimism that permeated the field. For example, Herbert A. Simon, following the success of his program General Problem Solver (1957), predicted that by 1967: A computer would be world champion in chess. A computer would discover and prove an important new mathematical theorem. Most theories in psychology will take the form of computer programs. The press dutifully reported these predictions of the imminent arrival of machine intelligence. Dreyfus felt that this optimism was unwarranted and, in 1965, argued forcefully that predictions like these would not come true. He would eventually be proven right. Pamela McCorduck explains Dreyfus' position: A great misunderstanding accounts for public confusion about thinking machines, a misunderstanding perpetrated by the unrealistic claims researchers in AI have been making, claims that thinking machines are already here, or at any rate, just around the corner. These predictions were based on the success of the cognitive revolution, which promoted an "information processing" model of the mind. It was articulated by Newell and Simon in their physical symbol systems hypothesis, and later expanded into a philosophical position known as computationalism by philosophers such as Jerry Fodor and Hilary Putnam. In AI, the approach is now called symbolic AI or "GOFAI". Dreyfus argued that "symbolic AI" was the latest version of the ancient program of rationalism in philosophy. Rationalism had come under heavy criticism in the 20th century from philosophers like Martin Heidegger and Edmund Husserl. The mind, according to modern continental philosophy, is not "rationalist" and is nothing like a digital computer. Cognitivism led early AI researchers to believe that they had successfully simulated the essential process of human thought, thus it seemed a short step to producing fully intelligent machines. Dreyfus' last paper detailed the ongoing history of the "first step fallacy", where AI researchers tend to wildly extrapolate initial success as promising, perhaps even guaranteeing, wild future successes. === Dreyfus' four assumptions of artificial intelligence research === In Alchemy and AI and What Computers Can't Do, Dreyfus identified four philosophical assumptions, at least one of which he deems necessary for AI to succeed. "In each case," Dreyfus writes, "the assumption is taken by workers in AI as an axiom, guaranteeing results, whereas it is, in fact, one hypothesis among others, to be tested by the success of such work." Dreyfus argues that AI would be impossible without accepting at least one of these four assumptions: The biological assumption The brain processes information in discrete operations by way of some biological equivalent of on/off switches. In the early days of research into neurology, scientists found that neurons fire in all-or-nothing pulses. Several researchers, such as Walter Pitts and Warren McCulloch, speculated with great confidence that neurons functioned similarly to the way Boolean logic gates operate, and so could be imitated by electronic circuitry at the level of the neuron. When digital computers became widely used in the early 50s, this argument was extended to suggest that the brain was a vast physical symbol system, manipulating the binary symbols of zero and one. Dreyfus was able to refute the biological assumption by citing research in neurology that suggested that the action and timing of neuron firing had analog components. But Daniel Crevier observes that "few still held that belief in the early 1970s, and nobody argued against Dreyfus" about the biological assumption. The psychological assumption The mind can be viewed as a device operating on bits of information according to formal rules. He refuted this assumption by showing that much of what we know about the world consists of complex attitudes or tendencies that make us lean towards one interpretation over another. He argued that, even when we use explicit symbols, we are using them against an unconscious and informal background including commonsense knowledge and that without this background our symbols cease to mean anything. This background, in Dreyfus' view, was not implemented in individual brains as explicit individual symbols with explicit individual meanings. The epistemological assumption All knowledge can be formalized. This concerns the philosophical issue of epistemology, or the study of knowledge. Even if we agree that the psychological assumption is false, AI researchers could still argue (as AI founder John McCarthy has) that it is possible for a symbol processing machine to represent all knowledge, regardless of whether human beings represent knowledge the same way. Dreyfus argued that there is no justification for this assumption, since so much of human knowledge is not symbolic or even expressible using formal constructs. The ontological assumption The world consists of independent facts that can be represented by independent symbols AI researchers (and futurists and science fiction writers) often assume that there is no limit to formal, scientific knowledge, because they assume that any phenomenon in the universe can be described by symbols or scientific theories. This assumes that everything that exists can be understood as objects, properties of objects, classes of objects, relations of objects, and so on: precisely those things that can be described by logic, language and mathematics. The study of being or existence is called ontology, and so Dreyfus calls this the ontological assumption. If this is false, then it raises doubts about what we can ultimately know and what intelligent machines will ultimately be able to help us to do. === Knowing-how vs. knowing-that: the primacy of intuition === In Mind Over Machine (1986), written (with his brother) during the heyday of expert systems, Dreyfus analyzed the difference between human expertise and the programs that claimed to capture it. This expanded on ideas from What Computers Can't Do, where he had made a similar argument criticizing the "cognitive simulation" school of AI research practiced by Allen Newell and Herbert A. Simon in the 1960s. Dreyfus argued that human problem solving and expertise depend on our background sense of the context, of what is important and interesting given the situation, rather than on the process of searching through combinations of possibilities to find what we need. Dreyfus would describe it in 1986 as the difference between "knowing-that" and "knowing-how", based on Heidegger's distinction of present-at-hand and ready-to-hand. Knowing-that is our conscious, step-by-step problem solving abilities. We use these skills when we encounter a difficult problem that requires us to stop, step back and search through ideas one at time. At moments like this, the ideas become very precise and simple: they become context free symbols, which we manipulate using logic and language. These are the skills that Newell and Simon had demonstrated with both psy