ACL Data Collection Initiative

ACL Data Collection Initiative

The ACL Data Collection Initiative (ACL/DCI) was a project established in 1989 by the Association for Computational Linguistics (ACL) to create and distribute large text and speech corpora for computational linguistics research. The initiative aimed to address the growing need for substantial text databases that could support research in areas such as natural language processing, speech recognition, and computational linguistics. By 1993, the initiative’s activities had effectively ceased, with its functions and datasets absorbed by the Linguistic Data Consortium (LDC), which was founded in 1992. == Objectives == The ACL/DCI had several key objectives: To acquire a large and diverse text corpus from various sources To transform the collected texts into a common format based on the Standard Generalized Markup Language (SGML) To make the corpus available for scientific research at low cost with minimal restrictions To provide a common database that would allow researchers to replicate or extend published results To reduce duplication of effort among researchers in obtaining and preparing text data These objectives were designed to address the growing demand for very large amounts of text arising from applications in recognition and analysis of text and speech. Its core objective was to "oversee the acquisition and preparation of a large text corpus to be made available for scientific research at cost and without royalties". == History == By the late 1980s, researchers in computational linguistics and speech recognition faced a significant problem: the lack of large-scale, accessible text corpora for developing statistical models and testing algorithms. Existing generally available text databases were too small to meet the needs of developing applications in text and speech recognition. The initiative was formed to meet this need by collecting, standardizing, and distributing large quantities of text data with minimal restrictions for scientific research. As stated by Liberman (1990), "research workers have been severely hampered by the lack of appropriate materials, and specially by the lack of a large enough body of text on which published results can be replicated or extended by others." The ACL/DCI committee was established in February 1989. The committee included members from academic and industrial research laboratories in the United States and Europe. The initiative was chaired by Mark Liberman from the University of Pennsylvania (formerly of AT&T Bell Laboratories). Other committee members included representatives from organizations such as Bellcore, IBM T.J. Watson Research Center, Cambridge University, Virginia Polytechnic Institute & State University, Northeastern University, University of Pennsylvania, SRI International, MCC, Xerox PARC, ISSCO, and University of Pisa. The project operated initially without dedicated funding, relying on volunteer efforts from committee members and their affiliated institutions. Key supporters included AT&T Bell Labs, Bellcore, IBM, Xerox, and the University of Pennsylvania, which allowed the use of their computing facilities for ACL/DCI-related work. Previously running on volunteer effort pro bono, in 1991, it obtained funding from General Electric and the National Science Foundation (IRI-9113530). == Data == As of 1990, the ACL/DCI had collected hundreds of millions of words of diverse text. The collection included: Wall Street Journal articles (25 to 50 million words); Canadian Hansard (parliamentary records) in parallel English and French versions: cleaned-up English Hansard donated by the IBM alignment models group (100 million words), and original Bilingual Hansard (from a different time period) obtained directly (200 million words). Collins English Dictionary (1979 edition), both as fulltext (3 million words) and as various "database" versions, constructed using "typographers' tape" donated by Collins, which were computer tapes containing the structured digital data used to typeset and print the 1979 edition of the dictionary; Emails from ARPANET newsletters for the ACM Special Interest Group on Information Retrieval Forum (IRLIST) and AIList Digest issues distributed over the ARPANET (AILIST) (5 million words), both collected by Edward A. Fox at VIPSU; Articles on networking (2 million words); U.S. Department of Agriculture Extension Service Fact Sheets (>1 million words); 200,000 scientific abstracts of about 1,500 words each from the Department of Energy (25 million words); Archives of the Challenger Investigation Commission, including transcripts of depositions and hearings (2.5 million words); Books from the Library of America, including works by Mark Twain, Eugene O'Neill, Ralph Waldo Emerson, Herman Melville, W.E.B. DuBois, Willa Cather, and Benjamin Franklin (130 books, 20 million words); Public domain books like the King James Bible, Tristram Shandy, The Federalist Papers; Several million words of transcribed radiologists' reports, donated by Francis Ganong at Kurzweil Applied Intelligence Inc (about 5 million words); The Child Language Data Exchange corpus of child language acquisition transcripts; U.S. Department of Justice Justice Retrieval and Inquiry System (JURIS) materials; The Swiss Civil Code in parallel German, French and Italian; Economic reports from the Union Bank of Switzerland, in parallel English, German, French and Italian; About 12K words of administrative policy manuals and 14K words of administrative memos, contributed by Geoff Pullum of U.C.S.C.; Material from various ACM journals and the ACL journal Computational Linguistics; The CSLI publications series: 50-100 reports (8K words each) and 5-10 books (80K words each). The initiative started with North American English text but expanded to include Canadian French and planned to include Japanese, Chinese, and other Asian languages. At least 5 million words from the collection were tagged under the Penn Treebank project, and those tags were distributed by DCI as well. After DCI was absorbed by the LDC, the datasets were curated under LDC. == Format == The ACL/DCI corpus was coded in a standard form based on SGML (Standard Generalized Markup Language, ISO 8879), consistent with the recommendations of the Text Encoding Initiative (TEI), of which the DCI was an affiliated project. The TEI was a joint project of the ACL, the Association for Computers and the Humanities, and the Association for Literary and Linguistic Computing, aiming to provide a common interchange format for literary and linguistic data. The initiative planned to add annotations reflecting consensually approved linguistic features like part of speech and various aspects of syntactic and semantic structure over time. == Examples == As an example of the use of ACL/DCI, consider the Wall Street Journal (WSJ) corpus for speech recognition research. The WSJ corpus was used as the basis for the DARPA Spoken Language System (SLS) community's Continuous Speech Recognition (CSR) Corpus. The WSJ corpus became a standard benchmark for evaluating speech recognition systems and has been used in numerous research papers. The WSJ CSR Corpus provided DARPA with its first general-purpose English, large vocabulary, natural language, high perplexity corpus containing speech (400 hours) and text (47 million words) during 1987–89. The text corpus was 313 MB in size. The text was preprocessed to remove ambiguity in the word sequence that a reader might choose, ensuring that the unread text used to train language models was representative of the spoken test material. The preprocessing included converting numbers into orthographics, expanding abbreviations, resolving apostrophes and quotation marks, and marking punctuation. As another example, the Yarowsky algorithm used bitext data from DCI to train a simple word-sense disambiguation model that was competitive with advanced models trained on smaller datasets. == Distribution == Materials from the ACL/DCI collection were distributed to research groups on a non-commercial basis. By 1990, about 25 research groups and individual researchers had received tapes containing various portions of the collected material. To obtain the data, researchers had to sign an agreement not to redistribute the data or make direct commercial use of it. However, commercial application of "analytical materials" derived from the text, such as statistical tables or grammar rules, was explicitly permitted. The initiative first distributed data via 12-inch reels of 9-track tape, then via CD-ROMs. Each such tape could contain 30 million words compressed via the Lempel-Ziv algorithms. The first CD-ROM distribution was in 1991, funded by Dragon Systems Inc. It contained Collins English Dictionary, WSJ, scientific abstracts provided by the U.S. Department of Energy, and the Penn Treebank.

Ghana Post GPS

GhanaPostGPS is a web and smartphone application, sponsored by the government of Ghana and developed by Vokacom, to provide a digital addresses and postal codes for every 5 squared meter location in Ghana. The digital address is a composite of the postcode (region, district & area code) plus a unique address. GhanaPostGPS is the first digital addressing system created by the government of Ghana. GhanaPost GPS is a mandatory requirement for obtaining the National Identification Card and other services.

Bookmarklet

A bookmarklet is a bookmark stored in a web browser that contains JavaScript commands that add new features to the browser. They are stored as the URL of a bookmark in a web browser or as a hyperlink on a web page. Bookmarklets are usually small snippets of JavaScript executed when a user clicks on them. When clicked, bookmarklets can perform a wide variety of operations, such as running a search query from selected text or extracting data from a table. Another name for bookmarklet is favelet or favlet, derived from favorites (synonym of bookmark). == History == Steve Kangas of bookmarklets.com coined the word bookmarklet when he started to create short scripts based on a suggestion in Netscape's JavaScript guide. Before that, Tantek Çelik called these scripts favelets and used that word as early as on 6 September 2001 (personal email). Brendan Eich, who developed JavaScript at Netscape, gave this account of the origin of bookmarklets: They were a deliberate feature in this sense: I invented the javascript: URL along with JavaScript in 1995, and intended that javascript: URLs could be used as any other kind of URL, including being bookmark-able. In particular, I made it possible to generate a new document by loading, e.g. javascript:'hello, world', but also (key for bookmarklets) to run arbitrary script against the DOM of the current document, e.g. javascript:alert(document.links[0].href). The difference is that the latter kind of URL uses an expression that evaluates to the undefined type in JS. I added the void operator to JS before Netscape 2 shipped to make it easy to discard any non-undefined value in a javascript: URL. The increased implementation of Content Security Policy (CSP) in websites has caused problems with bookmarklet execution and usage (2013–2015), with some suggesting that this hails the end or death of bookmarklets. William Donnelly created a work-around solution for this problem (in the specific instance of loading, referencing and using JavaScript library code) in early 2015 using a Greasemonkey userscript (Firefox / Pale Moon browser add-on extension) and a simple bookmarklet-userscript communication protocol. It allows (library-based) bookmarklets to be executed on any and all websites, including those using CSP and having an https:// URI scheme. However, if/when browsers support disabling/disallowing inline script execution using CSP, and if/when websites begin to implement that feature, it will "break" this "fix". == Concept == Web browsers use URIs for the href attribute of the tag and for bookmarks. The URI scheme, such as http or ftp, and which generally specifies the protocol, determines the format of the rest of the string. Browsers also implement javascript: URIs that to a parser is just like any other URI. The browser recognizes the specified javascript scheme and treats the rest of the string as a JavaScript program which is then executed. The expression result, if any, is treated as the HTML source code for a new page displayed in place of the original. The executing script has access to the current page, which it may inspect and change. If the script returns an undefined type (rather than, for example, a string), the browser will not load a new page, with the result that the script simply runs against the current page content. This permits changes such as in-place font size and color changes without a page reload. An immediately invoked function that returns no value or an expression preceded by the void operator will prevent the browser from attempting to parse the result of the evaluation as a snippet of HTML markup: == Usage == Bookmarklets are saved and used as normal bookmarks. As such, they are simple "one-click" tools which add functionality to the browser. For example, they can: Modify the appearance of a web page within the browser (e.g., change font size, background color, etc.) Extract data from a web page (e.g., hyperlinks, images, text, etc.) Remove redirects from (e.g. Google) search results, to show the actual target URL Submit the current page to a blogging service such as Posterous, link-shortening service such as bit.ly, or bookmarking service such as Delicious Query a search engine or online encyclopedia with highlighted text or by a dialog box Submit the current page to a link validation service or translation service Set commonly chosen configuration options when the page itself provides no way to do this Control HTML5 audio and video playback parameters such as speed, position, toggling looping, and showing/hiding playback controls, the first of which can be adjusted beyond HTML5 players' typical range setting. Installing a bookmarklet follows the same process as adding a normal bookmark; the only difference is that in place of the URL destination field is JavaScript code preceded by javascript:. Once created, bookmarklets can be run by clicking on them.

Atomtronics

Atomtronics is an emerging field concerning the quantum technology of matter-wave circuits which coherently guide propagating ultra-cold atoms. The systems typically include components analogous to those found in electronics, quantum electronics or optical systems, such as beam splitters, transistors, and atomic counterparts of superconducting quantum interference devices (SQUIDs). Applications range from studies of fundamental physics to the development of practical devices such as quantum superfluids for the computation of large models for artificial general intelligence. == Etymology == Atomtronics is a portmanteau of "atom" and "electronics", in reference to the creation of atomic analogues of electronic components, such as transistors and diodes, and also electronic materials such as semiconductors. The field itself has considerable overlap with atom optics and quantum simulation, and is not strictly limited to the development of electronic-like components. However, this field develops into the research of ultra-cold atoms for the applied research implications of computations in quantum science. == Methodology == Three major elements are required for an atomtronic circuit. The first is a Bose-Einstein condensate, which is needed for its coherent and superfluid properties, although an ultracold Fermi gas may also be used for certain applications. The second is a tailored trapping potential, which can be generated optically, magnetically, or using a combination of both. The final element is a method to induce the movement of atoms within the potential, which can be achieved in several ways, for various research advancements around fields not limited to distributed computing, supercomputing, and quantum computing. For example, a transistor-like atomtronic circuit may be realized by a ring-shaped trap divided into two by two moveable weak barriers, with the two separate parts of the ring acting as the drain and the source and the barriers acting as the gate. As the barriers move, atoms flow from the source to the drain. It is now possible to coherently guide matterwaves over distances of up to 40 cm in ring-shaped atomtronic matterwave guide measurement. == Applications == The field of atomtronics is still very nascent and any schemes realized thus far are proof-of-concept. Applications include: gravimetry rotational sensing via the Sagnac effect quantum computing Obstacles to the development of practical sensing devices are largely due to the technical challenges of creating Bose-Einstein condensates. They require bulky lab-based setups not easily suitable for transportation. However, creating portable experimental setups is an active area of research.

Attention inequality

Attention inequality is the inequality of distribution of attention across users on social networks, people in general, and for scientific papers. Yun Family Foundation introduced "Attention Inequality Coefficient" as a measure of inequality in attention and arguments it by the close interconnection with wealth inequality. == Relationship to economic inequality == Attention inequality is related to economic inequality since attention is an economically scarce good. The same measures and concepts as in classical economy can be applied for attention economy. The relationship develops also beyond the conceptual level—considering the AIDA process, attention is the prerequisite for real monetary income on the Internet. On data of 2018, a significant relationship between likes and comments on Facebook to donations is proven for non-profit organizations. == Attention economy == The attention economy refers to the practice of maximizing the attention users give to a product for advertising-related reasons. Attention economy remains one of the most common forms of advertising, and has been steadily increasing thanks to new technologies such as television, internet and social media. It is one of the most widely-used approaches to economy for its effectiveness for maximising the noticeability of a certain product. == Attention inequality in social media == In social media, attention inequality refers to the unequal distribution of users' attention on social media platforms. This means that instead of an equal distribution of attention, fewer sources receive a disproportionate share of attention, leaving many unnoticed. This phenomenon is possibly the result of social media algorithms, which are commonly designed to drive maximum engagement. This phenomenon is a large factor in the polarization and creation of echo-chambers. Social media algorithms tend to note content that is already performing well and display it to more users, while content that is equally engaging or well-made is not recommended to users. Posts that trigger strong emotions usually out-perform more "uncontroversial" content. When many users interact with the post, it signals the algorithm that the specific post drives engagement. The algorithm then tends to recommend that type of content to an exponential number of people, potentially outperforming "un-emotional" content. These factors, when combined, tend to create an unequal social media environment. == Attention inequality in science == According to a recent 2025 study about research inequality among scientists published in Information Processing and Management, scientific discourse is restricted to a small group of connected scientists, and is frequently not an accurate representation of the whole scientific community. Using citation-network analysis in the fields of nanoscience and chemical physics, the study claims that a group of connected scientists has a significant notability in the scientific community. The calculated connection strength between these scientists is estimated to be about 4.5, the study also says that these authors cite each other four times more often than would be predicted in a random network, whereas ordinary scientists that exist outside of this group only reach an estimated connection strength of 0.9. The study findings suggest that that scientific attention is not distributed by merit, but rather by the connectedness of the scientists involved in the research. == Extent == As data of 2008 shows, 50% of the attention is concentrated on approximately 0.2% of all hostnames, and 80% on 5% of hostnames. The Gini coefficient of attention distribution lay in 2008 at over 0.921 for such commercial domains names as ac.jp and at 0.985 for .org-domains. The Gini coefficient was measured on Twitter in 2016 for the number of followers as 0.9412, for the number of mentions as 0.9133, and for the number of retweets as 0.9034. For comparison, the world's income Gini coefficient was 0.68 in 2005 and 0.904 in 2018. More than 96% of all followers, 93% of the retweets, and 93% of all mentions are owned by 20% of Twitter. == Causes == At least for scientific papers, today's consensus states that inequality is unexplainable by variations of quality and individual talent. The Matthew effect plays a significant role in the emergence of attention inequality—those who already enjoy large amounts of attention get even more attention, and those who do not lose even more. Ranking algorithms based on relevance to the user have been found to alleviate the inequality of the number of posts across topics.

Cryptee

Cryptee is a privacy focused client-side encrypted and cross-platform productivity suite and data storage service. == History == Cryptee was founded in 2017, by John Ozbay, a cybersecurity researcher, commenter, and activist, to exclusively focus on providing a secure document editing service similar to Google Docs and Photos for everyone, with a particular focus on victims and survivors of domestic abuse, journalists and reporters. == Software == Users can write personal documents, notes, journals, store images, videos, and various kinds of other files. The source code of Cryptee is open source and publicly available to allow anyone to audit the service with ease, and help identify errors or potential vulnerabilities in a public and transparent manner. Cryptee has a few key features that differentiate it from other services in the industry, such as its Ghost Folders and Ghost Albums features, built specifically with victims and survivors of domestic abuse, journalists and reporters in mind. Cryptee allows users to hide (ghost) folders for plausible deniability also as known as deniable encryption in the field of cryptography and steganography, and ensure privacy even under coercion. === Features === Cryptee Docs' features include: To-do lists, Markdown support, KaTeX math and file attachments. cross-platform accessible, as it is a progressive web app. Bulk transfer from other note taking apps such as Evernote. Encrypted PDF and print-accurate (A4 and U.S. Letter paper-sized) text editing. Ability to edit docx files Cryptee Photos' features include: Ability to create slideshows. Ability to store original quality of photos. Ability to tag photos for organization. === Commercial strategy === The company's commercial strategy is focused on offering to its users an open source and transparent Photo Storage, Document Editor and Cloud Storage services without trackers or advertisements as it seeks to compete with Google Docs, Google Photos and similar services through its offerings. === Privacy === Cryptee utilizes zero-access storage to safe-keep all users' sensitive digital belongings. == Advocacy == === Lockdown mode === In July 2022, to fortify iPhones against the Pegasus Spyware, Apple announced a new, upcoming Lockdown Mode feature in iOS 16, welcomed by many experts. In the following weeks after Apple's announcement, in August 2022, the Founder and CEO of Cryptee, and privacy activist John Ozbay published their research detailing shortcoming of Apple's Lockdown Mode. They demonstrated that enabling Lockdown Mode makes it possible for all websites and online ads to be able to detect if users have Lockdown Mode enabled or not. This was due to the fact that disabling web fonts (an attack surface) was detectable by websites. === Confrontations against Apple === ==== On PWAs ==== In February 2024, Apple announced plans to kill progressive web apps on iOS devices in the EU, claiming it was to comply with the Digital Markets Act (DMA). The announcement was criticized as anti-competitive by many in the tech industry, including by Tim Sweeney, the CEO of Epic Games. In response, Cryptee started working together with Open Web Advocacy (OWA), an international not-for-profit digital rights group to advocate for the future of the open web, promote web browser choice on mobile operating systems through challenging Apple's anti-competitive third party browser engine ban, and to champion the use and equality of progressive web apps over native apps, by reaching out to the European Union's Digital Markets Act (DMA) team. To better understand the consequences of Apple's decision to kill web apps, the EU announced that they "seek to investigate Apple over cutting off web apps", and that they sent "requests for information to Apple and to app developers, who can provide useful information for our assessment". Apart from sending a response to the EU, Cryptee, along with the OWA, launched an open letter to Tim Cook, which in 48 hours, got thousands of signatories including European Parliament Members Karen Melchior and Patrick Breyer; and thousands of other developers and organizations from over 100 countries. Consequently, 24 hours later, Apple backed off, and reversed course on its plan to cut off progressive web apps in the EU. ==== Ozbay's representations ==== Following the events, eventually on March 18, 2024, Founder and CEO of Cryptee John Ozbay represented the Open Web Advocacy group in European Union's Digital Markets Act (DMA) hearing for Apple. At the hearing, OWA confronted Apple, accused Apple of "maliciously intending to undermine user choice", and stated that there was no defense for Apple's behavior. In response, according to the tech news outlet Ars Technica, Apple's spokesperson "seemed to dodge Ozbay's question". ==== Cooperation with the EU ==== Within a week of the hearing, the European Union announced a DMA non-compliance investigation against Apple and United States' Department of Justice filed an antitrust lawsuit against Apple. A few months later, on June 27, 2024, Cryptee, in cooperation with EDRi — an international advocacy group, along with Article 19 — a British international human rights organization, Privacy International, F-Droid, Free Software Foundation Europe, Guardian Project and others have submitted a comprehensive analysis to the European Commission about how Apple's plans to comply with the Digital Markets Act are insufficient. == Reviews == In a 2018 article, Wall Street Journal's MarketWatch reviewed Cryptee, articulating the fact that Cryptee offers zero-access storage for photos, files, documents and notes, and pointed out that: "Being based in Estonia puts Cryptee outside the “14 eyes jurisdiction,” an international surveillance alliance of European Union and North American countries, making it less likely it will be targeted with demands for data". In addition, the review highlighted Cryptee's Ghost Folders feature which ensures privacy even under coercion. In a 2019 article, Reclaim The Net named Cryptee as one of the "5 great privacy-focused Evernote alternatives to keep your notes safe", underlining that: "When it comes to security, this app is state of the art." and that "When making this app, the developers thought about every aspect of security and have taken every precaution to make it as secure as possible.". The review further underscored Cryptee's open-source nature, its strong encryption, and easy migration features. In a 2021 article, The Verge reviewed Cryptee, pointing out that Cryptee, based out of Europe, is one of the main photo storage service alternatives to Google Photos, and that it's their recommendation for users who are "concerned about privacy and like the idea of encryption" as Cryptee "offers to keep all your photos encrypted using AES-256". In a 2024 article, Beebom, enlisted Cryptee as one of the "7 best iCloud Photos Alternatives for iPhone and iPad", complimenting Cryptee's simplicity, its use of encryption to safeguard users' photos against hacking by not storing any unencrypted data. The article also provided further attention to Cryptee's additional features such as such as Ghost Albums, slideshows, easy-to-use drag and drop uploads, tagging and users' ability to store original-quality photos on Cryptee, concluding that Cryptee is "a safe bet if you are on the lookout for a privacy-centric iCloud Photos alternative".

Computer aided transceiver

Computer aided transceiver (CAT) is a non-generic serial protocol used by radio amateurs for (remotely) controlling a transceiver radio receiver equipment using a computer. Conventional transmitters are manually controlled and used to transmit voice using buttons, dials, etc. However, advances in electronics have come to market devices that can be controlled by a computer and allow digital modes such as packet radio and also the use of satellite tracking, because it can continuously change the device's frequency according to the Doppler effect. This is done by connecting a Radio receiver and a PC using a CAT interface and a CAT Program Additionally, CAT interfaces can also be used to position tracking antennas, in controllers. As a satellite moves overhead. A CAT interface is a piece of hardware that connects the PC and radio that provides a connection to allows the radio and the PC to communicate with each other. The CAT interface provides the signals to and fro via correct voltage levels and in the case of a Universal Serial Bus (USB) CAT interface it requires a "protocol" for communication but communication itself is down to the radio and the software on the PC. Software that may be called a CAT program allows a radio to be controlled through the PC. Changes made on the radio through user interactions on the CAT Program are (generally) shown on the PC's screen. The functionality of CAT equipment (software & interface) depends on the radio and what features the software writers included in the CAT software. Modern radio systems do have more CAT functionality If you run a logging program that supports CAT, then that software may take advantage of the CAT system by retrieving information from the radio to help fill in log details, such as the frequency that the contact was made. CAT is also useful on many radios where there are many sub-menus in the radios menu system, and many of the sub-menu items can be easily changed via the PC. On many HF radios, the CAT system is also used to program the memories on the radio, but you would need to use appropriate programming software. A CAT interface does not receive or transmit any DATA mode, that is the purpose of a DATA interface. Although, both may be used at the same time with the correct CAT Equipment. DATA modes, and getting audio to and from the PC is the function of a DATA interface. A completely different thing but it is easier and more useful when CAT and DATA are used at the same time. Wouldn't it be nice to have an interface that could operate Frequency-shift keying (FSK), Audio FSK (AFSK), (real) Morse Code (CW), with a CAT interface and its own sound card..... (eg. The DigiMaster Pro3).