LIVAC Synchronous Corpus

LIVAC is an uncommon language corpus that has been dynamically maintained since 1995. Unlike other existing corpora, LIVAC has adopted a rigorous and regular "Windows" approach in processing and filtering massive media texts from representative Chinese speech communities such as Beijing, Hong Kong, Macau, Taipei, Singapore, Shanghai, Guangzhou, and Shenzhen. The contents are thus deliberately repetitive in most cases, represented by textual samples drawn from editorials, local and international news, cross-Taiwan Strait news, as well as news on finance, sports and entertainment. By 2023, more than 3 billion characters of news media texts had been filtered, of which 700 million characters had been processed and analyzed, yielding an expanding Pan-Chinese dictionary of 2.5 million words from the Pan-Chinese printed media. Through rigorous analysis based on computational linguistic methodology, LIVAC has at the same time accumulated a large amount of accurate and meaningful statistical data on the Chinese language and on its diverse speech communities in the Pan-Chinese context, and the results show considerable and important long-standing as well as evolving variations. The "Windows" approach is the most innovative feature of LIVAC and has enabled Pan-Chinese media texts to be quantitatively analyzed according to various attributes such as location, time and subject domain. Thus, various types of comparative studies and applications in information technology, as well as the development of related innovative applications, have been possible. Moreover, LIVAC has allowed longitudinal developments to be taken into account, facilitating Key Word in Context (KWIC) search and comprehensive study of target words and their underlying concepts as well as linguistic structures over the past 25 years, based on the above-mentioned variables of location, time and subject.
Results from the extensive and accumulative data analysis contained in LIVAC have enabled the cultivation of textual databases of proper names, place names, organization names, new words, and bi-weekly and annual rosters of media figures. Related applications have included the establishment of verb and adjective databases, the formulation of sentiment indices, and related opinion mining, to measure and compare the popularity of global media figures in the Chinese media (LIVAC Annual Pan-Chinese Celebrity Rosters, later renamed as the Pan-Chinese Newsmaker Rosters). Notable among these are the decades-long periodic reviews of the 25 years of annual pan-Chinese rosters since 2000 and the compilation of new word databases (LIVAC Annual Pan-Chinese New Word Rosters). On this basis, the analysis of the emergence, diffusion and transformation of new words, and the publication of dictionaries of neologisms, have been made possible. Recent focuses include the relative balance between disyllabic words and the growing number of trisyllabic words in the Chinese language, the comparative study of light verbs in three Chinese speech communities, and language use as a reflection of epochal change in China. LIVAC version 3.1 was launched in February 2024.
== Corpus data processing ==
* Accessing media texts, manual input, etc.
* Text unification, including conversion from simplified to traditional Chinese characters, stored as Big5 and Unicode versions
* Automatic word segmentation
* Automatic alignment of parallel texts
* Manual verification, part-of-speech tagging
* Extraction of words and addition to regional sub-corpora
* Combination of regional sub-corpora to update the LIVAC corpus and master lexical database
== Labeling for data curation ==
* Categories used include general terms and proper names, such as: general names, surnames, semi titles; geographical, organizations and commercial entities, etc.; time, prepositions, locations, etc.; stack-words; loanwords; case-word; numerals, etc.
* Construction of databases of proper names, place names, and specific terms, etc.
* Generation of rosters: "new word rosters", "celebrity or media personality rosters", "place name rosters", compound words and matched words
* Other part-of-speech tagging for sub-databases, such as common nouns, numerals, numeral classifiers, different types of verbs and of adjectives, pronouns, adverbs, prepositions, conjunctions, particles marking mood, onomatopoeia, interjections, etc.
== Applications ==
* Compilation of Pan-Chinese dictionaries or local dictionaries
* Information technology research, such as predictive Chinese text input for mobile phones, automatic speech-to-text conversion, and opinion mining
* Comparative studies on linguistic and cultural developments in the Pan-Chinese regions, especially in a critical period of history in modern China
* Language teaching and learning research, and speech-to-text conversion
* Customized service on linguistic research and lexical search for international corporations and government agencies
The above applications are provided by the following functions: Word Segmentation Search Phrase Search Example Sentence Selection Multi-word Comparison Word Cloud
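As a rough illustration of the Key Word in Context (KWIC) search mentioned above, the following Python sketch lists each occurrence of a target word with its neighbouring words, optionally restricted to one speech community. It is only a minimal sketch and not LIVAC's implementation: the `Sample` record, the `kwic` function, and the tiny `SAMPLES` list are hypothetical, and the texts are assumed to be already segmented into words.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Sample:
    """One pre-segmented text sample (hypothetical record layout)."""
    region: str
    year: int
    tokens: List[str]


def kwic(
    samples: List[Sample],
    keyword: str,
    width: int = 2,
    region: Optional[str] = None,
) -> Iterator[Tuple[str, int, str, str, str]]:
    """Yield (region, year, left context, keyword, right context) per hit."""
    for sample in samples:
        # Optional filter on the location attribute.
        if region is not None and sample.region != region:
            continue
        for i, token in enumerate(sample.tokens):
            if token == keyword:
                left = " ".join(sample.tokens[max(0, i - width):i])
                right = " ".join(sample.tokens[i + 1:i + 1 + width])
                yield sample.region, sample.year, left, token, right


# Invented toy data, for illustration only.
SAMPLES = [
    Sample("Hong Kong", 2020, ["香港", "股市", "今日", "上升"]),
    Sample("Shanghai", 2021, ["上海", "股市", "下跌"]),
]

if __name__ == "__main__":
    for hit in kwic(SAMPLES, "股市"):
        print(hit)
```

Keeping region and year on every hit is what would allow the same query to be compared across locations and over time, in the spirit of the location, time and subject variables described above.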

Pronunciation assessment

Automatic pronunciation assessment uses computer speech recognition to determine how accurately speech has been pronounced, instead of relying on a human instructor or proctor. It is also called speech verification, pronunciation evaluation, and pronunciation scoring. This technology is used to grade speech quality, for language testing, for computer-aided pronunciation teaching (CAPT) in computer-assisted language learning (CALL), for speaking skill remediation, and for accent reduction. Pronunciation assessment is different from dictation or automatic transcription, because instead of determining unknown speech, it verifies learners' pronunciation of known word(s), often from prior transcription of the same utterance; ideally scoring the intelligibility of the learners' speech. Sometimes pronunciation assessment evaluates the prosody of the learners' speech, such as intonation, pitch, tempo, rhythm, and syllable and word stress, although those are usually not essential for being understood in most languages. Pronunciation assessment is also used in reading tutoring, for example in products from Google, Microsoft, and Amira Learning. Automatic pronunciation assessment can also be used to help diagnose and treat speech disorders such as apraxia. == Intelligibility == Intelligibility refers to how well a learner's utterance is understood by a listener, rather than how much it sounds like a native speaker. This is separate from measures of fluency, such as so-called "Goodness of Pronunciation" (GoP) scores, which estimate how closely an utterance aligns with those of native speakers. Intelligibility is widely regarded as the most important communicative goal in pronunciation teaching and assessment. For example, in the Common European Framework of Reference for Languages (CEFR) assessment criteria for "overall phonological control", intelligibility outweighs formally correct pronunciation at all levels. 
Studies in applied linguistics have shown that accent reduction does not always increase intelligibility because listeners can often comprehend heavily accented speech without difficulty. Pronunciation assessment systems often rely on acoustic methods such as GoP which compare learner speech to reference models to produce phoneme-level scores, which are in turn aggregated to produce word and phrase scores. While these methods are effective for identifying deviations from native speakers' utterances, they do not effectively measure how understandable speech is to human listeners. Intelligibility is influenced by broader linguistic and contextual factors such as stress placement, speech rate, and coarticulation, which are not represented in purely segmental scores. The earliest work on pronunciation assessment avoided measuring genuine listener intelligibility, a shortcoming corrected in 2011 at the Toyohashi University of Technology, and included in the Versant high-stakes English fluency assessment from Pearson and mobile apps from 17zuoye Education & Technology, but still missing in 2023 products from Google Search, Microsoft, Educational Testing Service, Speechace, and ELSA. Assessing authentic listener intelligibility is essential for avoiding inaccuracies from accent bias, especially in high-stakes assessments; from words with multiple correct pronunciations; and from phoneme coding errors in machine-readable pronunciation dictionaries. In 2022, researchers found that some newer speech-to-text systems, based on end-to-end reinforcement learning to map audio signals directly into words, produce word and phrase confidence scores (from 10-25ms audio frame logit aggregation) closely correlated with genuine listener intelligibility. Others have been able to assess intelligibility using Levenshtein or dynamic time warping distance measures from Wav2Vec2 representation of good speech. Further work through 2025 has focused specifically on measuring intelligibility. 
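To make the contrast with segmental scoring concrete, the following Python sketch estimates intelligibility as the share of a known prompt that survives in a transcript of what was heard, using a word-level Levenshtein distance. It is a simplified illustration and not the method of any system or study named above: the function names are hypothetical, and the transcript is assumed to come from a listener or a speech recognizer that did not see the prompt.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_intelligibility(prompt: str, transcript: str) -> float:
    """Return 1 - (word errors / prompt length), clamped to [0, 1]."""
    ref = prompt.lower().split()
    hyp = transcript.lower().split()
    if not ref:
        raise ValueError("prompt must contain at least one word")
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


if __name__ == "__main__":
    # One of four prompt words was misheard.
    print(word_intelligibility("the quick brown fox", "the quick crown fox"))
```

Because the score depends only on what was understood, an accented but clearly understood utterance scores as well as a native-like one, which is the distinction the section draws.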
A 2025 study of 42 pronunciation and speech coaching apps (32 mobile and 10 web) found that none offered intelligibility assessment. Instead, most provided only segmental and accent-focused scoring. About two-thirds of the apps provided some form of specific pronunciation feedback, usually with phonetic transcriptions, but accompanied by visual cues (such as animations of the vocal tract or the lips and tongue from the front) in only about 5% of the apps. Less than a third provided feedback on learner perception of exemplar speech. == Evaluation == Although there are as yet no industry-standard benchmarks for evaluating pronunciation assessment accuracy, researchers occasionally release evaluation speech corpuses for others to use for improving assessment quality. Such evaluation databases often emphasize formally unaccented pronunciation to the exclusion of genuine intelligibility evident from blinded listener transcriptions. As of mid-2025, state of the art approaches for automatically transcribing phonemes typically achieve an error rate of about 10% from known good speech. The International Speech Communication Association (ISCA) 2025 Workshop on Speech and Language Technology in Education (SLaTE) administered a Speak & Improve Challenge: Spoken Language Assessment and Feedback, introducing benchmarks for evaluating pronunciation assessment and remediation systems across languages, accents, and learner populations. The challenge emphasized cross-lingual generalization and alignment with human intelligibility judgments, for more robust and interpretable assessment systems. Ethical issues in pronunciation assessment are present in both human and automatic methods. Authentic validity, fairness, and mitigating bias in evaluation are all crucial. Diverse speech data should be included in automatic pronunciation assessment models. 
Combining human judgments, especially blinded transcriptions from a wide diversity of listeners, with automated feedback can improve accuracy and fairness. Second language learners benefit substantially from their use of widely available speech recognition systems for dictation, virtual assistants, and AI chatbots. In such systems, users naturally try to correct their own errors evident in speech recognition results that they notice. Such use improves their grammar and vocabulary development along with their pronunciation skills. The extent to which explicit pronunciation assessment and remediation approaches improve on such self-directed interactions remains an open question. Similarly, automatic dictation results have been shown to reflect intelligibility about as well as human scorers. == Recent developments == During 2021–22, a smartphone-based CAPT system was used to sense articulation through both audible and inaudible signals, providing feedback at the phoneme level. Some promising areas for improvement which were being developed in 2024 include articulatory feature extraction and transfer learning to suppress unnecessary corrections. Other interesting advances under development include "augmented reality" interfaces for mobile devices using optical character recognition to provide pronunciation training on text found in user environments. In 2024, audio multimodal large language models were first described as assessing pronunciation. That work has been carried forward by other researchers in 2025 who report positive results. Subsequently, researchers demonstrated pronunciation scoring by providing a language model with textual descriptions of speech, including the speech-to-text transcript, phoneme sequences, pauses, and phoneme sequence matching; this approach can achieve performance similar to multimodal LLMs that analyze raw audio while avoiding their higher computational cost. 
In 2025, the Duolingo English Test authors published a description of their pronunciation assessment method, purportedly built to measure intelligibility rather than accent imitation. While achieving a correlation of 0.82 with expert human ratings, very close to inter-rater agreement and outperforming alternative methods, the method is nonetheless based on experts' scores along the six-point CEFR common reference levels scale, instead of actual blinded listener transcriptions. Further promising work in 2025 includes assessment feedback aligning learner speech to synthetic utterances using interpretable features, identifying continuous spans of words for remediation feedback; synthesizing corrected speech matching learners' self-perceived voices, which they prefer and imitate more accurately as corrections; and streaming such interactions. On January 21, 2026, Educational Testing Service's TOEFL iBT high-stakes English language test, required by US university admissions and employers from English as a foreign language applicants more often than all other internet-based tests combined, changed its speaking assessments. While official rubrics claim that the new scoring will be based primarily on intelligibility, the new test's technical description indicates that it ju

Open Sound Control

Open Sound Control (OSC) is a protocol for networking sound synthesizers, computers, and other multimedia devices for purposes such as musical performance or show control. OSC's advantages include interoperability, accuracy, flexibility and enhanced organization and documentation. Its disadvantages include higher bandwidth requirements, increased load on embedded processors, and lack of standardized messages/interoperability. The first specification was released in March 2002.
== Motivation ==
OSC is a content format developed at CNMAT by Adrian Freed and Matt Wright, comparable to XML, WDDX, or JSON. It was originally intended for sharing music performance data (gestures, parameters and note sequences) between musical instruments (especially electronic musical instruments such as synthesizers), computers, and other multimedia devices. OSC is sometimes used as an alternative to the 1983 MIDI standard, when higher resolution and a richer parameter space are desired. OSC messages are transported across the internet and within local subnets using UDP/IP and Ethernet. OSC messages between gestural controllers are usually transmitted over serial endpoints of USB wrapped in the SLIP protocol.
== Features ==
OSC's main features, compared to MIDI, include:
* Open-ended, dynamic, URI-style symbolic naming scheme
* Symbolic and high-resolution numeric data
* Pattern matching language to specify multiple recipients of a single message
* High resolution time tags
* "Bundles" of messages whose effects must occur simultaneously
== Applications ==
There are dozens of OSC applications, including real-time sound and media processing environments, web interactivity tools, software synthesizers, programming languages and hardware devices. OSC has achieved wide use in fields including musical expression, robotics, video performance interfaces, distributed music systems and inter-process communication. The TUIO community standard for tangible interfaces such as multitouch is built on top of OSC.
Similarly, the GDIF system for representing gestures integrates OSC. OSC is used extensively in experimental musical controllers, and has been built into several open source and commercial products. The Open Sound World (OSW) music programming language is designed around OSC messaging. OSC is at the heart of the DSSI plugin API, an evolution of the LADSPA API, which uses it to let a plugin's GUI interact with the core of the plugin by messaging the plugin host. LADSPA and DSSI are APIs dedicated to audio effects and synthesizers. In 2007, a standardized namespace within OSC called SYN, for communication between controllers, synthesizers and hosts, was proposed.
== Design ==
OSC messages consist of an address pattern (such as /oscillator/4/frequency), a type tag string (such as ,fi for a float32 argument followed by an int32 argument), and the arguments themselves (which may include a time tag). Address patterns form a hierarchical name space, reminiscent of a Unix filesystem path, or a URL, and refer to "Methods" inside the server, which are invoked with the attached arguments. Type tag strings are a compact string representation of the argument types. Arguments are represented in binary form with four-byte alignment. The core types supported are:
* 32-bit two's complement signed integers
* 32-bit IEEE floating point numbers
* Null-terminated arrays of eight-bit encoded data (C-style strings)
* Arbitrary-sized blobs (e.g. audio data, or a video frame)
An example message is included in the spec (with null padding bytes represented by ␀): /oscillator/4/frequency␀,f␀␀ followed by the 4-byte float32 representation of 440.0: 0x43dc0000.
Messages may be combined into bundles, which themselves may be combined into bundles, etc. Each bundle contains a timestamp, which determines whether the server should respond immediately or at some point in the future. Applications commonly employ extensions to this core set of types.
More recently some of these extensions such as a compact Boolean type were integrated into the required core types of OSC 1.1. The advantages of OSC over MIDI are primarily internet connectivity; data type resolution; and the comparative ease of specifying a symbolic path, as opposed to specifying all connections as seven-bit numbers with seven-bit or fourteen-bit data types. This human-readability has the disadvantage of being inefficient to transmit and more difficult to parse by embedded firmware, however. The spec does not define any particular OSC Methods or OSC Containers. All messages are implementation-defined and vary from server to server.
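The message layout described in the Design section can be sketched in a few lines of Python using only the standard `struct` module. This is an illustrative encoder, not a complete OSC implementation: the function names are hypothetical, and the big-endian byte order, the size prefix on blobs and bundle elements, and the time tag value 1 meaning "immediately" are assumptions drawn from the OSC 1.0 specification rather than from the text above.

```python
import struct


def osc_string(text: str) -> bytes:
    """Null-terminate a string and pad it to a multiple of four bytes."""
    raw = text.encode("ascii")
    return raw + b"\x00" * (4 - len(raw) % 4)


def osc_blob(data: bytes) -> bytes:
    """Encode a blob: int32 byte count, the data, zero padding to 4 bytes."""
    return struct.pack(">i", len(data)) + data + b"\x00" * (-len(data) % 4)


def osc_message(address: str, *args) -> bytes:
    """Encode an address pattern, a type tag string, and the four core types."""
    tags = ","
    payload = b""
    for arg in args:
        if isinstance(arg, bool):
            raise TypeError("booleans are not one of the four core types")
        if isinstance(arg, int):
            tags += "i"
            payload += struct.pack(">i", arg)   # 32-bit signed integer
        elif isinstance(arg, float):
            tags += "f"
            payload += struct.pack(">f", arg)   # 32-bit IEEE float
        elif isinstance(arg, str):
            tags += "s"
            payload += osc_string(arg)
        elif isinstance(arg, (bytes, bytearray)):
            tags += "b"
            payload += osc_blob(bytes(arg))
        else:
            raise TypeError(f"unsupported argument type: {type(arg).__name__}")
    return osc_string(address) + osc_string(tags) + payload


IMMEDIATELY = 1  # assumed special time tag value (63 zero bits, then a one)


def osc_bundle(timetag: int, *elements: bytes) -> bytes:
    """Wrap already-encoded messages (or bundles) with a 64-bit time tag."""
    out = osc_string("#bundle") + struct.pack(">Q", timetag)
    for element in elements:
        out += struct.pack(">i", len(element)) + element
    return out


if __name__ == "__main__":
    message = osc_message("/oscillator/4/frequency", 440.0)
    print(message.hex(" ", 4))
    print(len(osc_bundle(IMMEDIATELY, message)))
```

Encoding the example from the spec this way yields the 24-byte padded address, the 4-byte padded type tag string `,f`, and the four bytes 0x43dc0000, for 32 bytes in total. The result could be sent as a single UDP datagram, for example with `socket.sendto`.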

Boba liberal

Boba liberal is a term mostly used within the Asian diaspora communities in the West, especially in the United States. It describes someone of East or Southeast Asian descent living in the West who has a shallow, surface-level liberal outlook. It is also occasionally used to describe conservatives who weaponize their East or Southeast Asian identity. The neologism emerged among the Asian American leftist community on Twitter who accused "boba liberals" of only holding their liberal beliefs to appear more white-adjacent by engaging in progressive social movements or viewpoints, while at the same time disregarding and trivializing issues concerning Asians. Mary Chao, writing for The North Jersey Record, said that "Asians call peers boba liberals when they aspire to liberal whiteness." An article in The Yale Herald described it as a term "used to describe the ethnocentric politics of Asian Americans, usually of East Asian descent, who exclusively advocate for issues that benefit themselves, without acknowledging problematic dimensions of their own history and working to support other people of color." The feminist magazine Fem said that "the faces of boba liberalism are Asian Americans that are part of the middle and upper economic class. As a result, boba liberals disregard the negative effects of capitalism because they profit from it. For instance, boba liberals tend to focus on advocating for Asian representation in white spaces, or discussing whether or not wearing chopsticks in one's hair is culture appropriation. These topics are popular within boba liberal circles, all while dialogue regarding inequality, globalization, and racial injustice are purposely neglected." 
UnHerd notes that conservative Asian Americans have used the term not to critique capitalism, but to "aim at a small but influential group of progressive Asian-American activists who are supposedly selling out other Asians, especially working-class Asians, in order to win brownie points from elite, generally white liberals." MRAsians have similarly used the term to attack Asian American feminists who supported the Black Lives Matter movement. The Asian identity of boba liberals has often been accused of being shallow and superficial. Boba liberals are accused of using surface-level stereotypical Asian traits such as liking boba tea to bolster their Asian credentials. Plan A Magazine, an Asian diaspora magazine, described the film Crazy Rich Asians and the sitcom Fresh Off the Boat as "boba liberal media", calling them the result of "a specific kind of atomized identity politics". Other media outlets have connected the Crazy Rich Asians film to boba liberalism. == Controversy == The term "boba liberal" was coined in 2019 by Vietnamese American Twitter user Redmond (@diaspora_is_red) to analyze a form of Asian American liberalism through a Marxist lens. Redmond has criticized the misappropriation of their neologism by stripping away the Marxist framework by failing to discuss "socialism, communism, the capitalist system, imperialism, and the diaspora bourgeoisie" and conflating "boba liberalism" with the flawed concept of "East Asian privilege". In 2024, Redmond criticized misuse of the term by conservatives and liberals, and said "The term boba liberalism can go away for all I care. It's corny and stale". === United States === One commentator described boba liberals as supporting policies that primarily benefit upper-income Asian-Americans, and not necessarily the Asian-American community as a whole. 
Thus, although the term contains the word "liberal", it is not exclusive to one ideology: it may also extend to conservative-aligned Asians in some areas, who would often take advantage of the "model minority" label by defending such measures.

Full30

Full30 was an American online video-sharing platform primarily dedicated to firearms and shooting sports-related content. The service was established in 2014 by Tim Harmsen and Mark Hammonds as a result of YouTube's increasing restrictions on gun-related videos. == History == After the 2018 Parkland high school shooting, many companies attempted to distance themselves from any association with the firearms industry. As a result, YouTube began demonetizing and sometimes outright deleting firearms-related videos, and in one case, popular YouTube poster Hickok45's channel was completely deleted but later restored. In response, Harmsen, who operates the Military Arms Channel on YouTube, decided to create his own video-hosting website to allow himself and other firearms content creators a platform free from such restrictions; he named the website Full30 — a reference to the popular 30-round STANAG magazine. In July 2020, site representatives announced the site had new ownership. By the end of 2022, the site began to be redirected to a series of other websites. By 2025, it was largely deactivated with the front page replaced by a form to be filled out to receive "updates", with no other explanation. == Contributors == Hickok45 Military Arms Channel Forgotten Weapons Bavarian Shooter Liberty Doll CloverTac

VOCEDplus

VOCEDplus is a free international research database about tertiary education, maintained and developed by staff at the National Centre for Vocational Education Research (NCVER) in Adelaide, South Australia. The focus of the database content is the relation of post-compulsory education and training to workforce needs, skills development, and social inclusion.
== Structure ==
The content of the VOCEDplus database encompasses vocational education and training (VET), higher education, lifelong learning, informal learning, VET in schools, adult and community education, apprenticeships/traineeships, international education, providers of education and training, and workforce development. It is international in scope and contains over 84,000 English language records, many with links to full text documents. VOCEDplus contains extensive Australian materials and includes a wide range of international information, covering outcomes of tertiary education in the shape of published research, practice, policy, and statistics. Entries are included for the following types of publications: reports; annual reports; papers; discussion papers; occasional papers; working papers; books; book chapters; conference papers; conference proceedings; journals; journal articles; policy documents; published statistics; theses; podcasts; and teaching and training materials. Each database entry contains standard bibliographic information and an abstract. Many entries include full text access via the publisher's website or a digitised copy.
== History ==
=== 1989-1997 ===
In the early years VOCEDplus was known as VOCED. The original database was produced by a network of clearinghouses across Australia with the aim of sharing activities in the technical and further education (TAFE) sector. VOCED was produced in hardcopy and an electronic version was distributed on diskette.
=== 1997-2001 === 1997 - the first web version of VOCED was made available from the National Centre for Vocational Education Research (NCVER) organisational website 1998 - a major project to upgrade the database and expand its international coverage commenced 2001 - creation of VOCED's own website 2001 - VOCED endorsed as the UNESCO international database for technical and vocational education and training (TVET) research information === 2001-2009 === Many changes to the database and website occurred during this period with a focus on continuous improvement to meet the needs of users and utilise emerging technologies. 2006 - materials produced for two adult literacy and learning programs funded by the Australian Department of Education, Employment and Workplace Relations (DEEWR) - the Workplace English Language and Learning (WELL) Programme and the Adult Literacy National Project (ALNP) included in VOCED 2007 - the Australian clearinghouse network transferred most of the hardcopy collections to NCVER, to form a centralised repository of resources 2009 - materials produced by Reframing the Future (RTF) a vocational education and training workforce development initiative of the Australian, State and Territory Governments included in VOCED === 2009-2014 === A major rebuild of the database and website was undertaken during this period to take advantage of the potential of new technologies to provide improved services and incorporate Web 2.0 technologies (RSS feeds, and share and bookmark tools). 
2009 - scope expanded to more fully encompass the higher education sector 2011 - launch of VOCEDplus with the name change representing the enhanced features and extended focus 2012 - a major retrospective digitisation project commenced and by the end of the 2012-2013 financial year a total of 9,328 publications (593,534 pages/microfiche frames) had been digitised, ensuring these publications are available electronically for free === 2014-2019 === A number of significant curated content products were released during this period. 2015 - release of a refreshed look to adopt the new NCVER branding plus a number of search enhancements (Guided search, Expert search, and Glossary search) were added 2015 - first in the series of 'Focus on...' pages released 2016 - launch of the 'Pod Network', a convenient and efficient platform that allows instant access to research and a multitude of resources on a range of subjects 2017 - completion of the 'Pod Network', consisting of 20 Pods (on broad subjects including Apprenticeships and traineeships, Foundation skills, Teaching and learning, Career development, and Students) and 74 Podlets (on narrow topics including Online learning, Social media, VET in schools, STEM skills, and Adult literacy) 2018 - launch of the 'Timeline of Australian VET Policy Initiatives' and the 'VET Knowledge Bank' which contains a suite of products capturing Australia's diverse, complex and ever-changing VET system 2019 - after an internal review, a refreshed, streamlined version of the 'Pod Network' was released, consisting of 13 Pods and 20 Podlets 2019 - launch of the 'VET Practitioner Resource' which contains a range of information to support VET practitioners in their work and is organised into three sections: (1) Teaching, training and assessment: standards, guidance, research and good practice resources to inform daily work; (2) Practitioners as researchers: information for undertaking practitioner-led research; and (3) The VET workforce: 
information about VET teachers and trainers, and the professional development needs of the VET workforce 2019 - VOCEDplus celebrated 30 years of providing information to the tertiary education sector and the homepage was refreshed to make it more modern and easier to use === 2020- === VOCEDplus continued to be accessible throughout the COVID-19 pandemic. 2020-2021 - the VET Knowledge Bank added a dedicated page, 'COVID-19 announcements', that showcases the measures introduced by the Australian, state and territory governments to mitigate the impact of the pandemic and promote economic recovery 2020-2024 - published research about the effects of the pandemic on education and training, providers, students, labour markets, employment and employees was collected and made permanently available in the database 2024 - VOCEDplus celebrated 35 years of providing information to the tertiary education sector. The homepage was refreshed and a number of enhancements and new features were implemented including a new My Profile feature, improvements to My Selection, accessible search history and saved searches, enhanced search functionality, and improved navigation.

Mini-STX

Mini-STX (mSTX, Mini Socket Technology EXtended, originally "Intel 5x5") is a computer motherboard form factor released by Intel in 2015. These motherboards measure 147 mm by 140 mm (5.8" x 5.5"), making them larger than "4x4" NUC (102 x 102 mm / 4.01" x 4.01") and Nano-ITX (120 x 120 mm / 4.7" x 4.7") boards, but notably smaller than the more common Mini-ITX (170 x 170 mm / 6.7" x 6.7") boards. Unlike these standards, which use a square shape, the Mini-STX form factor is 7 mm longer from front to rear, making it slightly rectangular.
== Mini-STX design elements ==
The Mini-STX design suggests (but does not require) support for:
* Socketed processors (e.g. LGA or PGA CPUs)
* Onboard power regulation circuitry, enabling direct DC power input
* IO ports embedded on the front and rear of the motherboard (akin to NUC, but unlike typical motherboards, which often use headers instead to connect built-in ports on enclosures)
== Adoption by manufacturers ==
This motherboard form factor is not in particularly common use among consumer-PC manufacturers, although there are a few offerings: ASRock offers both DeskMini kits (that use mini-STX boards) and standalone motherboards, Asus offers VivoMini kits (that use mini-STX boards) and standalone motherboards, Gigabyte offers a few motherboards, and industrial PC suppliers (e.g. Kontron, Iesy, ASRock Industrial) also provide some options for mini-STX equipment.
== Derivatives ==
ASRock developed a derivative of mini-STX, dubbed micro-STX, for their 'DeskMini GTX/RX' small form-factor PCs and industrial motherboards. Micro-STX adds an MXM slot, which allows the use of special PCI Express expansion cards, including graphics or machine learning accelerators, but widens the board by about two inches, resulting in measurements of 147 x 188 mm (5.8" x 7.4").