AI Analysis X Ray

AI Analysis X Ray — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

Linux Trace Toolkit

The Linux Trace Toolkit (LTT) is a set of tools that is designed to log program execution details from a patched Linux kernel and then perform various analyses on them, using console-based and graphical tools. LTT has been mostly superseded by its successor LTTng (Linux Trace Toolkit Next Generation). LTT allows the user to see in-depth information about the processes that were running during the trace period, including when context switches occurred, how long the processes were blocked for, and how much time the processes spent executing vs. how much time the processes were blocked. The data is logged to a text file and various console-based and graphical (GTK+) tools are provided for interpreting that data. In order to do data collection, LTT requires a patched Linux kernel. The authors of LTT claim that the performance hit for a patched kernel compared to a regular kernel is minimal; Their testing has reportedly shown that this is less than 2.5% on a "normal use" system (measured using batches of kernel makes) and less than 5% on a file I/O intensive system (measured using batches of tar). == Usage == === Collecting trace data === Data collection is Started by: trace 15 foo This command will cause the LTT tracedaemon to do a trace that lasts for 15 seconds, writing trace data to foo.trace and process information from the /proc filesystem to foo.proc. The trace command is actually a script which runs the program tracedaemon with some common options. It is possible to run tracedaemon directly and in that case, the user can use a number of command-line options to control the data which is collected. For the complete list of options supported by tracedaemon, see the online manual page for tracedaemon. === Viewing the results === Viewing the results of a trace can be accomplished with: traceview foo This command will launch a graphical (GTK+) traceview tool that will read from foo.trace and foo.proc. This tool can show information in various interesting ways, including Event Graph, Process Analysis, and Raw Trace. The Event Graph is perhaps the most interesting view, showing the exact timing of events like page faults, interrupts, and context switches, in a simple graphical way. The traceview command is a wrapper for a program called tracevisualizer. For the complete list of options supported by tracevisualizer, see the online manual page for tracevisualizer.
Read more →
Best AI Video Editors in 2026

Shopping for the best AI video editor? An AI video editor is software that uses machine learning to help you get more done — it keeps getting smarter as the underlying models improve. Pricing, accuracy, and the size of the model behind the tool are the three factors that most affect daily usefulness. Whether you are a beginner or a pro, the right AI video editor slots into your workflow and pays for itself fast. Below we compare features, pricing, and real output so you can choose with confidence.
Read more →
AI Marketing Tools Reviews: What Actually Works in 2026

In search of the best AI marketing tool? An AI marketing tool is software that uses machine learning to help you get more done — it turns a rough idea into a polished result in seconds. When choosing one, weigh output quality, pricing, export formats, and how well it fits the tools you already use. Whether you are a beginner or a pro, the right AI marketing tool slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.
Read more →
Top 10 AI Headshot Generators Compared (2026)

Trying to pick the best AI headshot generator? An AI headshot generator is software that uses machine learning to help you get more done — it scales effortlessly from a single task to thousands. The best picks balance beginner-friendly simplicity with the depth power users need, and they ship updates often. Whether you are a beginner or a pro, the right AI headshot generator slots into your workflow and pays for itself fast. Read on for hands-on impressions, pricing tiers, and the standout features that matter.
Read more →
WaveMaker

WaveMaker is a Java-based low-code development platform designed for building software applications and platforms. The company, WaveMaker Inc., is based in Mountain View, California. The platform is intended to assist enterprises in speeding up their application development and IT modernization initiatives through low-code capabilities. Additionally, for independent software vendors (ISVs), WaveMaker serves as a customizable low-code component that integrates into their products. The WaveMaker Platform is a licensed software platform allowing organizations to establish their own end-to-application platform-as-a-service (PaaS) for the creation and operation of custom apps. It allows developers and business users to create apps that are customizable. These applications can seamlessly consume APIs, visualize data, and automatically adapt to multi-device responsive interfaces. WaveMaker's low-code platform allows organizations to deploy applications on either public or private cloud infrastructure. Containers can be deployed on top of virtual machines or directly on bare metal. The software features a graphical user interface (GUI) console for managing IT app infrastructure, leveraging the capabilities of Docker containerization. The solution offers functionalities for automating application deployment, managing the application lifecycle, overseeing release management, and controlling deployment workflows and access permissions: Apps for web, tablet, and smartphone interfaces Enterprise technologies like Java, Hibernate, Spring, AngularJS, JQuery Docker-provided APIs and CLI Software stack packaging, container provisioning, stack and app upgrading, replication, and fault tolerance == WaveMaker Studio == WaveMaker RAD Platform is built around WaveMaker Studio, a WYSIWYG rapid development tool that allows business users to compose an application using a drag-and-drop method. WaveMaker Studio supports rapid application development (RAD) for the web, similar to what products like PowerBuilder and Lotus Notes provided for client-server computing. WaveMaker Studio allows developers to produce an application once, then automatically adjust it for a particular target platform, whether a PC, mobile phone, or tablet. Applications created using the WaveMaker Studio follow a model–view–controller architecture. WaveMaker Studio has been downloaded more than two million times. The Studio community consists of 30,000 registered users. Applications generated by WaveMaker Studio are licensed under the Apache license. Studio 8 was released on September 25, 2015. The prior version, Studio 7, has some notable development milestones. It was based on AngularJS framework, previous Studio versions (6.7, 6.6, 6.5) use the Dojo Toolkit. Some of the features WaveMaker Studio 7 include: Automatic generation of Hibernate mapping, and Hibernate queries from database schema import. Automatic creation of Enterprise Data Widgets based on schema import. Each widget can display data from a database table as a grid or edit form. Edit form implements create, update, and delete functions automatically. WYSIWYG Ajax development studio runs in a browser. Deployment to Tomcat, IBM WebSphere, Weblogic, JBoss. Mashup tool to assemble web applications based on SOAP, REST and RSS web services, Java Services and databases. Supports existing CSS, HTML and Java code. The ability to deploy a standard Java .war file. == Technologies and frameworks == WaveMaker allows users to build applications that run on "Open Systems Stack" based on the following technologies and frameworks: AngularJS, Bootstrap, NVD3, HTML, CSS, Apache Cordova, Hibernate, Spring, Spring Security, Java. The various supported integrations include: Databases: Oracle, MySQL, Microsoft SQL Server, PostgreSQL, IBM DB2, HSQLDB Authentication: LDAP, Active Directory, CAS, Custom Java Service, Database Version Control: Bitbucket (or Stash), GitHub, Apache Subversion Deployment: Amazon AWS, Microsoft Azure, WaveMaker Private Cloud (Docker containerization), IBM Web Sphere, Apache Tomcat, SpringSource tcServer, Oracle WebLogic Server, JBoss(WildFly), GlassFish App Stores: Google Play, Apple App Store, Windows Store == History == In 2003, WaveMaker was founded as ActiveGrid. Then, in 2007, it was rebranded as Wavemaker. It was acquired by VMware in 2011. In March 2013, support for the WaveMaker project was discontinued. In May 2013, Pramati Technologies acquired the assets of WaveMaker. In February 2014, Wavemaker Studio 6.7 was released, which was the last open source version of Studio. In September 2014 WaveMaker Inc. launched the WaveMaker RAD Platform, which allowed organizations to run their own application platform for building and running apps. In March 2023, WaveMaker released version 11.5, which includes enhanced low-code development capabilities and new AI-driven tools to streamline the application development process.
Read more →
AI Logo Makers Reviews: What Actually Works in 2026

Shopping for the best AI logo maker? An AI logo maker is software that uses machine learning to help you get more done — it keeps getting smarter as the underlying models improve. Pricing, accuracy, and the size of the model behind the tool are the three factors that most affect daily usefulness. Whether you are a beginner or a pro, the right AI logo maker slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.
Read more →
AI Analytics Tools: Free vs Paid (2026)

In search of the best AI analytics tool? An AI analytics tool is software that uses machine learning to help you get more done — it turns a rough idea into a polished result in seconds. When choosing one, weigh output quality, pricing, export formats, and how well it fits the tools you already use. Whether you are a beginner or a pro, the right AI analytics tool slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.
Read more →
Jian Ma (computational biologist)

Jian Ma (Chinese: 马坚) is an American computer scientist and computational biologist. He is the Ray and Stephanie Lane Professor of Computational Biology in the School of Computer Science at Carnegie Mellon University. He is a faculty member in the Ray and Stephanie Lane Computational Biology Department. His lab develops AI/ML methods to study the structure and function of the human genome and cellular organization and their implications for health and disease. During his Ph.D. and postdoc training, he developed algorithms to reconstruct the ancestral mammalian genome and evolutionary history. His research group has recently pioneered a series of new machine learning solutions for 3D genome organization, single-cell epigenomics, spatial omics, and complex molecular interactions. His lab also explores large language models to uncover gene regulatory mechanisms and the intricate connections among cellular components, with the aim of driving discovery and guiding experimentation. He received an NSF CAREER award in 2011. In 2020, he was awarded a Guggenheim Fellowship in Computer Science. He received the Allen Newell Award for Research Excellence (2025). He is an elected Fellow of the American Association for the Advancement of Science, the American Institute for Medical and Biological Engineering, the International Society for Computational Biology, and the Association for Computing Machinery. He leads an NIH 4D Nucleome Center to develop machine learning algorithms to better understand the cell nucleus. He served as the Program Chair for RECOMB 2024. He is also a member of the Scientific Advisory Board of the Chan Zuckerberg Biohub Chicago (CZ Biohub Chicago) and the RECOMB Steering Committee. In 2024, he launched the Center for AI-Driven Biomedical Research (AI4BIO) at CMU, which will be a catalyst for innovations at the intersection of AI and biomedicine across the School of Computer Science and campus. == Selected Recent Publications == Chen V#, Yang M#, Cui W, Kim JS, Talwalkar A, and Ma J. Applying interpretable machine learning in computational biology - pitfalls, recommendations and opportunities for new developments. Nature Methods, 21(8):1454-1461, 2024. Xiong K#, Zhang R#, and Ma J. scGHOST: Identifying single-cell 3D genome subcompartments. Nature Methods, 21(5):814-822, 2024. Zhou T, Zhang R, Jia D, Doty RT, Munday AD, Gao D, Xin L, Abkowitz JL, Duan Z, and Ma J. GAGE-seq concurrently profiles multiscale 3D genome organization and gene expression in single cells. Nature Genetics, 56(8):1701-1711, 2024. Zhang Y, Boninsegna L, Yang M, Misteli T, Alber F, and Ma J. Computational methods for analysing multiscale 3D genome organization. Nature Reviews Genetics, 5(2):123-141, 2024. Chidester B#, Zhou T#, Alam S, and Ma J. SPICEMIX enables integrative single-cell spatial modeling of cell identity. Nature Genetics, 55(1):78-88, 2023. [Cover Article] Zhang R#, Zhou T#, and Ma J. Ultrafast and interpretable single-cell 3D genome analysis with Fast-Higashi. Cell Systems, 13(10):P798-807.E6, 2022. [Cover Article] Zhu X#, Zhang Y#, Wang Y, Tian D, Belmont AS, Swedlow JR, and Ma J. Nucleome Browser: An integrative and multimodal data navigation platform for 4D Nucleome. Nature Methods, 19(8):911-913, 2022. Zhang R, Zhou T, and Ma J. Multiscale and integrative single-cell Hi-C analysis with Higashi. Nature Biotechnology, 40:254–261, 2022.
Read more →
Grammar systems theory

Grammar systems theory is a field of theoretical computer science that studies systems of finite collections of formal grammars generating a formal language. Each grammar works on a string, a so-called sequential form that represents an environment. Grammar systems can thus be used as a formalization of decentralized or distributed systems of agents in artificial intelligence. Let A {\displaystyle \mathbb {A} } be a simple reactive agent moving on the table and trying not to fall down from the table with two reactions, t for turning and ƒ for moving forward. The set of possible behaviors of A {\displaystyle \mathbb {A} } can then be described as formal language L A = { ( f m t n f r ) + : 1 ≤ m ≤ k ; 1 ≤ n ≤ ℓ ; 1 ≤ r ≤ k } , {\displaystyle \mathbb {L_{A}} =\{(f^{m}t^{n}f^{r})^{+}:1\leq m\leq k;1\leq n\leq \ell ;1\leq r\leq k\},} where ƒ can be done maximally k times and t can be done maximally ℓ times considering the dimensions of the table. Let G A {\displaystyle \mathbb {G_{A}} } be a formal grammar which generates language L A {\displaystyle \mathbb {L_{A}} } . The behavior of A {\displaystyle \mathbb {A} } is then described by this grammar. Suppose the A {\displaystyle \mathbb {A} } has a subsumption architecture; each component of this architecture can be then represented as a formal grammar, too, and the final behavior of the agent is then described by this system of grammars. The schema on the right describes such a system of grammars which shares a common string representing an environment. The shared sequential form is sequentially rewritten by each grammar, which can represent either a component or generally an agent. If grammars communicate together and work on a shared sequential form, it is called a Cooperating Distributed (DC) grammar system. Shared sequential form is a similar concept to the blackboard approach in AI, which is inspired by an idea of experts solving some problem together while they share their proposals and ideas on a shared blackboard. Each grammar in a grammar system can also work on its own string and communicate with other grammars in a system by sending their sequential forms on request. Such a grammar system is then called a Parallel Communicating (PC) grammar system. PC and DC are inspired by distributed AI. If there is no communication between grammars, the system is close to the decentralized approaches in AI. These kinds of grammar systems are sometimes called colonies or Eco-Grammar systems, depending (besides others) on whether the environment is changing on its own (Eco-Grammar system) or not (colonies).
Read more →
AI Chatbots: Free vs Paid (2026)

In search of the best AI chatbot? An AI chatbot is software that uses machine learning to help you get more done — it turns a rough idea into a polished result in seconds. When choosing one, weigh output quality, pricing, export formats, and how well it fits the tools you already use. Whether you are a beginner or a pro, the right AI chatbot slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.
Read more →
AI Paraphrasing Tools Reviews: What Actually Works in 2026

Comparing the best AI paraphrasing tool? An AI paraphrasing tool is software that uses machine learning to help you get more done — it lowers the barrier so anyone can produce professional output. Privacy matters too: check whether your data trains the model and whether a no-log or enterprise tier is available. Whether you are a beginner or a pro, the right AI paraphrasing tool slots into your workflow and pays for itself fast. We tested the leading options and ranked them by quality, value, and ease of use.
Read more →
Danqi Chen

Danqi Chen (Chinese: 陈丹琦; pinyin: Chén Dānqí, IPA: [ʈ͡ʂʰə̌n tan t͡ɕʰǐ]; born in Changsha, China) is a Chinese computer scientist and assistant professor at Princeton University specializing in the AI field of natural language processing (NLP). In 2019, she joined the Princeton NLP group, alongside Sanjeev Arora, Christiane Fellbaum, and Karthik Narasimhan. She was previously a visiting scientist at Facebook AI Research (FAIR). She earned her Ph.D. at Stanford University and her BS from Tsinghua University. Chen is the author of Neural Reading Comprehension and Beyond, a dissertation on using artificial intelligence to access knowledge in ordinary and structured documents. She is the author or co-author of a number of journal articles, including Reading Wikipedia to Answer Open-Domain Questions. Google's SyntaxNet is based on algorithms developed by Danqi Chen and Christopher Manning at Stanford. Her primary research interests are in text understanding and knowledge representation and reasoning. She won a gold medal at the 2008 International Informatics Olympiad. She is known among friends as CDQ. A well known algorithm in competitive programming, CDQ Divide and Conquer, is named after this acronym. She is married to Huacheng Yu, an assistant professor in theoretical computer science at Princeton University.
Read more →
Curvelet

Curvelets are a non-adaptive technique for multi-scale object representation. Being an extension of the wavelet concept, they are becoming popular in similar fields, namely in image processing and scientific computing. Wavelets generalize the Fourier transform by using a basis that represents both location and spatial frequency. For 2D or 3D signals, directional wavelet transforms go further, by using basis functions that are also localized in orientation. A curvelet transform differs from other directional wavelet transforms in that the degree of localisation in orientation varies with scale. In particular, fine-scale basis functions are long ridges; the shape of the basis functions at scale j is 2 − j {\displaystyle 2^{-j}} by 2 − j / 2 {\displaystyle 2^{-j/2}} so the fine-scale bases are skinny ridges with a precisely determined orientation. Curvelets are an appropriate basis for representing images (or other functions) which are smooth apart from singularities along smooth curves, where the curves have bounded curvature, i.e. where objects in the image have a minimum length scale. This property holds for cartoons, geometrical diagrams, and text. As one zooms in on such images, the edges they contain appear increasingly straight. Curvelets take advantage of this property, by defining the higher resolution curvelets to be more elongated than the lower resolution curvelets. However, natural images (photographs) do not have this property; they have detail at every scale. Therefore, for natural images, it is preferable to use some sort of directional wavelet transform whose wavelets have the same aspect ratio at every scale. When the image is of the right type, curvelets provide a representation that is considerably sparser than other wavelet transforms. This can be quantified by considering the best approximation of a geometrical test image that can be represented using only n {\displaystyle n} wavelets, and analysing the approximation error as a function of n {\displaystyle n} . For a Fourier transform, the squared error decreases only as O ( 1 / n ) {\displaystyle O(1/{\sqrt {n}})} . For a wide variety of wavelet transforms, including both directional and non-directional variants, the squared error decreases as O ( 1 / n ) {\displaystyle O(1/n)} . The extra assumption underlying the curvelet transform allows it to achieve O ( ( log ⁡ n ) 3 / n 2 ) {\displaystyle O({(\log n)}^{3}/{n^{2}})} . Efficient numerical algorithms exist for computing the curvelet transform of discrete data. The computational cost of the discrete curvelet transforms proposed by Candès et al. (Discrete curvelet transform based on unequally-spaced fast Fourier transforms and based on the wrapping of specially selected Fourier samples) is approximately 6–10 times that of an FFT, and has the same dependence of O ( n 2 log ⁡ n ) {\displaystyle O(n^{2}\log n)} for an image of size n × n {\displaystyle n\times n} . == Curvelet construction == To construct a basic curvelet ϕ {\displaystyle \phi } and provide a tiling of the 2-D frequency space, two main ideas should be followed: Consider polar coordinates in frequency domain Construct curvelet elements being locally supported near wedges The number of wedges is N j = 4 ⋅ 2 ⌈ j 2 ⌉ {\displaystyle N_{j}=4\cdot 2^{\left\lceil {\frac {j}{2}}\right\rceil }} at the scale 2 − j {\displaystyle 2^{-j}} , i.e., it doubles in each second circular ring. Let ξ = ( ξ 1 , ξ 2 ) T {\displaystyle {\boldsymbol {\xi }}=\left(\xi _{1},\xi _{2}\right)^{T}} be the variable in frequency domain, and r = ξ 1 2 + ξ 2 2 , ω = arctan ⁡ ξ 1 ξ 2 {\displaystyle r={\sqrt {\xi _{1}^{2}+\xi _{2}^{2}}},\omega =\arctan {\frac {\xi _{1}}{\xi _{2}}}} be the polar coordinates in the frequency domain. We use the ansatz for the dilated basic curvelets in polar coordinates: ϕ ^ j , 0 , 0 := 2 − 3 j 4 W ( 2 − j r ) V ~ N j ( ω ) , r ≥ 0 , ω ∈ [ 0 , 2 π ) , j ∈ N 0 {\displaystyle {\hat {\phi }}_{j,0,0}:=2^{\frac {-3j}{4}}W(2^{-j}r){\tilde {V}}_{N_{j}}(\omega ),r\geq 0,\omega \in [0,2\pi ),j\in N_{0}} To construct a basic curvelet with compact support near a ″basic wedge″, the two windows W {\displaystyle W} and V ~ N j {\displaystyle {\tilde {V}}_{N_{j}}} need to have compact support. Here, we can simply take W ( r ) {\displaystyle W(r)} to cover ( 0 , ∞ ) {\displaystyle (0,\infty )} with dilated curvelets and V ~ N j {\displaystyle {\tilde {V}}_{N_{j}}} such that each circular ring is covered by the translations of V ~ N j {\displaystyle {\tilde {V}}_{N_{j}}} . Then the admissibility yields ∑ j = − ∞ ∞ | W ( 2 − j r ) | 2 = 1 , r ∈ ( 0 , ∞ ) . {\displaystyle \sum _{j=-\infty }^{\infty }\left|W(2^{-j}r)\right|^{2}=1,r\in (0,\infty ).} see Window Functions for more information For tiling a circular ring into N {\displaystyle N} wedges, where N {\displaystyle N} is an arbitrary positive integer, we need a 2 π {\displaystyle 2\pi } -periodic nonnegative window V ~ N {\displaystyle {\tilde {V}}_{N}} with support inside [ − 2 π N , 2 π N ] {\displaystyle \left[{\frac {-2\pi }{N}},{\frac {2\pi }{N}}\right]} such that ∑ l = 0 N − 1 V ~ N 2 ( ω − 2 π l N ) = 1 {\displaystyle \sum _{l=0}^{N-1}{\tilde {V}}_{N}^{2}\left(\omega -{\frac {2\pi l}{N}}\right)=1} , for all ω ∈ [ 0 , 2 π ) {\displaystyle \omega \in \left[0,2\pi \right)} , V ~ N {\displaystyle {\tilde {V}}_{N}} can be simply constructed as 2 π {\displaystyle 2\pi } -periodizations of a scaled window V ( N ω 2 π ) {\displaystyle V\left({\frac {N\omega }{2\pi }}\right)} . Then, it follows that ∑ l = 0 N j − 1 | 2 3 j 4 ϕ ^ j , 0 , 0 ( r , ω − 2 π l N j ) | 2 = | W ( 2 − j r ) | 2 ∑ l = 0 N j − 1 V ~ N j 2 ( ω − 2 π l N ) = | W ( 2 − j r ) | 2 {\displaystyle \sum _{l=0}^{N_{j}-1}\left|2^{\frac {3j}{4}}{\hat {\phi }}_{j,0,0}\left(r,\omega -{\frac {2\pi l}{N_{j}}}\right)\right|^{2}=\left|W(2^{-j}r)\right|^{2}\sum _{l=0}^{N_{j}-1}{\tilde {V}}_{N_{j}}^{2}\left(\omega -{\frac {2\pi l}{N}}\right)=\left|W(2^{-j}r)\right|^{2}} For a complete covering of the frequency plane including the region around zero, we need to define a low pass element ϕ ^ − 1 := W 0 ( | ξ | ) {\displaystyle {\hat {\phi }}_{-1}:=W_{0}(\left|\xi \right|)} with W 0 2 ( r ) 2 := 1 − ∑ j = 0 ∞ W ( 2 − j r ) 2 {\displaystyle W_{0}^{2}(r)^{2}:=1-\sum _{j=0}^{\infty }W(2^{-j}r)^{2}} that is supported on the unit circle, and where we do not consider any rotation. == Applications == Image processing Seismic exploration Fluid mechanics PDEs solving Compressed sensing
Read more →
Ancient text corpora

Ancient text corpora are the entire collection of texts from the period of ancient history, defined in this article as the period from the beginning of writing up to 300 AD. These corpora are important for the study of literature, history, linguistics, and other fields, and are a fundamental component of the world's cultural heritage. Chinese, Latin, and Greek are examples of ancient languages with significant text corpora, although much of these corpora are known to us via transmission (frequently via medieval manuscript copies) rather than in their original form. These texts – both transmitted and original – provide valuable insights into the history and culture of different regions of the world, and have been studied for centuries by scholars and researchers. Other ancient texts – particularly stone inscriptions and papyrus scrolls – have been published following archaeological research, notably the cuneiform corpus of c.10 million words and the c.5 million words in ancient Egyptian. Through advances in technology and digitization, ancient text corpora are more accessible than ever before. Tools such as the Perseus Digital Library and the Digital Corpus of Sanskrit have made it easier for researchers to access and analyze these texts. == Quantifying the corpora == Two types of ancient texts are known to modern scholars – those that have only survived in younger manuscripts, but whose great age is undisputed (this applies to the bulk of the Chinese, Brahmi, Greek, Latin, Hebrew and Avestan tradition), and those known from original inscriptions, papyri and other manuscripts. Counting of the words in each corpus presents significant methodological challenges – in principle, every single occurrence of a word in the text is counted separately, but in the case of parallel transmission of literary texts, only a single transmission is taken into account. Just as the Book of the Dead and the coffin texts are only included once in the number given for the Egyptian, the Greek and Latin literary works should only be counted according to one manuscript. If, on the other hand, tombs, royal inscriptions or economic documents of certain ancient languages often show a more or less identical form, this is not evaluated as a purely "parallel tradition". Attached prepositions are counted as separate words, except in the case of the definite article in Hebrew, Aramaic and Greek since it has no equivalent in most languages, so its frequency would significantly affect the comparability of numbers. === Languages with known size estimates === === South Asian === Sanskrit (Vedic Sanskrit and Classical Sanskrit) Indus script (3,800 items, c.20,000 characters) Brahmi script Old Tamil Early Indian epigraphy and Indian epic poetry Kharosthi Pali literature List of historic Indian texts === Mesoamerican === Olmec hieroglyphs Maya script === East Asian === Old Chinese Chinese classics The pre-Qin corpus: a collection of ancient Chinese texts written before the Qin dynasty (221 BCE). The corpus includes texts from Confucianism, Taoism, Legalism, and other schools of thought. The pre-Han corpus: a collection of ancient Chinese texts written before the Han dynasty (202 BCE). The corpus includes texts from Confucianism, Taoism, Legalism, and other schools of thought. See the Chinese Text Project Chinese bronze inscriptions, Oracle bone script, Seal script, Clerical script === Central Iranian languages === Prior to 300 AD, the Central Iranian languages are mainly in the form of Sassanid stone inscriptions in the two closely related idioms Middle Persian (Pahlavi scripts and Inscriptional Parthian), there are 5000 for the corpus of Middle Persian (mostly 3rd, but also 4th/5th centuries) and for the corpus of Parthian (3rd century) 3000 words. To what extent some of the Manichaean Middle Persian literary texts may date back to the 3rd century is difficult to estimate; Mani is said to have personally written the Shabuhragan totaling about 5000 words. In any case, if we combine Middle Persian and Parthian, we come to over 10,000 words. === Proto-Sinaitic === Proto-Sinaitic script has no more than about 400 letters (number of words is unknown since the script has not been fully interpreted). To a similar extent, there are probably approximately contemporaneous Proto-Canaanite inscriptions (ibid.). === Anatolian === Luwian cuneiform, approx. 3000 words the Palaic language few hundred words. Hieroglyphic Luwian the Lycian alphabet (the best attested Anatolian successor language written in alphabetic script) with about 5000 words The Lydian alphabet 109 inscriptions comprising about 1500 words The Phrygian alphabet the in-tomb inscriptions from the 2nd and 3rd centuries AD (approx. 1000 words) and in the so-called "old Phrygian" inscriptions less than 300 words The Carian alphabets whose texts, mainly from Egypt, contain around 600 words. === Old Italic === the Umbrian language attested essentially by the sacrificial instructions of the Iguvinian Tables with 5000 words the Oscan language (ibid.) with 2000 words the Messapic language with probably a good 1000 words (the estimate is difficult because most texts in this hardly understandable language do not use word separators) the Venetic language a few hundred words the Faliscan language a few hundred words Cisalpine Celtic inscriptions amount to approximately 2000 words, to which are added a number of glosses by classical authors === Iberia === Iberian scripts, more rarely written in Greek or Latin script, approx. 2500 words Celtiberian script, which refers to Celtic language testimonies in Iberian, but also in Latin script from Spain (approx. 1000 words) Southwest Paleohispanic script, 78 inscriptions, a few hundred words Lusitanian language, three monuments in Latin script, approx. 60 words === Germanic Northern Europe === Runic inscriptions dated before the 4th century amount to about 30 pieces, which contain no more than 50 words in total === Africa === Geʽez script: comparatively few inscriptions with a total of around 1,000 words before 300 AD. Following Christianization in the 4th century, more extensive texts are known. Libyco-Berber alphabet: over 1,000 inscriptions from the Maghreb, which are dated to Roman times. Most texts do not use a word separator; Peust estimates that the total number of words could be around 5,000 Meroitic script (Ancient Nubian): about 900 texts are known, which Peust estimates may contain approximately 10,000 words, albeit with uncertainty from the fact that the word separator is not used consistently in the Meroitic script. === Aegean === The Cretan Linear A inscriptions that have not yet been deciphered are available in about 2500 texts, which contain a total of around 20,000 characters. The total number of words can hardly be determined; Peust tentatively put it in the same order of magnitude as in Meroitic. In addition to the Linear A texts, there are also inscriptions Cretan hieroglyphs of a few hundred characters and texts written in the Greek alphabet, but not in Greek, with a few dozen words Cypriot syllabary in the first millennium BC, in which mostly Greek texts were recorded. The relevant texts comprise around 100 to 200 words. === Micro corpora === There are a significant number of ancient micro-corpus languages. Estimating the total number of attested ancient languages may be as difficult as estimating their corpus size. For example, Greek and Latin sources hand down an enormous amount of foreign-language glosses, the seriousness of which is not always certain. == Preservation and curation == Historic preservation and maintaining ancient text corpora presents several challenges, including issues with preservation, translation, and digitization. Many ancient texts have been lost over time, and those that survive may be damaged or fragmented. Translating ancient languages and scripts requires specialized expertise, and digitizing texts can be time-consuming and resource-intensive. == Corpus linguistics == The field of corpus linguistics studies language as expressed in text corpora. This includes the analysis of word frequency, collocations, grammar, and semantics. Ancient text corpora provide a valuable resource for corpus linguistics research, enabling scholars to explore the evolution of language and culture over time.
Read more →
Foma (software)

Foma is a free and open source finite-state toolkit created and maintained by Mans Hulden. It includes a compiler, programming language, and C library for constructing finite-state automata and transducers (FST's) for various uses, most typically Natural Language Processing uses such as morphological analysis. Foma can replace the proprietary Xerox Finite State Toolkit for compiling and running FST's written in the lexc and xfst formalisms. The speed is comparable with the Xerox tools for most lexicons, although Foma can be 3 or 4 times slower for very large lexicons (e.g. >100,000 words). Foma is also one of the possible backends of the free and open source Helsinki Finite State Toolkit (where other backends provide support for further formalisms). There are several FOSS morphologies written in lexc/xfst compatible with foma, e.g. for the Sámi, Cornish, Faroese, Finnish, Komi, Mari, Udmurt, Buriat, Greenlandic language and Iñupiaq languages.
Read more →