AI Face Fusion

AI Face Fusion — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Automated essay scoring

    Automated essay scoring

    Automated essay scoring (AES) is the use of specialized computer programs to assign grades to essays written in an educational setting. It is a form of educational assessment and an application of natural language processing. Its objective is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades, for example, the numbers 1 to 6. Therefore, it can be considered a problem of statistical classification. Several factors have contributed to a growing interest in AES. Among them are cost, accountability, standards, and technology. Rising education costs have led to pressure to hold the educational system accountable for results by imposing standards. The advance of information technology promises to measure educational achievement at reduced cost. The use of AES for high-stakes testing in education has generated significant backlash, with opponents pointing to research that computers cannot yet grade writing accurately and arguing that their use for such purposes promotes teaching writing in reductive ways (i.e. teaching to the test). == History == Most historical summaries of AES trace the origins of the field to the work of Ellis Batten Page. In 1966, he argued for the possibility of scoring essays by computer, and in 1968 he published his successful work with a program called Project Essay Grade (PEG). Using the technology of that time, computerized essay scoring would not have been cost-effective, so Page abated his efforts for about two decades. Eventually, Page sold PEG to Measurement Incorporated. By 1990, desktop computers had become so powerful and so widespread that AES was a practical possibility. As early as 1982, a UNIX program called Writer's Workbench was able to offer punctuation, spelling and grammar advice. In collaboration with several companies (notably Educational Testing Service), Page updated PEG and ran some successful trials in the early 1990s. Peter Foltz and Thomas Landauer developed a system using a scoring engine called the Intelligent Essay Assessor (IEA). IEA was first used to score essays in 1997 for their undergraduate courses. It is now a product from Pearson Educational Technologies and used for scoring within a number of commercial products and state and national exams. IntelliMetric is Vantage Learning's AES engine. Its development began in 1996. It was first used commercially to score essays in 1998. Educational Testing Service offers "e-rater", an automated essay scoring program. It was first used commercially in February 1999. Jill Burstein was the team leader in its development. ETS's Criterion Online Writing Evaluation Service uses the e-rater engine to provide both scores and targeted feedback. Lawrence Rudner has done some work with Bayesian scoring, and developed a system called BETSY (Bayesian Essay Test Scoring sYstem). Some of his results have been published in print or online, but no commercial system incorporates BETSY as yet. Under the leadership of Howard Mitzel and Sue Lottridge, Pacific Metrics developed a constructed response automated scoring engine, CRASE. Currently utilized by several state departments of education and in a U.S. Department of Education-funded Enhanced Assessment Grant, Pacific Metrics’ technology has been used in large-scale formative and summative assessment environments since 2007. Measurement Inc. acquired the rights to PEG in 2002 and has continued to develop it. In 2012, the Hewlett Foundation sponsored a competition on Kaggle called the Automated Student Assessment Prize (ASAP). 201 challenge participants attempted to predict, using AES, the scores that human raters would give to thousands of essays written to eight different prompts. The intent was to demonstrate that AES can be as reliable as human raters, or more so. The competition also hosted a separate demonstration among nine AES vendors on a subset of the ASAP data. Although the investigators reported that the automated essay scoring was as reliable as human scoring, this claim was not substantiated by any statistical tests because some of the vendors required that no such tests be performed as a precondition for their participation. Moreover, the claim that the Hewlett Study demonstrated that AES can be as reliable as human raters has since been strongly contested, including by Randy E. Bennett, the Norman O. Frederiksen Chair in Assessment Innovation at the Educational Testing Service. Some of the major criticisms of the study have been that five of the eight datasets consisted of paragraphs rather than essays, four of the eight data sets were graded by human readers for content only rather than for writing ability, and that rather than measuring human readers and the AES machines against the "true score", the average of the two readers' scores, the study employed an artificial construct, the "resolved score", which in four datasets consisted of the higher of the two human scores if there was a disagreement. This last practice, in particular, gave the machines an unfair advantage by allowing them to round up for these datasets. In 1966, Page hypothesized that, in the future, the computer-based judge will be better correlated with each human judge than the other human judges are. Despite criticizing the applicability of this approach to essay marking in general, this hypothesis was supported for marking free text answers to short questions, such as those typical of the British GCSE system. Results of supervised learning demonstrate that the automatic systems perform well when marking by different human teachers is in good agreement. Unsupervised clustering of answers showed that excellent papers and weak papers formed well-defined clusters, and the automated marking rule for these clusters worked well, whereas marks given by human teachers for the third cluster ('mixed') can be controversial, and the reliability of any assessment of works from the 'mixed' cluster can often be questioned (both human and computer-based). == Different dimensions of essay quality == According to a recent survey, modern AES systems try to score different dimensions of an essay's quality in order to provide feedback to users. These dimensions include the following items: Grammaticality: following grammar rules Usage: using of prepositions, word usage Mechanics: following rules for spelling, punctuation, capitalization Style: word choice, sentence structure variety Relevance: how relevant of the content to the prompt Organization: how well the essay is structured Development: development of ideas with examples Cohesion: appropriate use of transition phrases Coherence: appropriate transitions between ideas Thesis Clarity: clarity of the thesis Persuasiveness: convincingness of the major argument == Procedure == From the beginning, the basic procedure for AES has been to start with a training set of essays that have been carefully hand-scored. The program evaluates surface features of the text of each essay, such as the total number of words, the number of subordinate clauses, or the ratio of uppercase to lowercase letters—quantities that can be measured without any human insight. It then constructs a mathematical model that relates these quantities to the scores that the essays received. The same model is then applied to calculate scores of new essays. Recently, one such mathematical model was created by Isaac Persing and Vincent Ng. which not only evaluates essays on the above features, but also on their argument strength. It evaluates various features of the essay, such as the agreement level of the author and reasons for the same, adherence to the prompt's topic, locations of argument components (major claim, claim, premise), errors in the arguments, cohesion in the arguments among various other features. In contrast to the other models mentioned above, this model is closer in duplicating human insight while grading essays. Due to the growing popularity of deep neural networks, deep learning approaches have been adopted for automated essay scoring, generally obtaining superior results, often surpassing inter-human agreement levels. The various AES programs differ in what specific surface features they measure, how many essays are required in the training set, and most significantly in the mathematical modeling technique. Early attempts used linear regression. Modern systems may use linear regression or other machine learning techniques often in combination with other statistical techniques such as latent semantic analysis and Bayesian inference. The automated essay scoring task has also been studied in the cross-domain setting using machine learning models, where the models are trained on essays written for one prompt (topic) and tested on essays written for another prompt. Successful approaches in the cross-domain scenario are based on deep neural networks or models that combine deep and shallow features. == Criteria for success == Any method of a

    Read more →
  • Information strategist

    Information strategist

    An information strategist analyses the information flow within an organisation and directs its information resources to better serve the organisation's strategic goals. They work with information technology or within a corporate library to direct high quality information from a variety of sources to users, based upon their profiles and needs. In warfare, information strategists not only seek to improve information flows for their own side but also try to disrupt the information flows of the enemy in order to demoralize and deceive them.

    Read more →
  • Artificial intelligence industry in Taiwan

    Artificial intelligence industry in Taiwan

    The artificial intelligence (AI) industry in Taiwan refers to the development, application, and commercialization of artificial intelligence technologies within Taiwan. The industry has grown alongside Taiwan's established strengths in semiconductor manufacturing and information and communications technology (ICT), and is supported by government policy, research institutions, and private sector participation. AI development in Taiwan has focused on integrating hardware capabilities with software applications across sectors such as manufacturing, healthcare, and smart infrastructure. Artificial intelligence has been identified as a strategic area of development in Taiwan since the late 2010s. While Taiwan has historically played a limited role in early theoretical and expert-system phases of AI development, its position in global electronics manufacturing has provided a foundation for participation in the contemporary era of machine learning and data-driven AI systems. Taiwan's AI industry is characterized by a strong hardware base, particularly in semiconductor production and AI server manufacturing, combined with increasing investment in software, data infrastructure, and applied AI services. The sector has been shaped by global demand for computing power, advances in deep learning, and the expansion of AI applications in industrial and commercial contexts. == Government policy and development == The Taiwanese government has promoted AI development through a series of national strategies. In 2017, the Ministry of Science and Technology launched the "AI Grand Strategy for a Small Country" initiative, investing approximately US$517 million between 2017 and 2021 to support research, infrastructure, and talent development. This initiative aimed to build a domestic AI ecosystem by funding research centers, expanding data infrastructure, and supporting industrial adoption. The Executive Yuan also introduced the AI Taiwan Action Plan 1.0 (2018–2021), which focused on integrating AI technologies into existing industries and strengthening research and development capabilities. A subsequent plan, AI Taiwan Action Plan 2.0 (2023–2026), expanded the focus to include ethical governance, regulatory frameworks, and risk management in response to the growth of generative AI technologies. In 2023, the Taiwan AI Center of Excellence (Taiwan AICoE), a government-backed hub, was established by the National Science and Technology Council to accelerate AI development, foster international collaboration, and train talent in Taiwan. It acts as a specialized think tank focusing on creating a "smart technology island" by integrating AI resources and developing trusted, human-centric AI technologies. In 2024, the Taiwan Chip-based Industrial Innovation Program (CbI) was launched by the Executive Yuan as a 10-year, NT$300 billion (US$9.3 billion) initiative to leverage Taiwan's semiconductor dominance, driving innovation in AI, smart mobility, manufacturing, and healthcare. It aims to combine generative AI with IC technology, cultivate talent, and attract global startups to build a "Silicon Island". In parallel, the Taiwanese government has explored legislative frameworks such as a proposed Artificial Intelligence Fundamental Act in December 2025, addressing issues including data protection, safety standards, and intellectual property. == Industrial structure == === Semiconductor and hardware foundation === Taiwan's AI industry is closely linked to its semiconductor sector. In 2020, Taiwan accounted for approximately 77.3% of the global wafer foundry market and 57.7% of packaging and testing, with a 20.1% share in integrated circuit (IC) design. These capabilities provide critical infrastructure for AI systems, which rely on high-performance computing hardware. Taiwanese firms are also involved in the production of AI servers and related components, contributing significantly to global supply chains for data centers and cloud computing. The integration of chip design, manufacturing, and assembly has enabled Taiwan to play a central role in providing the computational resources required for AI development. On 20 November 2025, Google established the "Google Taiwan AI Infrastructure R&D Center", second only to its US headquarters and largest AI hardware infrastructure engineering center outside of the United States. === Software and services === Compared to its hardware capabilities, Taiwan's AI software sector is less developed. The absence of large-scale global AI platform companies has been noted as a structural limitation. As a result, much of Taiwan's AI industry focuses on applied solutions, including customization of existing AI models for specific industries. Therefore, efforts to strengthen software capabilities have included investment in research institutions, startup ecosystems, and collaborations between academia and industry. == Applications == === Smart manufacturing === AI has been widely applied in Taiwan's manufacturing sector, which is a major component of the economy. Applications include process automation, predictive maintenance, quality control, and fault detection. AI-enabled smart manufacturing systems aim to improve efficiency, reduce production costs, and enhance product quality. Taiwan's manufacturing industry has incorporated AI technologies into production lines, particularly in electronics and machinery sectors. === Healthcare === The use of AI in healthcare in Taiwan has expanded in areas such as medical imaging, diagnostics, and drug development. AI systems are used to analyze CT scans, MRI data, and other clinical information to support diagnosis and treatment planning. Taiwan's healthcare sector, which includes medical devices, pharmaceuticals, and medical services, has benefited from the integration of AI technologies, particularly in precision medicine and clinical decision support systems. A notable example of AI healthcare deployment in Taiwan is the collaboration between Siemens Healthineers, Ever Fortune AI, and Asia University Hospital. === Edge computing and IoT === AI applications in Taiwan increasingly involve edge computing, where data processing occurs near the source rather than in centralized cloud systems. This approach reduces latency and bandwidth requirements and is used in smart devices, sensors, and industrial equipment. Edge AI technologies are applied in areas such as smart appliances, industrial automation, and transportation systems. == Education and talent development == Human capital development has been a key focus of Taiwan's AI strategy. The Taiwan AI Academy, established in 2018 with support from Academia Sinica and industry partners, provides training programs for professionals and students aimed at accelerating the adoption of artificial intelligence technologies across industries. The academy offers a range of courses, including executive-level programs, technical training, and specialized tracks in areas such as smart manufacturing, smart healthcare, and edge AI. These programs are designed to provide intensive and practical instruction over relatively short periods. A notable component of the curriculum is project-based learning, in which participants are required to complete proof-of-concept (POC) projects addressing real-world industrial problems. These projects are often developed further for implementation within companies, facilitating technology transfer and commercialization. Between 2018 and 2021, more than 8,000 individuals completed AI training programs across campuses in Taipei, Hsinchu, Taichung, and Tainan. Graduates of the academy have contributed to the introduction of AI systems in sectors such as manufacturing, healthcare, and finance, supporting broader industrial transformation efforts. In addition to the Taiwan AI Academy, universities and research institutions in Taiwan play a significant role in AI education and research. Leading universities have expanded programs in computer science, data science, and machine learning, while research institutes conduct applied and fundamental studies in artificial intelligence. Collaboration between academia, government, and industry is a common feature of Taiwan's AI ecosystem, with joint research projects, internship programs, and technology incubation initiatives supporting talent development. Government-supported initiatives have also sought to attract and retain AI talent, including funding for graduate education, international collaboration programs, and incentives for industry–academic partnerships. These efforts aim to address talent shortages and strengthen Taiwan's capacity in both applied and foundational AI research. == Regulation and governance == Taiwan has developed guidelines and policy frameworks to address the risks associated with AI technologies. In 2023, the Executive Yuan issued guidelines for the use of generative AI in government agencies, focusing on data security and privacy. Ongoing policy discussions hav

    Read more →
  • Microsoft SQL Server Master Data Services

    Microsoft SQL Server Master Data Services

    Microsoft SQL Server Master Data Services (MDS) is a Master Data Management (MDM) product from Microsoft that ships as a part of the Microsoft SQL Server relational database management system. Master data management (MDM) allows an organization to discover and define non-transactional lists of data, and compile maintainable, reliable master lists. Master Data Services first shipped with Microsoft SQL Server 2008 R2. Microsoft SQL Server 2016 introduced enhancements to Master Data Services, such as improved performance and security, and the ability to clear transaction logs, create custom indexes, share entity data between different models, and support for many-to-many relationships. == Overview == In Master Data Services, the model is the highest level container in the structure of your master data. You create a model to manage groups of similar data. A model contains one or more entities, and entities contain members that are the data records. An entity is similar to a table. Like other MDM products, Master Data Services aims to create a centralized data source and keep it synchronized, and thus reduce redundancies, across the applications which process the data. Sharing the architectural core with Stratature +EDM, Master Data Services uses a Microsoft SQL Server database as the physical data store. It is a part of the Master Data Hub, which uses the database to store and manage data entities. It is a database with the software to validate and manage the data, and keep it synchronized with the systems that use the data. The master data hub has to extract the data from the source system, validate, sanitize and shape the data, remove duplicates, and update the hub repositories, as well as synchronize the external sources. The entity schemas, attributes, data hierarchies, validation rules and access control information are specified as metadata to the Master Data Services runtime. Master Data Services does not impose any limitation on the data model. Master Data Services also allows custom Business rules, used for validating and sanitizing the data entering the data hub, to be defined, which is then run against the data matching the specified criteria. All changes made to the data are validated against the rules, and a log of the transaction is stored persistently. Violations are logged separately, and optionally the owner is notified, automatically. All the data entities can be versioned. Master Data Services allows the master data to be categorized by hierarchical relationships, such as employee data are a subtype of organization data. Hierarchies are generated by relating data attributes. Data can be automatically categorized using rules, and the categories are introspected programmatically. Master Data Services can also expose the data as Microsoft SQL Server views, which can be pulled by any SQL-compatible client. It uses a role-based access control system to restrict access to the data. The views are generated dynamically, so they contain the latest data entities in the master hub. It can also push out the data by writing to some external journals. Master Data Services also includes a web-based UI for viewing and managing the data. It uses ASP.NET in the back-end. The Silverlight front-end was replaced with HTML5 in SQL Server 2019. Master Data Services provides a Web service interface to expose the data, as well as an API, which internally uses the exposed web services, exposing the feature set, programmatically, to access and manipulate the data. It also integrates with Active Directory for authentication purposes. Unlike +EDM, Master Data Services supports Unicode characters, as well as support multilingual user interfaces. SQL Server 2016 introduced a significant performance increase in Master Data Services over previous versions. == Terminology == Model is the highest level of an MDS instance. It is the primary container for specific groupings of master data. In many ways it is very similar to the idea of a database. Entities are containers created within a model. Entities provide a home for members, and are in many ways analogous to database tables. (e.g. Customer) Members are analogous to the records in a database table (Entity) e.g. Will Smith. Members are contained within entities. Each member is made up of two or more attributes. Attributes are analogous to the columns within a database table (Entity) e.g. Surname. Attributes exist within entities and help describe members (the records within the table). Name and Code attributes are created by default for each entity and serve to describe and uniquely identify leaf members. Attributes can be related to other attributes from other entities which are called 'domain-based' attributes. This is similar to the concept of a foreign key. Other attributes however, will be of type 'free-form' (most common) or 'file'. Attribute Groups are explicitly defined collections of particular attributes. Say you have an entity "customer" that has 50 attributes — too much information for many of your users. Attribute groups enable the creation of custom sets of hand-picked attributes that are relevant for specific audiences. (e.g. "customer - delivery details" that would include just their name and last known delivery address). This is very similar to a database view. Hierarchies organize members into either Derived or Explicit hierarchical structures. Derived hierarchies, as the name suggests, are derived by the MDS engine based on the relationships that exist between attributes. Explicit hierarchies are created by hand using both leaf and consolidated members. Business Rules can be created and applied against model data to ensure that custom business logic is adhered to. In order to be committed into the system data must pass all business rule validations applied to them. e.g. Within the Customer Entity you may want to create a business rule that ensures all members of the 'Country' Attribute contain either the text "USA" or "Canada". The Business Rule once created and ran will then verify all the data is correct before it accepts it into the approved model. Versions provide system owners / administrators with the ability to Open, Lock or Commit a particular version of a model and the data contained within it at a particular point in time. As the content within a model varies, grows or shrinks over time versions provide a way of managing metadata so that subscribing systems can access to the correct content.

    Read more →
  • Anomaly detection

    Anomaly detection

    In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data. Anomaly detection finds application in many domains including cybersecurity, medicine, machine vision, statistics, neuroscience, law enforcement and financial fraud to name only a few. Anomalies were initially searched for clear rejection or omission from the data to aid statistical analysis, for example to compute the mean or standard deviation. They were also removed to better predictions from models such as linear regression, and more recently their removal aids the performance of machine learning algorithms. However, in many applications anomalies themselves are of interest and are the observations most desirous in the entire data set, which need to be identified and separated from noise or irrelevant outliers. Three broad categories of anomaly detection techniques exist. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involves training a classifier. However, this approach is rarely used in anomaly detection due to the general unavailability of labelled data and the inherent unbalanced nature of the classes. Semi-supervised anomaly detection techniques assume that some portion of the data is labelled. This may be any combination of the normal or anomalous data, but more often than not, the techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance to be generated by the model. Unsupervised anomaly detection techniques assume the data is unlabelled and are by far the most commonly used due to their wider and relevant application. == Definition == Many attempts have been made in the statistical and computer science communities to define an anomaly. The most prevalent ones include the following, and can be categorised into three groups: those that are ambiguous, those that are specific to a method with pre-defined thresholds usually chosen empirically, and those that are formally defined: === Ill defined === An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism. Anomalies are instances or collections of data that occur very rarely in the data set and whose features differ significantly from most of the data. An outlier is an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data. An anomaly is a point or collection of points that is relatively distant from other points in multi-dimensional space of features. Anomalies are patterns in data that do not conform to a well-defined notion of normal behaviour. === Specific === Let T be observations from a univariate Gaussian distribution and O a point from T. Then the z-score for O is greater than a pre-selected threshold if and only if O is an outlier. == History == === Intrusion detection === The concept of intrusion detection, a critical component of anomaly detection, has evolved significantly over time. Initially, it was a manual process where system administrators would monitor for unusual activities, such as a vacationing user's account being accessed or unexpected printer activity. This approach was not scalable and was soon superseded by the analysis of audit logs and system logs for signs of malicious behavior. By the late 1970s and early 1980s, the analysis of these logs was primarily used retrospectively to investigate incidents, as the volume of data made it impractical for real-time monitoring. The affordability of digital storage eventually led to audit logs being analyzed online, with specialized programs being developed to sift through the data. These programs, however, were typically run during off-peak hours due to their computational intensity. The 1990s brought the advent of real-time intrusion detection systems capable of analyzing audit data as it was generated, allowing for immediate detection of and response to attacks. This marked a significant shift towards proactive intrusion detection. As the field has continued to develop, the focus has shifted to creating solutions that can be efficiently implemented across large and complex network environments, adapting to the ever-growing variety of security threats and the dynamic nature of modern computing infrastructures. == Applications == Anomaly detection is applicable in a very large number and variety of domains, and is an important subarea of unsupervised machine learning. As such it has applications in cyber-security, intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, detecting ecosystem disturbances, defect detection in images using machine vision, medical diagnosis and law enforcement. === Intrusion detection === Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986. Anomaly detection for IDS is normally accomplished with thresholds and statistics, but can also be done with soft computing, and inductive learning. Types of features proposed by 1999 included profiles of users, workstations, networks, remote hosts, groups of users, and programs based on frequencies, means, variances, covariances, and standard deviations. The counterpart of anomaly detection in intrusion detection is misuse detection. === Fintech fraud detection === Anomaly detection is vital in fintech for fraud prevention. === Preprocessing === Preprocessing data to remove anomalies can be an important step in data analysis, and is done for a number of reasons. Statistics such as the mean and standard deviation are more accurate after the removal of anomalies, and the visualisation of data can also be improved. In supervised learning, removing the anomalous data from the dataset often results in a statistically significant increase in accuracy. === Video surveillance === Anomaly detection has become increasingly vital in video surveillance to enhance security and safety. With the advent of deep learning technologies, methods using Convolutional Neural Networks (CNNs) and Simple Recurrent Units (SRUs) have shown significant promise in identifying unusual activities or behaviors in video data. These models can process and analyze extensive video feeds in real-time, recognizing patterns that deviate from the norm, which may indicate potential security threats or safety violations. An important aspect for video surveillance is the development of scalable real-time frameworks. Such pipelines are required for processing multiple video streams with low computational resources. === IT infrastructure === In IT infrastructure management, anomaly detection is crucial for ensuring the smooth operation and reliability of services. These are complex systems, composed of many interactive elements and large data quantities, requiring methods to process and reduce this data into a human and machine interpretable format. Techniques like the IT Infrastructure Library (ITIL) and monitoring frameworks are employed to track and manage system performance and user experience. Detected anomalies can help identify and pre-empt potential performance degradations or system failures, thus maintaining productivity and business process effectiveness. === IoT systems === Anomaly detection is critical for the security and efficiency of Internet of Things (IoT) systems. It helps in identifying system failures and security breaches in complex networks of IoT devices. The methods must manage real-time data, diverse device types, and scale effectively. Garg et al. have introduced a multi-stage anomaly detection framework that improves upon traditional methods by incorporating spatial clustering, density-based clustering, and locality-sensitive hashing. This tailored approach is designed to better handle the vast and varied nature of IoT data, thereby enhancing security and operational reliability in smart infrastructure and industrial IoT systems. === Petroleum industry === Anomaly detection is crucial in the petroleum industry for monitoring critical machinery. A 2015 paper proposed a novel segmentation algorithm using support vector machines to analyze sensor data for real-time anomaly detection. === Oil and gas pipeline monitoring === In the oil and gas sector, anomaly detection is not just crucial for maintenance and safety, but also for environmental protection. Aljameel et al. propose an advanced machine learning-based model for detecting minor leaks in oil and gas pipelines, a task traditional methods may miss.

    Read more →
  • Ecoinformatics

    Ecoinformatics

    Ecoinformatics, or ecological informatics, is the science of information in ecology and environmental science. It integrates environmental and information sciences to define entities and natural processes with language common to both humans and computers. However, this is a rapidly developing area in ecology and there are alternative perspectives on what constitutes ecoinformatics. A few definitions have been circulating, mostly centered on the creation of tools to access and analyze natural system data. However, the scope and aims of ecoinformatics are certainly broader than the development of metadata standards to be used in documenting datasets. Ecoinformatics aims to facilitate environmental research and management by developing ways to access, integrate databases of environmental information, and develop new algorithms enabling different environmental datasets to be combined to test ecological hypotheses. Ecoinformatics is related to the concept of ecosystem services. Ecoinformatics characterize the semantics of natural system knowledge. For this reason, much of today's ecoinformatics research relates to the branch of computer science known as knowledge representation, and active ecoinformatics projects are developing links to activities such as the Semantic Web. Current initiatives to effectively manage, share, and reuse ecological data are indicative of the increasing importance of fields like ecoinformatics to develop the foundations for effectively managing ecological information. Examples of these initiatives are National Science Foundation Datanet projects, DataONE, Data Conservancy, and Artificial Intelligence for Environment & Sustainability. == Software Development Lifecycle == Central to the concept of ecoinformatics is the Software Development Lifecycle (SDLC), a systematic framework for writing, implementing, and maintaining software products. Typically in Ecoinformatics projects, the development pipeline includes data collection, usually from several different environmental data sources, then integrating these data sources together, and then analyzing the data. Here, each step of the SDLC is described in the context of ecoinformatics, per Michener et al. It is important to note that the plan, collect, assure, describes and preserve steps refer to the data collection entity, which can be individual researchers or large data-collection networks, while the discover, integrate, and analyze steps typically refer to the individual researcher. Plan: Ecoinformatics projects require data from several databases. Each database holds different data, and therefore researchers should identify what types of environmental or ecological data they will need to answer their research question. Collect: Data is collected in several different ways. In ecoinformatics, this is usually restricted to manually entering data into a spreadsheet, and parsing data from an existing database. The growth of relational databases has made it easier for ecologists to download relevant data and integrate datasets together Assure: Data entries should be checked thoroughly to validate their accuracy and usability, such as to check for outliers and erroneous points. The same principle applies to data downloaded from datasets. This responsibility falls on both the ecologist downloading the data, and the entity that sets up the data collection system. Describe: An accurate description of the metadata of a dataset that is used in a study should include enough information to deduce the data collection and processing methodology, when the data were collected, why the data were collected, and how the data were stored. This is important for reproducibility, especially for projects that build on each other and may recycle data Preserve: After data is collected by an institutional entity, it should be archived such that it is easily accessible. Ideally, this is in databases that are maintained and not at risk of deprecation Discover: While there are good practices for discovering data to start a research project, this process is often marred by a lack of usable, published data, as researchers may collect data specific to their study, but may not publish this data for wider use. On the data collection end, this can be addressed by better data-sharing practices, such as by linking datasets when publishing papers or studies. On the data procurement end, this can be addressed by more precise data searching, such as using key words to find relevant datasets. Integrate: Synthesizing datasets together can be difficult and labor-intensive, largely due to the methodological differences in data collection. There are several approaches to this, but the best practices typically involve computational approaches, namely using R or Python, to automate the processes and prevent errors Analyze: Data analysis can take several forms, and should be tailored to the specific ecological project. However, all data analysis methods should be well-documented, including the procedure for analysis, justification for analysis methods, and any shortcomings in a specific approach. == Applications of Ecoinformatics Across Ecology == === Ecosystem Ecology === Source: Ecosystem studies, by definition, encompass interactions across the entire life sciences spectrum, from microscopic biochemical reactions to large-scale geological phenomena. As a result, big databases may not be designed specifically for any particular research question, but should be inclusive enough to support most studies. Since ecosystem-level questions require a broad perspective, data-related ecosystem projects would likely incorporate data from several databases. A common framework for incorporating data into ecosystem-level studies is the network science model, in which data collection mechanisms and resources are treated like a large, interconnected network instead of individual entities. The network may include several data collection stations within one databases, or may span across multiple databases. Currently there are several large-scale networks, but they do not generate data on the scale to consider ecology as a big data science. A current challenge for ecoinformatics in ecosystem ecology is that most funding is prioritized for generating new data rather than maintaining existing data infrastructures. Integrating data across the different spatial scales can also be difficult, since each dataset may hold different types of data. === Urban Ecology === Source: The current push for smart cities, and sensor network integration into infrastructure, has positioned as a major source of data for ecological studies. Typical urban ecology questions address the effects of urbanization on the local ecosystem, and how to drive future development to promote urban biodiversity. While sensor networks in cities typically collect environmental data to optimize city processes, they may also be used for ecological initiatives, especially with respect to understanding the complex, multi-layered relationship between cities and their local ecosystem. It can also be used to better understand the current landscape of cities, and identify avenues for rewinding of cities. For example, analyzing mobility patterns can identify areas that may lend themselves well to building parks and green spaces. Bird watching data can also be used to identify the types of bird species in a local area. === Infectious Disease === Source: Like other disciplines of ecology, emerging infectious disease and epidemiology span multiple scales, from understanding the genetics that drive disease trends to large-scale spatiotemporal analyses. As a result, infectious disease studies can incorporate everything from bioinformatics, genetic sequences, amino acid sequences, and environmental observation data. On the micro-scale, these data can then be used to predict infectivity/transmissibility, drug resistance, drug candidates, and mutation sites. On the macro-scale, it can be used to identify societal trends or environmental factors that lend themselves to spillover, locations of infection, and practices that cause disease transmission. == Databases == Source: USGS National Streamflow sensor network GBIF Neotoma Paleobiology database European Vegetation Archive USDA Forest Inventory Analysis TRY BIEN AmeriFlux TEAM iNaturalist NEON GLEON LTER CZO TERN SAEON

    Read more →
  • Geospatial metadata

    Geospatial metadata

    Geospatial metadata (also geographic metadata) is a type of metadata applicable to geographic data and information. Such objects may be stored in a geographic information system (GIS) or may simply be documents, data-sets, images or other objects, services, or related items that exist in some other native environment but whose features may be appropriate to describe in a (geographic) metadata catalog (may also be known as a data directory or data inventory). == Definition == ISO 19115:2013 "Geographic Information – Metadata" from ISO/TC 211, the industry standard for geospatial metadata, describes its scope as follows: [This standard] provides information about the identification, the extent, the quality, the spatial and temporal aspects, the content, the spatial reference, the portrayal, distribution, and other properties of digital geographic data and services. ISO 19115:2013 also provides for non-digital mediums: Though this part of ISO 19115 is applicable to digital data and services, its principles can be extended to many other types of resources such as maps, charts, and textual documents as well as non-geographic data. The U.S. Federal Geographic Data Committee (FGDC) describes geospatial metadata as follows: A metadata record is a file of information, usually presented as an XML document, which captures the basic characteristics of a data or information resource. It represents the who, what, when, where, why and how of the resource. Geospatial metadata commonly document geographic digital data such as Geographic Information System (GIS) files, geospatial databases, and earth imagery but can also be used to document geospatial resources including data catalogs, mapping applications, data models and related websites. Metadata records include core library catalog elements such as Title, Abstract, and Publication Data; geographic elements such as Geographic Extent and Projection Information; and database elements such as Attribute Label Definitions and Attribute Domain Values. == History == The growing appreciation of the value of geospatial metadata through the 1980s and 1990s led to the development of a number of initiatives to collect metadata according to a variety of formats either within agencies, communities of practice, or countries/groups of countries. For example, NASA's "DIF" metadata format was developed during an Earth Science and Applications Data Systems Workshop in 1987, and formally approved for adoption in 1988. Similarly, the U.S. FGDC developed its geospatial metadata standard over the period 1992–1994. The Spatial Information Council of Australia and New Zealand (ANZLIC), a combined body representing spatial data interests in Australia and New Zealand, released version 1 of its "metadata guidelines" in 1996. ISO/TC 211 undertook the task of harmonizing the range of formal and de facto standards over the approximate period 1999–2002, resulting in the release of ISO 19115 "Geographic Information – Metadata" in 2003 and a subsequent revision in 2013. As of 2011 individual countries, communities of practice, agencies, etc. have started re-casting their previously used metadata standards as "profiles" or recommended subsets of ISO 19115, occasionally with the inclusion of additional metadata elements as formal extensions to the ISO standard. The growth in popularity of Internet technologies and data formats, such as Extensible Markup Language (XML) during the 1990s led to the development of mechanisms for exchanging geographic metadata on the web. In 2004, the Open Geospatial Consortium released the current version (3.1) of Geography Markup Language (GML), an XML grammar for expressing geospatial features and corresponding metadata. With the growth of the Semantic Web in the 2000s, the geospatial community has begun to develop ontologies for representing semantic geospatial metadata. Some examples include the Hydrology and Administrative ontologies developed by the Ordnance Survey in the United Kingdom. == ISO 19115: Geographic information – Metadata == ISO 19115 is a standard of the International Organization for Standardization (ISO). The standard is part of the ISO geographic information suite of standards (19100 series). ISO 19115 and its parts define how to describe geographical information and associated services, including contents, spatial-temporal purchases, data quality, access and rights to use. The objective of this International Standard is to provide a clear procedure for the description of digital geographic data-sets so that users will be able to determine whether the data in a holding will be of use to them and how to access the data. By establishing a common set of metadata terminology, definitions and extension procedures, this standard promotes the proper use and effective retrieval of geographic data. ISO 19115 was revised in 2013 to accommodate growing use of the internet for metadata management, as well as add many new categories of metadata elements (referred to as codelists) and the ability to limit the extent of metadata use temporally or by user. == ISO 19139 Geographic information Metadata XML schema implementation == ISO 19139:2012 provides the XML implementation schema for ISO 19115 specifying the metadata record format and may be used to describe, validate, and exchange geospatial metadata prepared in XML. The standard is part of the ISO geographic information suite of standards (19100 series), and provides a spatial metadata XML (spatial metadata eXtensible Mark-up Language (smXML)) encoding, an XML schema implementation derived from ISO 19115, Geographic information – Metadata. The metadata includes information about the identification, constraint, extent, quality, spatial and temporal reference, distribution, lineage, and maintenance of the digital geographic data-set. == Metadata directories == Also known as metadata catalogues or data directories. (need discussion of, and subsections on GCMD, FGDC metadata gateway, ASDD, European and Canadian initiatives, etc. etc.) GIS Inventory – National GIS Inventory System which is maintained by the US-based National States Geographic Information Council (NSGIC) as a tool for the entire US GIS Community. Its primary purpose is to track data availability and the status of geographic information system (GIS) implementation in state and local governments to aid the planning and building of statewide spatial data infrastructures (SSDI). The Random Access Metadata for Online Nationwide Assessment (RAMONA) database is a critical component of the GIS Inventory. RAMONA moves its FGDC-compliant metadata (CSDGM Standard) for each data layer to a web folder and a Catalog Service for the Web (CSW) that can be harvested by Federal programs and others. This provides far greater opportunities for discovery of user information. The GIS Inventory website was originally created in 2006 by NSGIC under award NA04NOS4730011 from the Coastal Services Center, National Oceanic and Atmospheric Administration, U.S. Department of Commerce. The Department of Homeland Security has been the principal funding source since 2008 and they supported the development of the Version 5 during 2011/2012 under Order Number HSHQDC-11-P-00177. The Federal Emergency Management Agency and National Oceanic and Atmospheric Administration have provided additional resources to maintain and improve the GIS Inventory. Some US Federal programs require submission of CSDGM-Compliant Metadata for data created under grants and contracts that they issue. The GIS Inventory provides a very simple interface to create the required Metadata. GCMD - Global Change Master Directory's goal is to enable users to locate and obtain access to Earth science data sets and services relevant to global change and Earth science research. The GCMD database holds more than 20,000 descriptions of Earth science data sets and services covering all aspects of Earth and environmental sciences. ECHO - The EOS Clearing House (ECHO) is a spatial and temporal metadata registry, service registry, and order broker. It allows users to more efficiently search and access data and services through the Reverb Client or Application Programmer Interfaces (APIs). ECHO stores metadata from a variety of science disciplines and domains, totalling over 3400 Earth science data sets and over 118 million granule records. GoGeo - GoGeo is a service run by EDINA (University of Edinburgh) and is supported by Jisc. GoGeo allows users to conduct geographically targeted searches to discover geospatial datasets. GoGeo searches many data portals from the HE and FE community and beyond. GoGeo also allows users to create standards compliant metadata through its Geodoc metadata editor. == Geospatial metadata tools == There are many proprietary GIS or geospatial products that support metadata viewing and editing on GIS resources. For example, ESRI's ArcGIS Desktop, SOCET GXP, Autodesk's AutoCAD Map 3D 2008, Arcitecta's Mediaflux and Intergraph's Geo

    Read more →
  • Informedia Digital Library

    Informedia Digital Library

    The Informedia Digital Library is an ongoing research program at Carnegie Mellon University to build search engines and information visualization technology for many types of media. The program has carried out research on spoken document retrieval, video information retrieval, video segmentation, face recognition, and cross-language information retrieval. The Lycos search engine was an early product of the Informedia Digital Library Project. The project is led by Howard Wactlar. Researchers on the project have included: Michael Mauldin, Alex Hauptmann, Michael Christel, Michael Witbrock, Raj Reddy, Takeo Kanade and Scott Stevens.

    Read more →
  • Telebirr

    Telebirr

    Telebirr (Amharic: ቴሌብር) is a mobile payment service developed and was launched by Ethio telecom, the state owned telecommunication and Internet service provider in Ethiopia. It took five months to develop the end-to-end service. It facilitates the delivery of cashless transactions. The platform deployed currently has the capacity of processing up to 100 transactions per second (TPS) and can be scaled up to 1000 TPS. The service is accessible via SMS, USSD, and smartphone applications. Telebirr works in five languages. == Services == Though the service is fully accessible for any customer of Ethio telecom, the users need to register through the mobile application called Telebirr or using an authorized agent or Ethio telecom shop or Unstructured Supplementary Service Data (USSD), 127# nationally. However, Telebirr also provides a “quick registration” by using any information that already exists in Ethio telecom's system.

    Read more →
  • Discoverability

    Discoverability

    Discoverability is the degree to which something, especially a piece of content or information, can be found in a search of a file, database, or other information system. Discoverability is a concern in library and information science, many aspects of digital media, software and web development, and in marketing, since products and services cannot be used if people cannot find it or do not understand what it can be used for. In human-computer interaction the term is further used to describe the discoverability of interactions, features and interactive systems overall . Metadata, or "information about information", such as a book's title, a product's description, or a website's keywords, affects how discoverable something is on a database or online. Adding metadata to a product that is available online can make it easier for end users to find the product. For example, if a song file is made available online, making the title, band name, genre, year of release, and other pertinent information available in connection with this song means the file can be retrieved more easily. The organization of information through the implementation of alphabetical structures or the integration of content into search engines exemplifies strategies employed to enhance the discoverability of information. The concept of discoverability, while related to but distinct from accessibility and usability, which are other qualities that affect the usefulness of a piece of information, is a critical aspect of information retrieval. == Etymology == The concept of "discoverability" in an information science and online context is a loose borrowing from the concept of the similar name in the legal profession. In law, "discovery" is a pre-trial procedure in a lawsuit in which each party, through the law of civil procedure, can obtain evidence from the other party or parties by means of discovery devices such as a request for answers to interrogatories, request for production of documents, request for admissions and depositions. Discovery can be obtained from non-parties using subpoenas. When a discovery request is objected to, the requesting party may seek the assistance of the court by filing a motion to compel discovery. == Purpose == The usability of any piece of information directly relates to how discoverable it is, either in a "walled garden" database or on the open Internet. The quality of information available on this database or on the Internet depends upon the quality of the meta-information about each item, product, or service. In the case of a service, because of the emphasis placed on service reusability, opportunities should exist for reuse of this service. However, reuse is only possible if information is discoverable in the first place. To make items, products, and services discoverable, the process is as follows: Document the information about the item, product or service (the metadata) in a consistent manner. Store the documented information (metadata) in a searchable repository. while technically a human-searchable repository, such as a printed paper list would qualify, "searchable repository" is usually taken to mean a computer-searchable repository, such as a database that a human user can search using some type of search engine or "find" feature. Enable search for the documented information in an efficient manner. supports number 2, because while reading through a printed paper list by hand might be feasible in a theoretical sense, it is not time and cost-efficient in comparison with computer-based searching. Apart from increasing the reuse potential of the services, discoverability is also required to avoid development of solution logic that is already contained in an existing service. To design services that are not only discoverable but also provide interpretable information about their capabilities, the service discoverability principle provides guidelines that could be applied during the service-oriented analysis phase of the service delivery process. === Specific to digital media === In relation to audiovisual content, according to the meaning given by the Canadian Radio-television and Telecommunications Commission (CRTC) for the purpose of its 2016 Discoverability Summit, discoverability can be summed up to the intrinsic ability of given content to "stand out of the lot", or to position itself so as to be easily found and discovered. A piece of audiovisual content can be a movie, a TV series, music, a book (eBook), an audio book or podcast. When audiovisual content such as a digital file for a TV show, movie, or song, is made available online, if the content is "tagged" with identifying information such as the names of the key artists (e.g., actors, directors and screenwriters for TV shows and movies; singers, musicians and record producers for songs) and the genres (for movies genres, music genres, etc.). When users interact with online content, algorithms typically determine what types of content the user is interested in, and then a computer program suggests "more like this", which is other content that the user may be interested in. Different websites and systems have different algorithms, but one approach, used by Amazon (company) for its online store, is to indicate to a user: "customers who bought x also bought y" (affinity analysis, collaborative filtering). This example is oriented around online purchasing behaviour, but an algorithm could also be programmed to provide suggestions based on other factors (e.g., searching, viewing, etc.). Discoverability is typically referred to in connection with search engines. A highly "discoverable" piece of content would appear at the top, or near the top of a user's search results. A related concept is the role of "recommendation engines", which give a user recommendations based on his/her previous online activity. Discoverability applies to computers and devices that can access the Internet, including various console video game systems and mobile devices such as tablets and smartphones. When producers make an effort to promote content (e.g., a TV show, film, song, or video game), they can use traditional marketing (billboards, TV ads, radio ads) and digital ads (pop-up ads, pre-roll ads, etc.), or a mix of traditional and digital marketing. Even before the user's intervention by searching for a certain content or type of content, discoverability is the prime factor which contributes to whether a piece of audiovisual content will be likely to be found in the various digital modes of content consumption. As of 2017, modes of searching include looking on Netflix for movies, Spotify for music, Audible for audio books, etc., although the concept can also more generally be applied to content found on Twitter, Tumblr, Instagram, and other websites. It involves more than a content's mere presence on a given platform; it can involve associating this content with "keywords" (tags), search algorithms, positioning within different categories, metadata, etc. Thus, discoverability enables as much as it promotes. For audiovisual content broadcast or streamed on digital media using the Internet, discoverability includes the underlying concepts of information science and programming architecture, which are at the very foundation of the search for a specific product, information or content. === Human-Computer Interaction === In human–computer interaction (HCI), discoverability refers to the ability of users to perceive and comprehend a system, function, or input method upon encountering it, despite a lack of prior awareness or knowledge, whether through intentional effort or serendipitously . The concept was popularised by Don Norman, who framed it around whether users can determine what actions are possible and how to perform them . Discoverability is considered a precondition for learnability, though the two concepts are frequently conflated in the literature . == Applications == === Within a webpage === Within a specific webpage or software application ("app"), the discoverability of a feature, content or link depends on a range of factors, including the size, colour, highlighting features, and position within the page. When colour is used to communicate the importance of a feature or link, designers typically use other elements as well, such as shadows or bolding, for individuals, who cannot see certain colours. Just as traditional paper printing created other physical locations that stood out, such as being "above the fold" of a newspaper versus "below the fold", a web page or app's screenview may have certain locations that give features additional visibility to users, such as being right at the bottom of the web page or screen. The positional advantages or disadvantages of various locations depend on different cultures and languages (e.g., left to right vs. right to left). Some locations have become established, such as having toolbars at the top of a screen or webpage. Some designers have argued t

    Read more →
  • Shapiro–Senapathy algorithm

    Shapiro–Senapathy algorithm

    The Shapiro—Senapathy algorithm (S&S) is a computational method for identifying splice sites in eukaryotic genes. The algorithm employs a Position Weight Matrix (PWM) scoring formula to predict donor and acceptor splice sites in any given gene. This methodology has been used to discover splice sites and disease-causing splice site mutations in the human genome, and has become a standard tool in clinical genomics. The S&S algorithm has been cited in thousands of clinical studies, according to Google Scholar. It has also formed the basis of widely used software, including Human Splicing Finder, SROOGLE, and Alamut, which identify splice sites and splice site mutations that cause disease. The algorithm has uncovered splicing mutations in diseases ranging from cancers to inherited disorders, and predicted the deleterious effects of these mutations including exon skipping, intron retention, and cryptic splice site activation. == The algorithm == A splice site defines the boundary between a coding exon and a non-coding intron in eukaryotic genes. The S&S algorithm employs a sliding window, corresponding to the length of the splice site motif, to scan a gene sequence and detect potential splice sites. For each sliding window, the algorithm calculates a score by comparing the nucleotide sequence to a Position Weight Matrix (PWM) derived from known splice sites. This formula generates a percentile score, indicating the likelihood that a given sequence functions as a donor or acceptor splice site. The majority of disease-causing mutations in the human genome are located in splice sites. Clinical genomics studies analyze the splice site scores generated by the S&S algorithm to predict the consequences of splice site mutations including exon skipping and intron retention. The algorithm's sensitivity to single-nucleotide changes allows it to determine mutations that may impact RNA splicing and contribute to disease. In addition to identifying real splice sites, the S&S algorithm has been used to discover cryptic splice sites — alternative splice sites activated by mutations — which may disrupt normal splicing. The algorithm detects mutations that lead to the activation of cryptic splice sites, which may be located proximal to real splice sites or deep within non-coding introns. It has thus been used to determine the causes of numerous diseases that are due to cryptic splicing. == Cancer gene discovery using S&S == The S&S algorithm has been used to identify splice-site mutations in genes associated with several cancers. For example, genes causing commonly occurring cancers including breast cancer, ovarian cancer, colorectal cancer, leukemia, head and neck cancers, prostate cancer, retinoblastoma, squamous cell carcinoma, gastrointestinal cancer, melanoma, liver cancer, Lynch syndrome, skin cancer, and neurofibromatosis have been found. In addition, splicing mutations in genes causing less commonly known cancers including gastric cancer, gangliogliomas, Li-Fraumeni syndrome, Loeys–Dietz syndrome, Osteochondromas (bone tumor), Nevoid basal cell carcinoma syndrome, and Pheochromocytomas have been identified. Specific mutations in different splice sites in various genes causing breast cancer (e.g., BRCA1, PALB2), ovarian cancer (e.g., SLC9A3R1, COL7A1, HSD17B7), colon cancer (e.g., APC, MLH1, DPYD), colorectal cancer (e.g., COL3A1, APC, HLA-A), skin cancer (e.g., COL17A1, XPA, POLH), and Fanconi anemia (e.g., FANC, FANA) have been uncovered. The mutations in the donor and acceptor splice sites in different genes causing a variety of cancers that have been identified by S&S are shown in Table 1. == Discovery of genes causing inherited disorders using S&S == Specific mutations in different splice sites in various genes that cause inherited disorders, including, for example, Type 1 diabetes (e.g., PTPN22, TCF1 (HCF-1A)), hypertension (e.g., LDL, LDLR, LPL), Marfan syndrome (e.g., FBN1, TGFBR2, FBN2), cardiac diseases (e.g., COL1A2, MYBPC3, ACTC1), eye disorders (e.g., EVC, VSX1) have been uncovered. A few example mutations in the donor and acceptor splice sites in different genes causing a variety of inherited disorders identified using S&S are shown in Table 2. == Genes causing immune system disorders == More than 100 immune system disorders affect humans, including inflammatory bowel diseases, multiple sclerosis, systemic lupus erythematosus, bloom syndrome, familial cold autoinflammatory syndrome, and dyskeratosis congenita. The Shapiro–Senapathy algorithm has been used to discover genes and mutations involved in many immune disorder diseases, including Ataxia telangiectasia, B-cell defects, epidermolysis bullosa, and X-linked agammaglobulinemia. Xeroderma pigmentosum, an autosomal recessive disorder is caused by faulty proteins formed due to new preferred splice donor site identified using S&S algorithm and resulted in defective nucleotide excision repair. Type I Bartter syndrome (BS) is caused by mutations in the gene SLC12A1. S&S algorithm helped in disclosing the presence of two novel heterozygous mutations c.724 + 4A > G in intron 5 and c.2095delG in intron 16 leading to complete exon 5 skipping. Mutations in the MYH gene, which is responsible for removing the oxidatively damaged DNA lesion are cancer-susceptible in the individuals. The IVS1+5C plays a causative role in the activation of a cryptic splice donor site and the alternative splicing in intron 1, S&S algorithm shows, guanine (G) at the position of IVS+5 is well conserved (at the frequency of 84%) among primates. This also supported the fact that the G/C SNP in the conserved splice junction of the MYH gene causes the alternative splicing of intron 1 of the β type transcript. Splice site scores were calculated according to S&S to find EBV infection in X-linked lymphoproliferative disease. Identification of Familial tumoral calcinosis (FTC) is an autosomal recessive disorder characterized by ectopic calcifications and elevated serum phosphate levels and it is because of aberrant splicing. == Application of S&S in hospitals for clinical practice and research == The Shapiro–Senapathy (S&S) algorithm has played a significant role in advancing the diagnosis and treatment of human diseases through its application in modern clinical genomics. With the widespread adoption of next-generation sequencing (NGS) technologies, the S&S algorithm is now routinely integrated into clinical practice by geneticists and diagnostic laboratories. It is implemented in various computational tools such as Human Splicing Finder (HSF), Splice Site Finder (SSF), and Alamut Visual, which assist in interpreting the functional impact of genetic variants on RNA splicing. The algorithm is particularly useful in identifying pathogenic splice site mutations in cases where the clinical presentation is unclear or where conventional diagnostic methods have failed to identify a causative gene. Its utility has been demonstrated across diverse patient cohorts, including individuals from different ethnic backgrounds with various cancers and inherited genetic disorders. The following are selected examples illustrating its application in clinical research. === Cancers === === Inherited disorders === == S&S - Algorithm for identifying splice sites, exons and split genes == The Shapiro–Senapathy algorithm (SSA) was developed to identify splice sites in uncharacterized genomic sequences, with early applications in the Human Genome Project. The method introduced a Position Weight Matrix (PWM)-based approach to analyze splicing sequences across eukaryotic organisms, marking the first computational framework to systematically define splice sites using probabilistic scoring. Key innovations of the algorithm included: Exon Detection – Exons were defined as sequences bounded by acceptor and donor splice sites with S&S scores above a threshold, requiring an open reading frame (ORF) for validation. Gene Prediction – The method enabled the identification of complete genes by assembling predicted exons, forming a basis for later gene-finding tools. Mutation Analysis – The algorithm distinguishes deleterious splice-site mutations (which disrupt protein function by lowering S&S scores) from neutral variations. This capability allowed researchers to study disease-linked cryptic splice sites in humans, animals, and plants. SSA's PWM-based framework influenced subsequent computational methods, including machine learning and neural network approaches, for splice-site prediction and alternative splicing research. It remains a foundational tool in genomics and disease studies. == Discovering the mechanisms of aberrant splicing in diseases == The Shapiro–Senapathy algorithm has been used to determine the various aberrant splicing mechanisms in genes due to deleterious mutations in the splice sites, which cause numerous diseases. Deleterious splice site mutations impair the normal splicing of the gene transcripts, and thereby make the encoded protei

    Read more →
  • Terminology extraction

    Terminology extraction

    Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus. In the semantic web era, a growing number of communities and networked enterprises started to access and interoperate through the internet. Modeling these communities and their information needs is important for several web applications, like topic-driven web crawlers, web services, recommender systems, etc. The development of terminology extraction is also essential to the language industry. One of the first steps to model a knowledge domain is to collect a vocabulary of domain-relevant terms, constituting the linguistic surface manifestation of domain concepts. Several methods to automatically extract technical terms from domain-specific document warehouses have been described in the literature. Typically, approaches to automatic term extraction make use of linguistic processors (part of speech tagging, phrase chunking) to extract terminological candidates, i.e. syntactically plausible terminological noun phrases. Noun phrases include compounds (e.g. "credit card"), adjective noun phrases (e.g. "local tourist information office"), and prepositional noun phrases (e.g. "board of directors"). In English, the first two (compounds and adjective noun phrases) are the most frequent. Terminological entries are then filtered from the candidate list using statistical and machine learning methods. Once filtered, because of their low ambiguity and high specificity, these terms are particularly useful for conceptualizing a knowledge domain or for supporting the creation of a domain ontology or a terminology base. Furthermore, terminology extraction is a very useful starting point for semantic similarity, knowledge management, human translation and machine translation, etc. == Bilingual terminology extraction == The methods for terminology extraction can be applied to parallel corpora. Combined with e.g. co-occurrence statistics, candidates for term translations can be obtained. Bilingual terminology can be extracted also from comparable corpora (corpora containing texts within the same text type, domain but not translations of documents between each other).

    Read more →
  • Structural risk minimization

    Structural risk minimization

    Structural risk minimization (SRM) is an inductive principle of use in machine learning. Commonly in machine learning, a generalized model must be selected from a finite data set, with the consequent problem of overfitting – the model becoming too strongly tailored to the particularities of the training set and generalizing poorly to new data. The SRM principle addresses this problem by balancing the model's complexity against its success at fitting the training data. This principle was first set out in a 1974 book by Vladimir Vapnik and Alexey Chervonenkis and uses the VC dimension. In practical terms, Structural Risk Minimization is implemented by minimizing E t r a i n + β H ( W ) {\displaystyle E_{train}+\beta H(W)} , where E t r a i n {\displaystyle E_{train}} is the train error, the function H ( W ) {\displaystyle H(W)} is called a regularization function, and β {\displaystyle \beta } is a constant. H ( W ) {\displaystyle H(W)} is chosen such that it takes large values on parameters W {\displaystyle W} that belong to high-capacity subsets of the parameter space. Minimizing H ( W ) {\displaystyle H(W)} in effect limits the capacity of the accessible subsets of the parameter space, thereby controlling the trade-off between minimizing the training error and minimizing the expected gap between the training error and test error. The SRM problem can be formulated in terms of data. Given n data points consisting of data x and labels y, the objective J ( θ ) {\displaystyle J(\theta )} is often expressed in the following manner: J ( θ ) = 1 2 n ∑ i = 1 n ( h θ ( x i ) − y i ) 2 + λ 2 ∑ j = 1 d θ j 2 {\displaystyle J(\theta )={\frac {1}{2n}}\sum _{i=1}^{n}(h_{\theta }(x^{i})-y^{i})^{2}+{\frac {\lambda }{2}}\sum _{j=1}^{d}\theta _{j}^{2}} The first term is the mean squared error (MSE) term between the value of the learned model, h θ {\displaystyle h_{\theta }} , and the given labels y {\displaystyle y} . This term is the training error, E t r a i n {\displaystyle E_{train}} , that was discussed earlier. The second term, places a prior over the weights, to favor sparsity and penalize larger weights. The trade-off coefficient, λ {\displaystyle \lambda } , is a hyperparameter that places more or less importance on the regularization term. Larger λ {\displaystyle \lambda } encourages sparser weights at the expense of a more optimal MSE, and smaller λ {\displaystyle \lambda } relaxes regularization allowing the model to fit to data. Note that as λ → ∞ {\displaystyle \lambda \to \infty } the weights become zero, and as λ → 0 {\displaystyle \lambda \to 0} , the model typically suffers from overfitting.

    Read more →
  • Artificial intelligence industry in Italy

    Artificial intelligence industry in Italy

    The artificial intelligence industry in Italy is growing and supports industrial development. In 2024 it reached a new record, reaching 1.2 billion euros with a growth of +58% compared to 2023. While in 2025, the growth of artificial intelligence in the industrial application was even greater than in 2024 both in terms of value and application to industrial sectors. == History == The roots of AI research in Italy extend back to the 1970s, when Italian scholars began exploring automated reasoning, programming language semantics, and pattern recognition. Researchers such as those involved in early projects at the National Research Council and various universities laid the groundwork for subsequent academic and industrial developments in the field. During this period, the focus was predominantly on developing algorithms for automated theorem proving and building systems to reason about complex mathematical problems. This era witnessed the birth of methodologies that would later influence numerous AI subfields, from natural language processing (NLP) to robotics. === Institutional milestones and academic contributions === A turning point in the Italian AI landscape was the formation of the Italian Association for Artificial Intelligence (AIxIA) in 1988. Founded by academics, including Luigia Carlucci Aiello, the association established a platform for collaboration between universities, research centers, and industry. Led by Aiello, AIIA played a role in promoting research, organizing national conferences, and fostering international partnerships that connected Italy's AI community to global networks. At the same time, professors such as Roberto Navigli and numerous practitioners contributed to the advancement of AI in Italy. Navigli has worked in multilingual NLP, including the creation of BabelNet, and led the Minerva project. === Industrial AI === Over recent decades, numerous national and European initiatives supported by funding from programs such as the National Recovery and Resilience Plan (PNRR) have spurred the transition from theoretical research to practical applications. Industrial sectors including manufacturing, banking, and healthcare increasingly embraced AI-driven automation, while research institutions collaborated with industrial partners to deploy cutting-edge solutions. In recent years, Italy has also seen the establishment of specialized research centers and institutes aimed at bridging the gap between academic innovation and industrial application. These initiatives indicate a broader national commitment to integrating AI into the fabric of Italian industry. == Recent developments == === Emergence of generative AI === A landmark in Italy's modern AI evolution is the development of Minerva AI. Developed by the Sapienza NLP research group at Sapienza University of Rome and led by Professor Roberto Navigli, Minerva represents the first family of large language models (LLMs) trained from scratch with a primary focus on the Italian language. ==== Minerva 7B ==== The latest iteration, Minerva 7B, has 7 billion parameters and has been trained on an extensive corpus of over 1.5 trillion words. By using advanced instruction tuning techniques, Minerva 7B is able to produce highly accurate, coherent, and contextually sensitive responses addressing common issues such as hallucinations and inappropriate content generation. This breakthrough sets a benchmark for transparent, open-source AI development in the country. Minerva's development, carried out within the FAIR (Future Artificial Intelligence Research) project in collaboration with CINECA and supported by supercomputing resources like the Leonardo (supercomputer), aligns closely with Italy's cultural and linguistic heritage. === Establishment of AI4I === The recent establishment of the Istituto Italiano per l’Intelligenza Artificiale (AI4I) is part of Italy's strategy to improve its industrial competitiveness in AI. This dedicated institute aims to bridge the gap between research institutions and industrial enterprises; promote training and R&D support to nurture the next generation of Italian AI experts; and enhance national competitiveness. This initiative is expected to serve as a hub for applied AI research, driving innovations that are tailored to the specific needs of Italian industry and public administration. === Benefits of InvestAI === Italy's AI industry stands to benefit from the European InvestAI initiative, a plan unveiled at the recent AI Action Summit in Paris. InvestAI is an effort by the European Commission to mobilize €200 billion for AI investments, with a dedicated €20 billion fund earmarked for building AI gigafactories. These gigafactories are planned as large-scale hubs for training advanced, complex AI models using approximately 100,000 last-generation AI chips. For Italy, this investment presents several major opportunities: Access to State-of-the-Art Infrastructure: Italian companies, research institutions, and start-ups can leverage the gigafactories’ immense computational resources, enabling them to train highly sophisticated language models and other AI systems. Enhanced Competitiveness and Collaboration: With InvestAI's layered funding model where EU funds help de-risk private investments Italian firms can access capital more readily. This will bolster public–private partnerships and create a more dynamic AI ecosystem that spans from academic research to industrial applications. Alignment with National and Regional Initiatives: The Istituto Italiano per l’Intelligenza Artificiale (AI4I), based in Turin, is already recognized as a strategic asset by both Italy and the European Union. As the main recipient of InvestAI funds in Italy, AI4I will play a pivotal role in implementing these investments locally, fostering innovation in sectors like manufacturing, healthcare and aerospace. Commission President Ursula von der Leyen emphasized that InvestAI is designed to democratize AI innovation throughout Europe by ensuring that even smaller companies have access to high-performance computing power. For Italy, this means not only keeping pace with global leaders but also harnessing European-scale investments to transform its AI industry and drive economic growth.

    Read more →
  • Discoverability

    Discoverability

    Discoverability is the degree to which something, especially a piece of content or information, can be found in a search of a file, database, or other information system. Discoverability is a concern in library and information science, many aspects of digital media, software and web development, and in marketing, since products and services cannot be used if people cannot find it or do not understand what it can be used for. In human-computer interaction the term is further used to describe the discoverability of interactions, features and interactive systems overall . Metadata, or "information about information", such as a book's title, a product's description, or a website's keywords, affects how discoverable something is on a database or online. Adding metadata to a product that is available online can make it easier for end users to find the product. For example, if a song file is made available online, making the title, band name, genre, year of release, and other pertinent information available in connection with this song means the file can be retrieved more easily. The organization of information through the implementation of alphabetical structures or the integration of content into search engines exemplifies strategies employed to enhance the discoverability of information. The concept of discoverability, while related to but distinct from accessibility and usability, which are other qualities that affect the usefulness of a piece of information, is a critical aspect of information retrieval. == Etymology == The concept of "discoverability" in an information science and online context is a loose borrowing from the concept of the similar name in the legal profession. In law, "discovery" is a pre-trial procedure in a lawsuit in which each party, through the law of civil procedure, can obtain evidence from the other party or parties by means of discovery devices such as a request for answers to interrogatories, request for production of documents, request for admissions and depositions. Discovery can be obtained from non-parties using subpoenas. When a discovery request is objected to, the requesting party may seek the assistance of the court by filing a motion to compel discovery. == Purpose == The usability of any piece of information directly relates to how discoverable it is, either in a "walled garden" database or on the open Internet. The quality of information available on this database or on the Internet depends upon the quality of the meta-information about each item, product, or service. In the case of a service, because of the emphasis placed on service reusability, opportunities should exist for reuse of this service. However, reuse is only possible if information is discoverable in the first place. To make items, products, and services discoverable, the process is as follows: Document the information about the item, product or service (the metadata) in a consistent manner. Store the documented information (metadata) in a searchable repository. while technically a human-searchable repository, such as a printed paper list would qualify, "searchable repository" is usually taken to mean a computer-searchable repository, such as a database that a human user can search using some type of search engine or "find" feature. Enable search for the documented information in an efficient manner. supports number 2, because while reading through a printed paper list by hand might be feasible in a theoretical sense, it is not time and cost-efficient in comparison with computer-based searching. Apart from increasing the reuse potential of the services, discoverability is also required to avoid development of solution logic that is already contained in an existing service. To design services that are not only discoverable but also provide interpretable information about their capabilities, the service discoverability principle provides guidelines that could be applied during the service-oriented analysis phase of the service delivery process. === Specific to digital media === In relation to audiovisual content, according to the meaning given by the Canadian Radio-television and Telecommunications Commission (CRTC) for the purpose of its 2016 Discoverability Summit, discoverability can be summed up to the intrinsic ability of given content to "stand out of the lot", or to position itself so as to be easily found and discovered. A piece of audiovisual content can be a movie, a TV series, music, a book (eBook), an audio book or podcast. When audiovisual content such as a digital file for a TV show, movie, or song, is made available online, if the content is "tagged" with identifying information such as the names of the key artists (e.g., actors, directors and screenwriters for TV shows and movies; singers, musicians and record producers for songs) and the genres (for movies genres, music genres, etc.). When users interact with online content, algorithms typically determine what types of content the user is interested in, and then a computer program suggests "more like this", which is other content that the user may be interested in. Different websites and systems have different algorithms, but one approach, used by Amazon (company) for its online store, is to indicate to a user: "customers who bought x also bought y" (affinity analysis, collaborative filtering). This example is oriented around online purchasing behaviour, but an algorithm could also be programmed to provide suggestions based on other factors (e.g., searching, viewing, etc.). Discoverability is typically referred to in connection with search engines. A highly "discoverable" piece of content would appear at the top, or near the top of a user's search results. A related concept is the role of "recommendation engines", which give a user recommendations based on his/her previous online activity. Discoverability applies to computers and devices that can access the Internet, including various console video game systems and mobile devices such as tablets and smartphones. When producers make an effort to promote content (e.g., a TV show, film, song, or video game), they can use traditional marketing (billboards, TV ads, radio ads) and digital ads (pop-up ads, pre-roll ads, etc.), or a mix of traditional and digital marketing. Even before the user's intervention by searching for a certain content or type of content, discoverability is the prime factor which contributes to whether a piece of audiovisual content will be likely to be found in the various digital modes of content consumption. As of 2017, modes of searching include looking on Netflix for movies, Spotify for music, Audible for audio books, etc., although the concept can also more generally be applied to content found on Twitter, Tumblr, Instagram, and other websites. It involves more than a content's mere presence on a given platform; it can involve associating this content with "keywords" (tags), search algorithms, positioning within different categories, metadata, etc. Thus, discoverability enables as much as it promotes. For audiovisual content broadcast or streamed on digital media using the Internet, discoverability includes the underlying concepts of information science and programming architecture, which are at the very foundation of the search for a specific product, information or content. === Human-Computer Interaction === In human–computer interaction (HCI), discoverability refers to the ability of users to perceive and comprehend a system, function, or input method upon encountering it, despite a lack of prior awareness or knowledge, whether through intentional effort or serendipitously . The concept was popularised by Don Norman, who framed it around whether users can determine what actions are possible and how to perform them . Discoverability is considered a precondition for learnability, though the two concepts are frequently conflated in the literature . == Applications == === Within a webpage === Within a specific webpage or software application ("app"), the discoverability of a feature, content or link depends on a range of factors, including the size, colour, highlighting features, and position within the page. When colour is used to communicate the importance of a feature or link, designers typically use other elements as well, such as shadows or bolding, for individuals, who cannot see certain colours. Just as traditional paper printing created other physical locations that stood out, such as being "above the fold" of a newspaper versus "below the fold", a web page or app's screenview may have certain locations that give features additional visibility to users, such as being right at the bottom of the web page or screen. The positional advantages or disadvantages of various locations depend on different cultures and languages (e.g., left to right vs. right to left). Some locations have become established, such as having toolbars at the top of a screen or webpage. Some designers have argued t

    Read more →