AI Coding Kya Hota Hai

AI Coding Kya Hota Hai — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Quantification (machine learning)

    Quantification (machine learning)

    In machine learning, quantification (variously called learning to quantify, or supervised prevalence estimation, or class prior estimation) is the task of using supervised learning in order to train models (quantifiers) that estimate the relative frequencies (also known as prevalence values) of the classes of interest in a sample of unlabelled data items. For instance, in a sample of 100,000 unlabelled tweets known to express opinions about a certain political candidate, a quantifier may be used to estimate the percentage of these tweets which belong to class `Positive' (i.e., which manifest a positive stance towards this candidate), and to do the same for classes `Neutral' and `Negative'. Quantification may also be viewed as the task of training predictors that estimate a (discrete) probability distribution, i.e., that generate a predicted distribution that approximates the unknown true distribution of the items across the classes of interest. Quantification is different from classification, since the goal of classification is to predict the class labels of individual data items, while the goal of quantification it to predict the class prevalence values of sets of data items. Quantification is also different from regression, since in regression the training data items have real-valued labels, while in quantification the training data items have class labels. It has been shown in multiple research works that performing quantification by classifying all unlabelled instances and then counting the instances that have been attributed to each class (the 'classify and count' method) usually leads to suboptimal quantification accuracy. This suboptimality may be seen as a direct consequence of 'Vapnik's principle', which states: If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve a more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem. In our case, the problem to be solved directly is quantification, while the more general intermediate problem is classification. As a result of the suboptimality of the 'classify and count' method, quantification has evolved as a task in its own right, different (in goals, methods, techniques, and evaluation measures) from classification. == Quantification tasks == === Quantification tasks according to the set of classes === The main variants of quantification, according to the characteristics of the set of classes used, are: Binary quantification, corresponding to the case in which there are only n = 2 {\displaystyle n=2} classes and each data item belongs to exactly one of them; Single-label multiclass quantification, corresponding to the case in which there are n > 2 {\displaystyle n>2} classes and each data item belongs to exactly one of them; Multi-label multiclass quantification, corresponding to the case in which there are n ≥ 2 {\displaystyle n\geq 2} classes and each data item can belong to zero, one, or several classes at the same time; Ordinal quantification, corresponding to the single-label multiclass case in which a total order is defined on the set of classes. Regression quantification, a task which stands to 'standard' quantification as regression stands to classification. Strictly speaking, this task is not a quantification task as defined above (since the individual items do not have class labels but are labelled by real values), but has enough commonalities with other quantification tasks to be considered one of them. Most known quantification methods address the binary case or the single-label multiclass case, and only few of them address the multi-label, ordinal, and regression cases. Binary-only methods include the Mixture Model (MM) method, the HDy method, SVM(KLD), and SVM(Q). Methods that can deal with both the binary case and the single-label multiclass case include probabilistic classify and count (PCC), adjusted classify and count (ACC), probabilistic adjusted classify and count (PACC), the Saerens-Latinne-Decaestecker EM-based method (SLD), and KDEy. Methods for multi-label quantification include regression-based quantification (RQ) and label powerset-based quantification (LPQ). Methods for the ordinal case include ordinal versions of the above-mentioned ACC, PACC, and SLD methods, and ordinal versions of the above-mentioned HDy method. Methods for the regression case include Regress and splice and Adjusted regress and sum. === Quantification tasks according to the type of data === Several subtasks of quantification may be identified according to the type of data involved. Example such tasks are: Quantification of networked data. This task consists of performing quantification when the datapoints are members of a relation, i.e., are interlinked. As such, this task is a strict relative of collective classification. Quantification over time. This task consists of performing quantification on sets that become available in a temporal sequence, i.e., as a data stream, and finds application in contexts in which class prevalence values must be monitored over time. == Evaluation measures for quantification == Several evaluation measures can be used for evaluating the error of a quantification method. Since quantification consists of generating a predicted probability distribution that estimates a true probability distribution, these evaluation measures are ones that compare two probability distributions. Most evaluation measures for quantification belong to the class of divergences. Evaluation measures for binary quantification, single-label multiclass quantification, and multi-label quantification, are Absolute Error Squared Error Relative Absolute Error Kullback–Leibler divergence Pearson Divergence Evaluation measures for ordinal quantification are Normalized Match Distance (a particular case of the Earth Mover's Distance) Root Normalized Order-Aware Distance == Applications == Quantification is of special interest in fields such as the social sciences, epidemiology, market research, allocating resources, and ecological modelling, since these fields are inherently concerned with aggregate data. However, quantification is also useful as a building block for solving other downstream tasks, such as improving the accuracy of classifiers on out-of-distribution data, measuring classifier bias and ranker bias, and estimating the accuracy of classifiers on out-of-distribution data. == Resources == LQ 2021: the 1st International Workshop on Learning to Quantify LQ 2022: the 2nd International Workshop on Learning to Quantify LQ 2023: the 3rd International Workshop on Learning to Quantify LQ 2024: the 4th International Workshop on Learning to Quantify LQ 2025: the 5th International Workshop on Learning to Quantify LeQua 2022: the 1st Data Challenge on Learning to Quantify LeQua 2024: the 2nd Data Challenge on Learning to Quantify QuaPy: An open-source Python-based software library for quantification QuantificationLib: A Python library for quantification and prevalence estimation

    Read more →
  • BRS/Search

    BRS/Search

    BRS/Search is a full-text database and information retrieval system. BRS/Search uses a fully inverted indexing system to store, locate, and retrieve unstructured data. It was the search engine that in 1977 powered Bibliographic Retrieval Services (BRS) commercial operations with 20 databases (including the first national commercial availability of MEDLINE); it has changed ownership several times during its development and is currently sold as Livelink ECM Discovery Server by Open Text Corporation. == Early development == Development on what was to become BRS began as Biomedical Communications Network (BCN) at the State University of New York at Albany (SUNY). BCN, which went online in 1968, provided on-line access to nine databases, including MEDLINE and BIOSIS Previews, to large universities and medical schools primarily in the Northeast of the USA. State funding for the project was withdrawn in 1975, and Bibliographic Retrieval Services (BRS) was formed as a non-profit concern the following year. It was incorporated in May 1976 as a for-profit corporation with Ron Quake as president, Jan Egeland as vice president in charge of marketing and training, and Lloyd Palmer as vice president of systems. == BRS commercial operations == In December 1976, the First BRS User Meeting was held in Syracuse, New York, and by January 1977 BRS started commercial operations with 20 databases (including the first national commercial availability of MEDLINE) and 9 million records, using modified IBM STAIRS (STorage And Information Retrieval System) software, Telenet for telecommunications, and timesharing mainframe computers of Carrier Corporation. In October 1980 BRS was sold by Egeland and Quake to Indian Head, Inc., a subsidiary of the Dutch company Thyssen-Bornemisza Group. == 1989–1993 == In 1989 Robert Maxwell acquired BRS and the BRS/Search software; he announced the planned incorporation of the ORBIT Search Service and BRS Information Technologies and renamed the whole group Maxwell Online, Inc. At that time BRS Information Technologies was serving the medical and academic library marketplace with over 150 databases. Maxwell later bought the publishing company Macmillan and put Maxwell Online under Macmillan. In the same year BRS/LINK (hypertext connection of databases; first application delivering full text) was announced. The initial BRS/LINK application "relates the citation in a bibliographic database to its full-text article in a second database," and "eliminates the need to re-execute a search strategy in the second database in order to find the corresponding full-text article." Initially BRS/LINK supported linking only selected bibliographic databases: MEDLINE, Health Planning and Administration, and MEDLINE References on AIDS to the full-text Comprehensive Core Medical Library. At the time of Robert Maxwell’s death in 1991, Macmillan brought in Andrew Gregory to represent the company during the 2 years that Maxwell’s affairs were being settled and to prepare Maxwell Online to be able to sell the components. Maxwell Online shortly thereafter underwent yet another name change, this time to InfoPro Technologies. == Dataware Technologies ownership of BRS/SEARCH == Early in 1994, InfoPro Technologies, a subsidiary of MHC Inc. (holding company for Macmillan Inc.), the former Maxwell Online service, sold off all its subsidiaries. ORBIT Search Services went to the French-owned Questel, the dial-up BRS Search Services to CD Plus Technologies (later to become OVID), and BRS Software Products (including BRS/SEARCH) to Dataware Technologies. Almost up to the end of InfoPro Technologies, BRS Software had been the fastest growing segment of the company. At the 14th BRS North American Users Group Conference in 1999, Dave Schubmehl of Dataware Technologies presented a paper in which he stated "The purpose of this presentation is to update BRS users on upcoming releases of BRS/Search, NetAnswer, and other Dataware products. BRS/Search 7.0 will include features specifically requested by customers, as well as other enhancements. Earlier this year, Dataware acquired Sovereign Hill Software, makers of InQuery. In light of that acquisition, and Dataware's other development projects, we'll look at Dataware's plans for all products, including BRS/Search and NetAnswer." == Open Text acquisition of BRS/Search == In 2001 BRS/Search was acquired by Open Text and became LiveLink ECM Discovery Server. It is now referred to as Open Text Discovery Server. Open Text still supports both BRS/Search and NetAnswer. The core BRS/Search technology in the Open Text portfolio was augmented with other capabilities through various acquisitions. For example, Dataware's acquisition of Sovereign-Hill brought InQuery, “a probabilistic information retrieval system using an inference network”, which was developed by the University of Massachusetts Amherst Center for Intelligent Information Retrieval] out of the UMass CIIR and into the marketplace. A product re-branding table shows the range of products, their old names and their new names. InQuery is a concept search engine that uses noun phrases, parts of speech and other co-occurrence relationships in overlapping passages of text rather than single term inverted indexes of single words in documents. Open Text's portfolio has grown to include Hummingbird Content Management, and has always included BASIS. == 2003 == BRS/Search North America User's Group (BRSNAUG) website with a June 8, 2003 date listed the following features for BRS/Search. The BRSNAUG also disincorporated in 2003. Cross-references to BRS/Search on the World Wide Web point to Open Text Livelink. Engine features include: Rapid query response time. Numerical data handling and elementary statistical processing (sum, avg, min, max) Search results weighting and relevancy ranking Left- and right-truncation and expansion of search terms Superior data compression – loaded databases typically use only about 1.5 times the input stream size in disk space Large capacity databases – up to 100 million documents, each with up to 65,000 paragraphs Fine control of indexing and searching – right down to the word, sentence, and paragraph level Fine control over data security. Document access can be controlled at the database, document, and paragraph level International language support for all 7/8 bit characters sets and customizable language tables Flexible and customizable stop word lists ANSI-compatible thesauri Hypertext links within and between documents and databases (R6.x) Support for natural language parsing of queries Automatic document summarization tools Client/Server development Programming interfaces for World-Wide Web (HTTP, HTML) access to databases

    Read more →
  • Literature review

    Literature review

    A literature review is an overview of previously published works on a particular topic. The term can refer to a full scholarly paper or a section of a scholarly work such as books or articles. Either way, a literature review provides the researcher/author and the audiences with general information of an existing knowledge of a particular topic. A good literature review has a proper research question, a proper theoretical framework, and/or a chosen research method. It serves to situate the current study within the body of the relevant literature and provides context for the reader. In such cases, the review usually precedes the methodology and results sections of the work. Producing a literature review is often part of a graduate and post-graduate requirement, included in the preparation of a thesis, dissertation, or a journal article. Literature reviews are also common in a research proposal or prospectus (the document approved before a student formally begins a dissertation or thesis). A literature review can be a type of a review article. In this sense, it is a scholarly paper that presents the current knowledge including substantive findings as well as theoretical and methodological contributions to a particular topic. Literature reviews are secondary sources and do not report new or original experimental work. Most often associated with academic-oriented literature, such reviews are found in academic journals and are not to be confused with book reviews, which may also appear in the same publication. Literature reviews are a basis for research in nearly every academic field. == Types == Since the concept of a systematic review was formalized in the 1970s, a basic division among types of reviews is the dichotomy of narrative reviews versus systematic reviews. The main types of narrative reviews are evaluative, exploratory, and instrumental. A fourth type of review of literature (the scientific literature) is the systematic review but it is not called a literature review, which absent further specification, conventionally refers to narrative reviews. A systematic review focuses on a specific research question to identify, appraise, select, and synthesize all high-quality research evidence and arguments relevant to that question. A meta-analysis is typically a systematic review using statistical methods to effectively combine the data used on all selected studies to produce a more reliable result. Torraco (2016) describes an integrative literature review. The purpose of an integrative literature review is to generate new knowledge on a topic through the process of review, critique, and synthesis of the literature under investigation. George et al (2023) offer an extensive overview of review approaches. They also propose a model for selecting an approach by looking at the purpose, object, subject, community, and practices of the review. They describe six different types of review, each with their own unique purposes: Exploratory or scoping reviews focus on breadth as opposed to depth Systematic or integrative reviews integrate empirical studies on a topic Meta-narrative reviews are qualitative and use literature to compare research or practice communities Problematizing or critical reviews propose new perspectives on a concept by association with other literature Meta-analyses and meta-regressions integrate quantitative studies and identify moderators Mixed research syntheses combine other review approaches in the same paper == Process and product == Shields and Rangarajan (2013) distinguish between the process of reviewing the literature and a finished work or product known as a literature review. The process of reviewing the literature is often ongoing and informs many aspects of the empirical research project. The process of reviewing the literature requires different kinds of activities and ways of thinking. Shields and Rangarajan (2013) and Granello (2001) link the activities of doing a literature review with Benjamin Bloom's revised taxonomy of the cognitive domain (ways of thinking: remembering, understanding, applying, analyzing, evaluating, and creating). === Use of artificial intelligence in a literature review === Artificial intelligence (AI) is reshaping traditional literature reviews across various disciplines. Generative pre-trained transformers, such as ChatGPT, are often used by students and academics for review purposes. Since 2023, an increasing number of tools powered by large language models and other artificial intelligence technologies have been developed to assist, automate, or generate literature reviews. Nevertheless, the employment of ChatGPT in academic reviews is problematic due to ChatGPT's propensity to "hallucinate". In response, efforts are being made to mitigate these hallucinations through the integration of plugins. For instance, Rad et al. (2023) used ScholarAI for review in cardiothoracic surgery.

    Read more →
  • Microsoft Query

    Microsoft Query

    Microsoft Query is a visual method of creating database queries using examples based on a text string, the name of a document or a list of documents. The QBE system converts the user input into a formal database query using Structured Query Language (SQL) on the backend, allowing the user to perform powerful searches without having to explicitly compose them in SQL, and without even needing to know SQL. It is derived from Moshé M. Zloof's original Query by Example (QBE) implemented in the mid-1970s at IBM's Research Centre in Yorktown, New York. In the context of Microsoft Access, QBE is used for introducing students to database querying, and as a user-friendly database management system for small businesses. Microsoft Excel allows results of QBE queries to be embedded in spreadsheets.

    Read more →
  • Microscope image processing

    Microscope image processing

    Microscope image processing is a broad term that covers the use of digital image processing techniques to process, analyze and present images obtained from a microscope. Such processing is now commonplace in a number of diverse fields such as medicine, biological research, cancer research, drug testing, metallurgy, etc. A number of manufacturers of microscopes now specifically design in features that allow the microscopes to interface to an image processing system. == Image acquisition == Until the early 1990s, most image acquisition in video microscopy applications was typically done with an analog video camera, often simply closed circuit TV cameras. While this required the use of a frame grabber to digitize the images, video cameras provided images at full video frame rate (25-30 frames per second) allowing live video recording and processing. While the advent of solid state detectors yielded several advantages, the real-time video camera was actually superior in many respects. Today, acquisition is usually done using a CCD camera mounted in the optical path of the microscope. The camera may be full colour or monochrome. Very often, very high resolution cameras are employed to gain as much direct information as possible. Cryogenic cooling is also common, to minimise noise. Often digital cameras used for this application provide pixel intensity data to a resolution of 12-16 bits, much higher than is used in consumer imaging products. Ironically, in recent years, much effort has been put into acquiring data at video rates, or higher (25-30 frames per second or higher). What was once easy with off-the-shelf video cameras now requires special, high speed electronics to handle the vast digital data bandwidth. Higher speed acquisition allows dynamic processes to be observed in real time, or stored for later playback and analysis. Combined with the high image resolution, this approach can generate vast quantities of raw data, which can be a challenge to deal with, even with a modern computer system. While current CCD detectors allow very high image resolution, often this involves a trade-off because, for a given chip size, as the pixel count increases, the pixel size decreases. As the pixels get smaller, their well depth decreases, reducing the number of electrons that can be stored. In turn, this results in a poorer signal-to-noise ratio. For best results, one must select an appropriate sensor for a given application. Because microscope images have an intrinsic limiting resolution, it often makes little sense to use a noisy, high resolution detector for image acquisition. A more modest detector, with larger pixels, can often produce much higher quality images because of reduced noise. This is especially important in low-light applications such as fluorescence microscopy. Moreover, one must also consider the temporal resolution requirements of the application. A lower resolution detector will often have a significantly higher acquisition rate, permitting the observation of faster events. Conversely, if the observed object is motionless, one may wish to acquire images at the highest possible spatial resolution without regard to the time required to acquire a single image. == 2D image techniques == Image processing for microscopy application begins with fundamental techniques intended to most accurately reproduce the information contained in the microscopic sample. This might include adjusting the brightness and contrast of the image, averaging images to reduce image noise and correcting for illumination non-uniformities. Such processing involves only basic arithmetic operations between images (i.e. addition, subtraction, multiplication and division). The vast majority of processing done on microscope image is of this nature. Another class of common 2D operations called image convolution are often used to reduce or enhance image details. Such "blurring" and "sharpening" algorithms in most programs work by altering a pixel's value based on a weighted sum of that and the surrounding pixels (a more detailed description of kernel based convolution deserves an entry for itself) or by altering the frequency domain function of the image using Fourier Transform. Most image processing techniques are performed in the Frequency domain. Other basic two dimensional techniques include operations such as image rotation, warping, color balancing etc. At times, advanced techniques are employed with the goal of "undoing" the distortion of the optical path of the microscope, thus eliminating distortions and blurring caused by the instrumentation. This process is called deconvolution, and a variety of algorithms have been developed, some of great mathematical complexity. The end result is an image far sharper and clearer than could be obtained in the optical domain alone. This is typically a 3-dimensional operation, that analyzes a volumetric image (i.e. images taken at a variety of focal planes through the sample) and uses this data to reconstruct a more accurate 3-dimensional image. == 3D image techniques == Another common requirement is to take a series of images at a fixed position, but at different focal depths. Since most microscopic samples are essentially transparent, and the depth of field of the focused sample is exceptionally narrow, it is possible to capture images "through" a three-dimensional object using 2D equipment like confocal microscopes. Software is then able to reconstruct a 3D model of the original sample which may be manipulated appropriately. The processing turns a 2D instrument into a 3D instrument, which would not otherwise exist. In recent times this technique has led to a number of scientific discoveries in cell biology. == Analysis == Analysis of images will vary considerably according to application. Typical analysis includes determining where the edges of an object are, counting similar objects, calculating the area, perimeter length and other useful measurements of each object. A common approach is to create an image mask which only includes pixels that match certain criteria, then perform simpler scanning operations on the resulting mask. It is also possible to label objects and track their motion over a series of frames in a video sequence.

    Read more →
  • Data management plan

    Data management plan

    A data management plan or DMP is a formal document that outlines how data are to be handled both during a research project, and after the project is completed. The goal of a data management plan is to consider the many aspects of data management, metadata generation, data preservation, and analysis before the project begins; this may lead to data being well-managed in the present, and prepared for preservation in the future. DMPs were originally used in 1966 to manage aeronautical and engineering projects' data collection and analysis, and expanded across engineering and scientific disciplines in the 1970s and 1980s. Up until the early 2000s, DMPs were used "for projects of great technical complexity, and for limited mid-study data collection and processing purposes". In the 2000s and later, E-research and economic policies drove the development and uptake of DMPs. == Importance == Preparing a data management plan before data are collected is claimed to ensure that data are in the correct format, organized well, and better annotated. This could arguably save time in the long term because there is no need to re-organize, re-format, or try to remember details about data. It is also claimed to increase research efficiency since both the data collector and other researchers might be able to understand and use well-annotated data in the future. One component of a data management plan is data archiving and preservation. By deciding on an archive ahead of time, the data collector can format data during collection to make its future submission to a database easier. If data are preserved, they are more relevant since they can be re-used by other researchers. It also allows the data collector to direct requests for data to the database, rather than address requests individually. A frequent argument in favor of preservation is that data that are preserved have the potential to lead to new, unanticipated discoveries, and they prevent duplication of scientific studies that have already been conducted. Data archiving also provides insurance against loss by the data collector. In the 2010s, funding agencies increasingly required data management plans as part of the proposal and evaluation process, despite little or no evidence of their efficacy. == Major components == "There is no general and definitive list of topics that should be covered in a DMP for a research project", and researchers are often left to their own devices as to how to fill out a DMP. === Information about data and data format === A description of data to be produced by the project. This might include (but is not limited to) data that are: Experimental Observational Raw or derived Physical collections Models Simulations Curriculum materials Software Images How will the data be acquired? When and where will they be acquired? After collection, how will the data be processed? Include information about Software used Algorithms Scientific workflows File formats that will be used, justify those formats, and describe the naming conventions used. Quality assurance & quality control measures that will be taken during sample collection, analysis, and processing. If existing data are used, what are their origins? How will the data collected be combined with existing data? What is the relationship between the data collected and existing data? How will the data be managed in the short-term? Consider the following: Version control for files Backing up data and data products Security & protection of data and data products Who will be responsible for management === Metadata content and format === Metadata are the contextual details, including any information important for using data. This may include descriptions of temporal and spatial details, instruments, parameters, units, files, etc. Metadata is commonly referred to as "data about data". Issues to be considered include: How detailed has the metadata to be in order to make the data meaningful? How will the metadata be created and/or captured? Examples include lab notebooks, GPS hand-held units, Auto-saved files on instruments, etc. What format will be used for the metadata? What are the metadata standards commonly used in the respective scientific discipline? There should be justification for the format chosen. === Policies for access, sharing, and re-use === Describe any obligations that exist for sharing data collected. These may include obligations from funding agencies, institutions, other professional organizations, and legal requirements. Include information about how data will be shared, including when the data will be accessible, how long the data will be available, how access can be gained, and any rights that the data collector reserves for using data. Address any ethical or privacy issues with data sharing Address intellectual property & copyright issues. Who owns the copyright? What are the institutional, publisher, and/or funding agency policies associated with intellectual property? Are there embargoes for political, commercial, or patent reasons? Describe the intended future uses/users for the data Indicate how the data should be cited by others. How will the issue of persistent citation be addressed? For example, if the data will be deposited in a public archive, will the dataset have a persistent identifier (e.g., ARK, DOI, Handle, PURL, URN) assigned to it? === Long-term storage and data management === Researchers should identify an appropriate archive for the long-term preservation of their data. By identifying the archive early in the project, the data can be formatted, transformed, and documented appropriately to meet the requirements of the archive. Researchers should consult colleagues and professional societies in their discipline to determine the most appropriate database, and include a backup archive in their data management plan in case their first choice goes out of existence. Early in the project, the primary researcher should identify what data will be preserved in an archive. Usually, preserving the data in its most raw form is desirable, although data derivatives and products can also be preserved. An individual should be identified as the primary contact person for archived data, and ensure contact information is always kept up-to-date in case there are requests for data or information about data. === Budget === Data management and preservation costs may be considerable, depending on the nature of the project. By anticipating costs ahead of time, researchers ensure that the data will be properly managed and archived. Potential expenses that should be considered are Human resources and staff as they handle data preparation, management, documentation, and preservation Hardware and/or software needed for data management, backing up, security, documentation, and preservation Costs associated with submitting the data to an archive The data management plan should include how these costs will be paid. == NSF Data Management Plan == All grant proposals submitted to National Science Foundation (NSF) must include a Data Management Plan that is no more than two pages. This is a supplement (not part of the 15-page proposal) and should describe how the proposal will conform to the Award and Administration Guide policy (see below). It may include the following: The types of data The standards to be used for data and metadata format and content Policies for access and sharing Policies and provisions for re-use Plans for archiving data Policy summarized from the NSF Award and Administration Guide, Section 4 (Dissemination and Sharing of Research Results): Promptly publish with appropriate authorship Share data, samples, physical collections, and supporting materials with others, within a reasonable time frame Share software and inventions Investigators can keep their legal rights over their intellectual property, but they still have to make their results, data, and collections available to others Policies will be implemented via Proposal review Award negotiations and conditions Support/incentives == ESRC Data Management Plan == Since 1995, the UK's Economic and Social Research Council (ESRC) have had a research data policy in place. The current ESRC Research Data Policy states that research data created as a result of ESRC-funded research should be openly available to the scientific community to the maximum extent possible, through long-term preservation and high-quality data management. ESRC requires a data management plan for all research award applications where new data are being created. Such plans are designed to promote a structured approach to data management throughout the data lifecycle, resulting in better quality data that is ready to archive for sharing and re-use. The UK Data Service, the ESRC's flagship data service, provides practical guidance on research data management planning suitable for social science researchers in the UK and around the world. ESRC has a longstanding arrangement with the UK Data A

    Read more →
  • Non-personal data

    Non-personal data

    Non-Personal Data (NPD) is electronic data that does not contain any information that can be used to identify a natural person. Thus, it can either be data that has no personal information to begin with (such as weather data, stock prices, data from anonymous IoT sensors); or it is data that had personal data that was subsequently pseudoanonymized (for example, identifiable strings substituted with random strings) or anonymized (such as by irreversibly removing all personal data). NPD is part of the overall Data Governance Strategy of a region or country. While personal data are covered by Data Protection Legislation such as GDPR, other kinds of data would fall under the scope of NPD Regulation. == Importance of non-personal data == It has been pointed out that the future is data-driven. What this means is that much of the present innovation taking place in domains such as Machine Learning and Artificial Intelligence is fueled by data, which is needed for calibrating the complex models (comprising neural network-based as well as other kinds). The larger the volume, diversity and quality of the data, the higher is the quality of the model, leading to better predictions and explanations. However, there is a flip-side to data availability. The newly-emerging awareness of privacy and the consequent need for powerful Data Protection Regulations (such as GDPR) makes it increasingly difficult or impossible to obtain data in the quantities required. This is a contradiction, and the only way out would be to remove all personal data from data sets (either by Data anonymization or Pseudonymization coupled with noise injection, at which point it becomes NPD. Therefore, many innovation-friendly countries are coming out with regulatory regimes that would ensure that personal data is protected, while, at the same time, non-personal data can be extracted from personal data so that innovation is fostered. In other words, NPD 'unlocks' value that was locked away in data sets that have personally-identifiable information. It is expected that multiple NPD data sets will begin to be available on free or commercial basis from different providers once the regulations are in place. == Emerging regulatory frameworks == Non-Personal Data has significant uses that may be economic, social, political or security-related. Several countries and regions are in the process of regulating the use of NPD. In May 2019, the European Union operationalized its Regulation of the Free Flow of NPD. India announced a nine-member expert committee to make recommendations on the regulation of NPD in 2019, which published its first report in mid-2020. The report was opened for public comments, after which it was revised and published in December 2020. == Proposed NPD regulatory framework in India == The following were the objectives of the proposed Indian regulation as per the revised report: Sovereignty: India has rights over the data of India, its people and organisations. Benefit India: Benefits of data must accrue to India and its people. Benefits the world: Innovation, new models and algorithms for the world. Privacy: Misuse, reidentification and harms must be prevented. Simplicity: The regulations should be simple, digital and unambiguous. Innovation and entrepreneurship: The data should be freely available for innovation and entrepreneurship in India. == Concerns == The major concern in the use of NPD is if there are techniques (statistical or AI-based) by which multiple data sets can be used to extract personally-identifiable data.

    Read more →
  • Social information architecture

    Social information architecture

    Social information architecture, also known as social iA, is a sub-domain of information architecture which deals with the social aspects of conceptualizing, modeling and organizing information. It has become more relevant because of the rise of social media and Web 2.0 in recent times. == Approach == There are different approaches to the explanation of social information architecture. === Architecture model (internal space) === Architects designing a physical community space, have to consider how the architecture will shape social interactions. A long hallway of offices creates an utterly different dynamic than desks with arranged in an open space. One might foster individuality, privacy, propriety; the other: collaboration, distraction, communalism. Still, physical spaces can be flexibly repurposed and worked around if the inhabitants desire a social dynamic not instantly afforded by the space. Office doors can be left open to invite easier interaction. Partitions can be raised between adjacent desks to limit distraction and increase privacy. That's physical architecture. The information architectures of online communities are far more deterministic and far less flexible. They literally define the social architecture by pre-specifying in immutable computer code what information you have access to, who you can talk to, where you can go. In the online world, information architecture = social architecture. === Social dialogue and information model (external space) === All major brands use information architecture to market their products online, it is then commonly wrapped under the umbrella phrase 'digital strategy'. Information architecture used for strategic purposes encompasses brand SEO, strategic placement of virals, social media presence etc. Charities, news outlets and social dialogue forums can make a much more specific use of the same tools for positive and important social purposes. Social Information Architecture is perceived as the socially conscious wing of commercial information architecture and function to exchange information and ideas between people and groups. Social iA can pick up on conflicting issues that are treated with misunderstanding between cultures and leaves individuals and societies vulnerable to exploitation and manipulation. Since the net has such a far reach it is obvious to use it for meaningful and coordinated social dialogue. Example of such issues are faith, environment, politics, climate change, war, injustice and other social challenges. Information architecture can help create frameworks in which sharing information brings people together, inspires and encourages them to participate in a forward thinking and unfragmented way. One of its core activities is to spread messages that bring people from opposite sites of social and cultural spectrums together and to confront uncomfortable subject head on. == How does social information architecture work? == Social iA utilizes a variety of Web2.0 applications to filter relevant or valuable information and weave them in appropriate information repository or provide feedback to interesting channels. Social iA makes strategic use of Search Engines, Social Media, Google Algorithms, as well as websites, video & news channels. It ‘reads’ or 'listens' to social conversations and search engine queries and engages with the net actively to gather clues about the world's pulse on the internet. It assesses data, social & political trends, and respond with targeted campaigns to give people ideas, as well as help people with making sense of information. == Principals == Dan Brown in his paper 8 Principals of Social Information Architecture enlists the following principals: 1. The principle of objects: Treat content as a living, breathing thing, with a lifecycle, behaviors and attributes. 2. The principle of choices: Create pages that offer meaningful choices to users, keeping the range of choices available focused on a particular task. 3. The principle of disclosure: Show only enough information to help people understand what kinds of information they'll find as they dig deeper. 4. The principle of exemplars: Describe the contents of categories by showing examples of the contents. 5. The principle of front doors: Assume at least half of the website's visitors will come through some page other than the home page. 6. The principle of multiple classification: Offer users several different classification schemes to browse the site's content. 7. The principle of focused navigation: Don't mix apples and oranges in your navigation scheme. 8. The principle of growth: Assume the content you have today is a small fraction of the content you will have tomorrow. == What can social information architecture achieve? == Social information architecture has many potentials in terms of fostering social connections and how information is shared in social spaces on the web.

    Read more →
  • Cooliris (plugin)

    Cooliris (plugin)

    Cooliris (for Desktop), formerly known as PicLens, was a web browser extension developed by Cooliris, Inc, and later acquired by Yahoo. The plugin provides an interactive 3D-like experience for viewing digital images and videos from the web and from desktop applications. The software places a small icon atop image thumbnails that appear on a webpage. Clicking on the icon loads the Cooliris 3D Wall, a browsing environment that gives the user the effect of flying through a three-dimensional space. Released to the public in January 2008, The New York Times described Cooliris as the "new immersive approach to Web navigation". Cooliris went out to win the 2008 Crunchies Award for Best Design. The plugin has received over 50 million downloads. As of May 2014 browser plugins are unavailable from the official website. There are only links to tablet apps - for iOS and Android.

    Read more →
  • KiSAO

    KiSAO

    The Kinetic Simulation Algorithm Ontology (KiSAO) supplies information about existing algorithms available for the simulation of systems biology models, their characterization and interrelationships. KiSAO is part of the BioModels.net project and of the COMBINE initiative. == Structure == KiSAO consists of three main branches: simulation algorithm simulation algorithm characteristic simulation algorithm parameter The elements of each algorithm branch are linked to characteristic and parameter branches using has characteristic and has parameter relationships accordingly. The algorithm branch itself is hierarchically structured using relationships which denote that the descendant algorithms were derived from, or specify, more general ancestors.

    Read more →
  • Living lab

    Living lab

    The concept of the living lab has been defined in multiple ways. A definition from the European Network of Living Labs (ENoLL) is used most widely, describing them as "user-centred open innovation ecosystems” that integrate research and innovation through co-creation in real-world environments.[1] Emerging at the intersection of ambient intelligence research and user experience methodologies in the late 1990s, the concept was pioneered at the Massachusetts Institute of Technology (MIT) as a way to study human interaction with new technologies in natural settings. Over time, living labs have evolved beyond their origins as controlled research environments, becoming dynamic platforms for participatory design, collaborative experimentation, and iterative innovation across various domains, including urban development, healthcare, sustainability, and digital technology. Characterized by principles such as real-world experimentation, active user involvement, and multi-stakeholder collaboration, living labs enable the continuous adaptation and validation of solutions in everyday contexts. Today, they are implemented globally, supported by networks like the European Network of Living Labs (ENoLL), and increasingly recognized as vital tools for addressing local and global transformation agendas. == Background == The term "living lab" has emerged in parallel from the ambient intelligence (AmI) research communities context and from the discussion on experience and application research (EAR). The emergence of the term is based on the concept of user experience and ambient intelligence. The term dates back to the late 1990s when Professor William J. Mitchell, Kent Larson, and Alex (Sandy) Pentland at the Massachusetts Institute of Technology were credited with first exploring the concept of a living laboratory. It was first associated with MIT's Media Lab as a concept for studying real-life contexts, where they described a living lab as a controlled environment designed to test new information and communication technology (ICT) innovations in a simulated home setting. This was also when some of the key characteristics often assigned to living labs today began to take shape. They argued that a living lab represents a user-centric research methodology for sensing, prototyping, validating and refining complex solutions in multiple and evolving real-life contexts. Research on living labs has expanded since the 1990s, especially in the 2010s, with growing interest in co-creation and participatory design. Particularly in Europe, the living lab evolved into a model that focused on studying user interactions with technology in real-world environments. This shift was influenced by earlier experiences in participatory design and social experiments with ICT. As interest grew, the term began to encompass a broader array of initiatives and projects, leading to variations in its interpretation and implementation. Today, living labs are used in various fields, such as technology, healthcare, and urban sustainability, showing a transition from a narrow focus on their role as controlled environments to a more wide-ranging understanding of collaborative innovation addressing real societal challenges, while also being referred to with various descriptions and definitions available from different sources. == Description == The ENoLL definition that refers to living labs as "user-centred open innovation ecosystems” that integrate research and innovation through co-creation in real-world environments is the most widely accepted description of living labs in academic literature. In simple terms, living labs can be described as an organization or experimental space, that can be both virtually or physically located, bringing different stakeholders from research, business, government, and citizens together to design and test solutions to be implemented in a real world environment. A common definition for the living lab term still does not exist to this day, which is due to the fact that living labs are interpreted and implemented across different contexts and can cover a wide range of activities and organizations, leading to different understandings of how living labs should function. Living labs also often operate in various territorial contexts (e.g. city, agglomeration, region, campus), and can vary in their methodological approach integrating concurrent research and innovation processes within a public-private-people partnership. Despite these variations, common characteristics include user-centricity, real-world experimentation, multi-stakeholder collaboration, and iterative innovation processes. The systematic user co-creation approach refers to integrating research and innovation processes through the co-creation, exploration, experimentation and evaluation of innovative ideas, scenarios, concepts and related technological artefacts in real life use cases. Such use cases involve user communities, not only as observed subjects but also as a source of creation. This approach allows all involved stakeholders to concurrently consider both the global performance of a product or service and its potential adoption by users. This consideration may be made at the earlier stage of research and development and through all elements of the product life-cycle, from design up to recycling. User-centred research methods, such as action research, community informatics, contextual design, user-centered design, participatory design, empathic design, emotional design, and other usability methods, already exist but fail to sufficiently empower users for co-creating into open development environments. More recently, the Web 2.0 has demonstrated the positive impact of involving user communities in new product development (NPD) such as mass collaboration projects (e.g. crowdsourcing, Wisdom of Crowds) in collectively creating new contents and applications. Real-world experimentation emphasizes conducting activities in real-life settings to ensure that the results of the projects and solutions are applicable to actual market conditions. Multi-stakeholder collaboration refers to an approach that involved various stakeholders, such as users, businesses, researchers, and government entities, working together towards a common goal. This is an important characteristics of living lab because collaboration of these diverse groups allows for exchange of ideas and perspectives, which are thought to enhance innovation processes. Iterative innovation processes involve a cyclical method of developing products or services, where stages such as research, development, testing, and implementation are revisited multiple times based on feedback and evaluation. This process allows for continuous improvement of the innovation, product, or service being developed. In particular, the ongoing involvement of the user creates feedback mechanisms that are ultimately key to successful development and implementation of products and services. A living lab is not similar to a testbed as its philosophy is to turn users, from being traditionally considered as observed subjects for testing modules against requirements, into value creation in contributing to the co-creation and exploration of emerging ideas, breakthrough scenarios, innovative concepts and related artefacts. Hence, a living lab rather constitutes an experiential environment, which could be compared to the concept of experiential learning, where users are immersed in a creative social space for designing and experiencing their own future. Living labs could also be used by policy makers and users/citizens for designing, exploring, experiencing and refining new policies and regulations in real-life scenarios for evaluating their potential impacts before their implementations. == European Network of Living Labs (ENoLL) == The European Network of Living Labs (ENoLL) is an international, non-profit, independent association of certified living labs, which popularized the living lab concept in the aim to increase user involvement in innovation. Formed in November 2006 under the guidance of the Finnish European Presidency, ENoLL is composed of a variety of stakeholders, including municipalities and research institutes, businesses, and users. Its primary role is to support the collaboration among living labs across Europe and includes many living labs focused on user-driven innovation across sectors. ENoLL focuses on facilitating knowledge exchange, joint actions and project partnerships among its historically labelled +/- 500 members, influencing EU policies, promoting living labs and enabling their implementation worldwide. ENoLL serves as a platform for linking living labs around the globe, which enables knowledge sharing and collaborative learning among diverse cultural environments. Membership to the platform is open to organizations worldwide, and ENoLL has expanded beyond Europe to include global members. ENoLL follows an application and accreditation pro

    Read more →
  • Document capture software

    Document capture software

    Document capture software refers to applications that provide the ability and feature set to automate the process of scanning paper documents or importing electronic documents, often for the purposes of feeding advanced document classification and data collection processes. Most scanning hardware, both scanners and copiers, provides the basic ability to scan to any number of image file formats, including: PDF, TIFF, JPG, BMP, etc. This basic functionality is augmented by document capture software, which can add efficiency and standardization to the process. == Typical features == Typical features of Document Capture Software include: Barcode recognition Patch Code recognition Separation Optical Character Recognition (OCR) Optical Mark Recognition (OMR) Quality Assurance Indexing Migration === Goal for implementation of a document capture solution === The goal for implementing a document capture solution is to reduce the amount of time spent scanning, separating, enhancing, organizing, classifying, normalizing, and collecting information from document collections, and to produce metadata along with an image/PDF file, and/or OCR text. This information is then migrated to a file share, FTP site, database, Document Management or Enterprise Content Management system. These systems often provide a search function, allowing search of the assets based on the produced metadata, and then viewed using document imaging software. == General document capture system solutions == === Integration with document management system === ECM (Enterprise Content management) and their DMS component (Document Management System) are being adopted by many organizations as a corporate document management system for all types of electronic files, e.g. MS word, PDF ... However, much of the information held by organisations is on paper and this needs to be integrated within the same document repository. By converting paper documents into digital format through scanning, organizations convert paper into image formats such as TIF, JPG, and PDF, and also extract valuable index information or business data from the document using OCR technology. Digital documents and associated metadata can easily be stored in the ECM in a variety of formats. The most popular of these formats is PDF which not only provides an accurate representation of the document but also allows all the OCR text in the document to be stored behind the PDF image. This format is known as PDF with hidden text or text-searchable PDF. This allows users to search for documents by using keywords in the metadata fields or by searching the content of PDF files across the repository. ==== Advantages of scanning documents into a ECM/DMS ==== Information held on paper is usually just as valuable to organisations as the electronic documents that are generated internally. Often this information represents a large proportion of the day to day correspondence with suppliers and customers. Having the ability to manage and share this information internally through a document management system such as SharePoint or a CMIS-compatible repository improves collaboration between departments or employees and also eliminates the risk of losing this information through disasters such as floods or fire. Organisations adopting an ECM/DMS often implement electronic workflow which allows the information held on paper to be included as part of an electronic business process and incorporated into a customer record file along with other associated office documents and emails. For business critical documents, such as purchase orders and supplier invoices, digitising documents helps speed up business transactions as well as reduce manual effort involved in keying data into business systems, such as CRM, ERP and Accounting. Scanned invoices can also be routed to managers for payment approval via email or an electronic workflow. == Electronic document capture == In the earlier implementations of Document Capture Software, the technology focused solely on the digitization and capture of information from paper documents. Document images were acquired from document scanners via TWAIN/ISIS drivers. Only image-based file formats like TIF, JPG, and BMP were typically compatible with these solutions. But in recent years, as the volume of electronically-created documents and the number of proprietary file formats continues to increase at exponential rates, the need for handling documents existing in electronic formats has grown. The relevant document capture products have adapted to function with non-image file formats with the end-goal of creating a unified processing workflow capable of handling all incoming documents The ability to import files from a variety of sources is one example of such adaptation. Importing documents from ECM/DMS software solutions, email servers, FTP, and EDI is now as much of a requirement of document capture software as is paper capture. The normalization of output files to text-based PDF format is now another critical factor in long-term archival of proprietary electronic file formats. Normalization expands access and usage of files to users throughout the enterprise, rather than only those that created the original electronic file.

    Read more →
  • Ibotta

    Ibotta

    Ibotta, Inc. is an American mobile technology company headquartered in Denver, Colorado. Founded in 2011, the company offers cash back rewards on various purchases through its Ibotta Performance Network and direct to consumer app. Ibotta partners with CPG (consumer packaged goods) brands and network publishers to provide these rewards. As of 2024, the company operates solely in the United States. The company's rewards-as-a-service offering, the Ibotta Performance Network, went live in 2022. In August 2019, Ibotta received a $1 billion valuation after its Series D funding, and in 2023, the company surpassed $1.5 billion cash rewards paid to over 50 million consumers since the company's founding. Ibotta became a publicly traded company in April 2024 with a listing on the New York Stock Exchange. As of September 2025, Ibotta is trading at approximately $27.13 per share, marking a 69% decline from its initial public offering price of $88 per share on April 18, 2024. == History == === Founding through early 2019 === Ibotta was founded by current CEO Bryan Leach. The company was incorporated in 2011 and the app launched to both the App Store and Google Play stores in 2012. Early investors included entrepreneur and computer scientist Jim Clark and Tom “TJ” Jermoluk, Chairman of @Home Network. In 2015, Ibotta expanded beyond item level grocery, adding the ability to get cash back on in-store retail purchases. In 2016, in-app mobile commerce began, allowing users to navigate from the Ibotta app to its partners' apps to earn cash back on purchases. In 2016 with a Series C investment, Ibotta had raised over $73 million in funding. In March of that year, Ibotta partnered with Anheuser-Busch to offer cash back for adults who purchased its products. In May, the company partnered with LiveRamp so that companies could use their CRM data to create segmented, personalized campaigns. At the time, the company had around 200 full- and part-time employees and moved from offices in Lower Downtown Denver (LoDo) to a 40,000-square-foot office in the central Denver business district. A year later, the company had to expand to a second floor as it added almost another 100 employees. In 2017, Ibotta added cash back for Uber to its app as well as cash back rewards for online and mobile purchases. In 2018, Ibotta was listed on the Inc. 5,000 list as one of the fastest growing private companies in the U.S. A year later, in January 2019, the Ibotta app had been downloaded more than 30 million times with users receiving a reported $500 million in cash back rewards. That year, Ibotta was the largest mobile company in Colorado with six million monthly active users. === August 2019 to present === In August 2019, Ibotta was valued at $1 billion, following a Series D round of funding. The round was led by Koch Disruptive Technologies, a subsidiary of Koch Industries. 2019 was also the year the company introduced Pay with Ibotta, which allowed users to complete purchases at key retailers on the Ibotta app and earn instant cash back in the process. With that new service, users were able to enter their purchase total and use a QR code to checkout and receive immediate cash back. In 2020, the company partnered with Trees for the Future to plant up to 1 million trees as part of an Earth Month campaign to raise awareness about the waste of unused paper coupons. In response to the COVID-19 pandemic, Ibotta partnered with CPG brands in their “Here to Help” campaign and together committed over $10 million in cash back to American consumers. The company added the ability to earn cash back from online grocery pick-up and delivery orders. Later that year, Ibotta started its free Thanksgiving program, providing users with 100% cash back on select groceries needed for a Thanksgiving meal. By 2022, the company had provided approximately 10 million Thanksgiving meals. In 2021, Ibotta acquired the company OctoShop (originally InStok), a shopping browser extension company. The OctoShop app enables users to compare prices across stores and set restock and price-drop alerts. In April 2022, the Ibotta Performance Network (IPN) was launched. The IPN allows brands to deliver digital offers to consumers through third party publishers. Retailers including Walmart, Dollar General and Family Dollar, food delivery services including Instacart, and convenience stores including Shell are all part of the Ibotta Performance Network. This pay-per-sales or success-based performance network reaches over 200 million consumers. On April 18, 2024, Ibotta had its initial public offering (IPO), trading on the New York Stock Exchange (NYSE) under the ticker symbol IBTA. It was the largest technology IPO in Colorado history. In October 2025, Ibotta announced a partnership with technology and analytics company Circana, integrating Circana's Household Lift measurement into Ibotta campaigns to give CPG brands an increased understanding of the impact of their promotional campaigns. On November 3, 2025, Ibotta launched LiveLift, a tool for companies to measure the return on investment of digital promotions, in order to optimize performance marketing goals. === Athletic partnerships === Ibotta became the official jersey patch partner of the New Orleans Pelicans, a professional men's basketball team in the National Basketball Association (NBA), for the 2020–2021 and 2023–2024 seasons. Ibotta became the official jersey patch partner of the 2023 NBA champion Denver Nuggets baskeetball team beginning in the 2023–2024 season. In March 2023, F1 driver Logan Sargeant, the first U.S. racer to compete in F1 since 2015, partnered with Ibotta. The Ibotta logo was displayed on Sargeant's racing helmet throughout his F1 career. In June 2023, UConn Huskies women's basketball player Paige Bueckers entered into a "name, image, and likeness" (NIL) promotional agreement with Ibotta. According to a press release by Ibotta, the company has agreements with The Brandr Group, which finds NIL opportunities for women college athletes, and the Pearpop social media marketing platform to promote Ibotta. == Legal issues == In April 2025, shareholders filed a class action lawsuit—Fortune v. Ibotta, Inc., in the U.S. District Court for the District of Colorado (Case No. 25-cv-01213)—alleging that the registration statement in connection with Ibotta’s April 2024 initial public offering omitted material information. The complaint claims that, although Ibotta disclosed detailed terms for its contract with Walmart Inc., it failed to warn investors that its agreement with The Kroger Co., its second-largest client, was terminable at will and thus could be canceled without warning, creating a misleading impression of stability.

    Read more →
  • Very large database

    Very large database

    A very large database, (originally written very large data base) or VLDB, is a database that contains a very large amount of data, so much that it can require specialized architectural, management, processing and maintenance methodologies. == Definition == The vague adjectives of very and large allow for a broad and subjective interpretation, but attempts at defining a metric and threshold have been made. Early metrics were the size of the database in a canonical form via database normalization or the time for a full database operation like a backup. Technology improvements have continually changed what is considered very large. One definition has suggested that a database has become a VLDB when it is "too large to be maintained within the window of opportunity… the time when the database is quiet". == Sizes of a VLDB database == There is no absolute amount of data that can be cited. For example, one cannot say that any database with more than 1 TB of data is considered a VLDB. This absolute amount of data has varied over time as computer processing, storage and backup methods have become better able to handle larger amounts of data. That said, VLDB issues may start to appear when 1 TB is approached, and are more than likely to have appeared as 30 TB or so is exceeded. == VLDB challenges == Key areas where a VLDB may present challenges include configuration, storage, performance, maintenance, administration, availability and server resources. === Configuration === Careful configuration of databases that lie in the VLDB realm is necessary to alleviate or reduce issues raised by VLDB databases. === Administration === The complexities of managing a VLDB can increase exponentially for the database administrator as database size increases. === Availability and maintenance === When dealing with VLDB operations relating to maintenance and recovery such as database reorganizations and file copies which were quite practical on a non-VLDB take very significant amounts of time and resources for a VLDB database. In particular it typically infeasible to meet a typical recovery time objective (RTO), the maximum expected time a database is expected to be unavailable due to interruption, by methods which involve copying files from disk or other storage archives. To overcome these issues techniques such as clustering, cloned/replicated/standby databases, file-snapshots, storage snapshots or a backup manager may help achieve the RTO and availability, although individual methods may have limitations, caveats, license, and infrastructure requirements while some may risk data loss and not meet the recovery point objective (RPO). For many systems only geographically remote solutions may be acceptable. ==== Backup and recovery ==== Best practice is for backup and recovery to be architectured in terms of the overall availability and business continuity solution. === Performance === Given the same infrastructure there may typically be a decrease in performance, that is increase in response time as database size increases. Some accesses will simply have more data to process (scan) which will take proportionally longer (linear time); while the indexes used to access data may grow slightly in height requiring perhaps an extra storage access to reach the data (sub-linear time). Other effects can be caching becoming less efficient because proportionally less data can be cached and while some indexes such as the B+ automatically sustain well with growth others such as a hash table may need to be rebuilt. Should an increase in database size cause the number of accessors of the database to increase then more server and network resources may be consumed, and the risk of contention will increase. Some solutions to regaining performance include partitioning, clustering, possibly with sharding, or use of a database machine. ==== Partitioning ==== Partitioning may be able assist the performance of bulk operations on a VLDB including backup and recovery., bulk movements due to information lifecycle management (ILM), reducing contention as well as allowing optimization of some query processing. === Storage === In order to satisfy needs of a VLDB the database storage needs to have low access latency and contention, high throughput, and high availability. === Server resources === The increasing size of a VLDB may put pressure on server and network resources and a bottleneck may appear that may require infrastructure investment to resolve. == Relationship to big data == VLDB is not the same as big data, but the storage aspect of big data may involve a VLDB database. That said some of the storage solutions supporting big data were designed from the start to support large volumes of data, so database administrators may not encounter VLDB issues that older versions of traditional RDBMS's might encounter.

    Read more →
  • Broadcast (parallel pattern)

    Broadcast (parallel pattern)

    Broadcast is a collective communication primitive in parallel programming to distribute programming instructions or data to nodes in a cluster. It is the reverse operation of reduction. The broadcast operation is widely used in parallel algorithms, such as matrix-vector multiplication, Gaussian elimination and shortest paths. The Message Passing Interface implements broadcast in MPI_Bcast. == Definition == A message M [ 1.. m ] {\displaystyle M[1..m]} of length m {\displaystyle m} should be distributed from one node to all other p − 1 {\displaystyle p-1} nodes. T byte {\displaystyle T_{\text{byte}}} is the time it takes to send one byte. T start {\displaystyle T_{\text{start}}} is the time it takes for a message to travel to another node, independent of its length. Therefore, the time to send a package from one node to another is t = s i z e × T byte + T start {\displaystyle t=\mathrm {size} \times T_{\text{byte}}+T_{\text{start}}} . p {\displaystyle p} is the number of nodes and the number of processors. == Binomial Tree Broadcast == With Binomial Tree Broadcast the whole message is sent at once. Each node that has already received the message sends it on further. This grows exponentially as each time step the amount of sending nodes is doubled. The algorithm is ideal for short messages but falls short with longer ones as during the time when the first transfer happens only one node is busy. Sending a message to all nodes takes log 2 ⁡ ( p ) t {\displaystyle \log _{2}(p)t} time which results in a runtime of log 2 ⁡ ( p ) ( m T byte + T start ) {\displaystyle \log _{2}(p)(mT_{\text{byte}}+T_{\text{start}})} == Linear Pipeline Broadcast == The message is split up into k {\displaystyle k} packages and sent piecewise from node n {\displaystyle n} to node n + 1 {\displaystyle n+1} . The time needed to distribute the first message piece is p t = m k T byte + T start {\textstyle pt={\frac {m}{k}}T_{\text{byte}}+T_{\text{start}}} whereby t {\displaystyle t} is the time needed to send a package from one processor to another. Sending a whole message takes ( p + k ) ( m T byte k + T start ) = ( p + k ) t = p t + k t {\displaystyle (p+k)\left({\frac {mT_{\text{byte}}}{k}}+T_{\text{start}}\right)=(p+k)t=pt+kt} . Optimal is to choose k = m ( p − 2 ) T byte T start {\displaystyle k={\sqrt {\frac {m(p-2)T_{\text{byte}}}{T_{\text{start}}}}}} resulting in a runtime of approximately m T byte + p T start + m p T start T byte {\displaystyle mT_{\text{byte}}+pT_{\text{start}}+{\sqrt {mpT_{\text{start}}T_{\text{byte}}}}} The run time is dependent on not only message length but also the number of processors that play roles. This approach shines when the length of the message is much larger than the amount of processors. == Pipelined Binary Tree Broadcast == This algorithm combines Binomial Tree Broadcast and Linear Pipeline Broadcast, which makes the algorithm work well for both short and long messages. The aim is to have as many nodes work as possible while maintaining the ability to send short messages quickly. A good approach is to use Fibonacci trees for splitting up the tree, which are a good choice as a message cannot be sent to both children at the same time. This results in a binary tree structure. We will assume in the following that communication is full-duplex. The Fibonacci tree structure has a depth of about d ≈ log Φ ⁡ ( p ) {\displaystyle d\approx \log _{\Phi }(p)} whereby Φ = 1 + 5 2 {\displaystyle \Phi ={\frac {1+{\sqrt {5}}}{2}}} the golden ratio. The resulting runtime is ( m k T byte + T start ) ( d + 2 k − 2 ) {\textstyle ({\frac {m}{k}}T_{\text{byte}}+T_{\text{start}})(d+2k-2)} . Optimal is k = n ( d − 2 ) T byte 3 T start {\displaystyle k={\sqrt {\frac {n(d-2)T_{\text{byte}}}{3T_{\text{start}}}}}} . This results in a runtime of 2 m T byte + T start log Φ ⁡ ( p ) + 2 m log Φ ⁡ ( p ) T start T byte {\displaystyle 2mT_{\text{byte}}+T_{\text{start}}\log _{\Phi }(p)+{\sqrt {2m\log _{\Phi }(p)T_{\text{start}}T_{\text{byte}}}}} . == Two Tree Broadcast (23-Broadcast) == === Definition === This algorithm aims to improve on some disadvantages of tree structure models with pipelines. Normally in tree structure models with pipelines (see above methods), leaves receive just their data and cannot contribute to send and spread data. The algorithm concurrently uses two binary trees to communicate over. Those trees will be called tree A and B. Structurally in binary trees there are relatively more leave nodes than inner nodes. Basic Idea of this algorithm is to make a leaf node of tree A be an inner node of tree B. It has also the same technical function in opposite side from B to A tree. This means, two packets are sent and received by inner nodes and leaves in different steps. === Tree construction === The number of steps needed to construct two parallel-working binary trees is dependent on the amount of processors. Like with other structures one processor can is the root node who sends messages to two trees. It is not necessary to set a root node, because it is not hard to recognize that the direction of sending messages in binary tree is normally top to bottom. There is no limitation on the number of processors to build two binary trees. Let the height of the combined tree be h = ⌈log(p + 2)⌉. Tree A and B can have a height of h − 1 {\displaystyle h-1} . Especially, if the number of processors correspond to p = 2 h − 1 {\displaystyle p=2^{h}-1} , we can make both sides trees and a root node. To construct this model efficiently and easily with a fully built tree, we can use two methods called "Shifting" and "Mirroring" to get second tree. Let assume tree A is already modeled and tree B is supposed to be constructed based on tree A. We assume that we have p {\displaystyle p} processors ordered from 0 to p − 1 {\displaystyle p-1} . ==== Shifting ==== The "Shifting" method, first copies tree A and moves every node one position to the left to get tree B. The node, which will be located on -1, becomes a child of processor p − 2 {\displaystyle p-2} . ==== Mirroring ==== "Mirroring" is ideal for an even number of processors. With this method tree B can be more easily constructed by tree A, because there are no structural transformations in order to create the new tree. In addition, a symmetric process makes this approach simple. This method can also handle an odd number of processors, in this case, we can set processor p − 1 {\displaystyle p-1} as root node for both trees. For the remaining processors "Mirroring" can be used. === Coloring === We need to find a schedule in order to make sure that no processor has to send or receive two messages from two trees in a step. The edge, is a communication connection to connect two nodes, and can be labelled as either 0 or 1 to make sure that every processor can alternate between 0 and 1-labelled edges. The edges of A and B can be colored with two colors (0 and 1) such that no processor is connected to its parent nodes in A and B using edges of the same color- no processor is connected to its children nodes in A or B using edges of the same color. In every even step the edges with 0 are activated and edges with 1 are activated in every odd step. === Time complexity === In this case the number of packet k is divided in half for each tree. Both trees are working together the total number of packets k = k / 2 + k / 2 {\displaystyle k=k/2+k/2} (upper tree + bottom tree) In each binary tree sending a message to another nodes takes 2 i {\displaystyle 2i} steps until a processor has at least a packet in step i {\displaystyle i} . Therefore, we can calculate all steps as d := log 2 ⁡ ( p + 1 ) ⇒ log 2 ⁡ ( p + 1 ) ≈ log 2 ⁡ ( p ) {\displaystyle d:=\log _{2}(p+1)\Rightarrow \log _{2}(p+1)\approx \log _{2}(p)} . The resulting run time is T ( m , p , k ) ≈ ( m k T byte + T start ) ( 2 d + k − 1 ) {\textstyle T(m,p,k)\approx ({\frac {m}{k}}T_{\text{byte}}+T_{\text{start}})(2d+k-1)} . (Optimal k = m ( 2 d − 1 ) T byte / T start {\textstyle k={\sqrt {{m(2d-1)T_{\text{byte}}}/{T_{\text{start}}}}}} ) This results in a run time of T ( m , p ) ≈ m T byte + T start ⋅ 2 log 2 ⁡ ( p ) + m ⋅ 2 log 2 ⁡ ( p ) T start T byte {\displaystyle T(m,p)\approx mT_{\text{byte}}+T_{\text{start}}\cdot 2\log _{2}(p)+{\sqrt {m\cdot 2\log _{2}(p)T_{\text{start}}T_{\text{byte}}}}} . == ESBT-Broadcasting (Edge-disjoint Spanning Binomial Trees) == In this section, another broadcasting algorithm with an underlying telephone communication model will be introduced. A Hypercube creates network system with p = 2 d ( d = 0 , 1 , 2 , 3 , . . . ) {\displaystyle p=2^{d}(d=0,1,2,3,...)} . Every node is represented by binary 0 , 1 {\displaystyle {0,1}} depending on the number of dimensions. Fundamentally ESBT(Edge-disjoint Spanning Binomial Trees) is based on hypercube graphs, pipelining( m {\displaystyle m} messages are divided by k {\displaystyle k} packets) and binomial trees. The Processor 0 d {\displaystyle 0^{d}} cyclically spreads packets to roots of ESB

    Read more →