Open-source robotics is a branch of robotics where robots are developed with open-source hardware and free and open-source software, publicly sharing blueprints, schematics, and source code. It is thus closely related to the open design movement, the maker movement and open science. == Requirements == Open source robotics means that information about the hardware is easily discerned, so that others can easily rebuild it. In turn, this requires design to use only easily available standard subcomponents and tools, and for the build process to be documented in detail including a bill of materials and detailed ('Ikea style') step-by-step building and testing instructions. (A CAD file alone is not sufficient, as it does not show the steps for performing or testing the build). These requirements are standard to open source hardware in general, and are formalised by various licences, certifications, especially those defined by the peer-reviewed journals Journal of Open Hardware and HardwareX. Licensing requirements for software are the same as for any open source software. But in addition, for software components to be of practical use in real robot systems, they need to be compatible with other software, usually as defined by some robotics middleware community standard. == Hardware systems == Applications to date include: Robot arms, e.g. PARA or Thor Wheeled mobile robots. e.g. OpenScout Four-legged robots such as the Open Dynamic Robot Initiative UAV quadcopters (drones) such as Agilicious Humanoid robots, e.g. iCub, Berkeley Humanoid Lite Self-driving cars, e.g. OpenPodcar (→ Personal rapid transit) Submersible robots, eg. OpenFish Laboratory robotics such as chemical liquid handling Vertical farming Swarm robots, e.g. HeRoSwarm Domestic tasks: vacuum cleaning, floor washing and grass mowing Robot sports including robot combat and autonomous racing Education == Hardware subcomponents == Most open source hardware definitions allow non-open subcomponents to be used in modular design, as long as they are easily available. However many designs try to push openness down into as many subcomponents as possible, with the aim of ultimately reaching fully open designs. Open hardware manual-drive vehicles and their subcomponents, such as from Open Source Ecology, are often used as starting points and extended with automation systems. Open subcomponents can include open-source computing hardware as subcomponents, such as Arduino and RISC-V, as well as open source motors and drivers such as the Open Source Motor Controller and ODrive. Open hardware robotics interface boards can simplify interfacing between middleware software and physical hardware. == Software subcomponents == === Middleware === Robotics middleware is software which links multiple other software components together. In robotics, this specifically means real-time communication systems with standardized message passing protocols. The predominant open source middleware is ROS2, the robot operating system, now as version 2. Other alternatives include ROS1, YARP — used in the iCub, URBI, and Orca. Open source middleware is usually run on an open source operating system, especially the Ubuntu distribution of Linux. === Driver software === Most robot sensors and actuators require software drivers. There is little standardization of open source software at this level, because each hardware device is different. Creating open drivers for closed hardware is difficult as it requires both low level programming and reverse engineering. === Simulation software === Open source robotics simulators include Gazebo, MuJoCo and Webots. Open source 3D game engines such as Godot are also sometimes used as simulators, when equipped with suitable middleware interfaces. === Automation software === At the level of AI, many standard algorithms have open source software implementations, mostly in ROS2. Major components include: Machine vision systems such as the YOLO object detector. 3D photogrammetry Navigation including SLAM and planning such as nav2 Arm inverse kinematics such as moveIt2 == Community == The first signs of the increasing popularity of building and sharing robot designs were found with the maker culture community. What began with small competitions for remote operated vehicles (e.g. Robot combat), soon developed to the building of autonomous telepresence robots such as Sparky and then true robots (being able to take decisions themselves) as the Open Automaton Project. Several commercial companies now also produce kits for making simple robots. The community has adopted open source hardware licenses, certifications, and peer-reviewed publications, which check that source has been made correctly and permanently available under community definitions, and which validate that this has been done. These processes have become critically important due to many historical projects claiming to be open source but them reverting on the promise due to commercialisation or other pressures. As with other forms of open source hardware, the community continues to debate precise criteria for 'ease of build'. A common standard is that designs should be buildable by a technical university student, in a few days, using typical fablab tools, but definitions of all of these subterms can also be debated. Compared to other forms of open source hardware, open source robotics typically includes a large software element, so involves software as well as hardware engineers. Open source concepts are more established in open source software than hardware, so robotics is a field in which those concepts can be shared and transferred from software to hardware. While the community in open source robotics is multi-faceted with a wide range of backgrounds, a sizable sub-community uses the ROS middleware and meets at the ROSCon conferences to discuss development of ROS itself and automation components built on it.
Scale space
Scale-space theory is a framework for multi-scale signal representation developed by the computer vision, image processing and signal processing communities with complementary motivations from physics and biological vision. It is a formal theory for handling image structures at different scales, by representing an image as a one-parameter family of smoothed images, the scale-space representation, parametrized by the size of the smoothing kernel used for suppressing fine-scale structures. The parameter t {\displaystyle t} in this family is referred to as the scale parameter, with the interpretation that image structures of spatial size smaller than about t {\displaystyle {\sqrt {t}}} have largely been smoothed away in the scale-space level at scale t {\displaystyle t} . The main type of scale space is the linear (Gaussian) scale space, which has wide applicability as well as the attractive property of being possible to derive from a small set of scale-space axioms. The corresponding scale-space framework encompasses a theory for Gaussian derivative operators, which can be used as a basis for expressing a large class of visual operations for computerized systems that process visual information. This framework also allows visual operations to be made scale invariant, which is necessary for dealing with the size variations that may occur in image data, because real-world objects may be of different sizes and in addition the distance between the object and the camera may be unknown and may vary depending on the circumstances. == Definition == The notion of scale space applies to signals of arbitrary numbers of variables. The most common case in the literature applies to two-dimensional images, which is what is presented here. Consider a given image f {\displaystyle f} where f ( x , y ) {\displaystyle f(x,y)} is the greyscale value of the pixel at position ( x , y ) {\displaystyle (x,y)} . The linear (Gaussian) scale-space representation of f {\displaystyle f} is a family of derived signals L ( x , y ; t ) {\displaystyle L(x,y;t)} defined by the convolution of f ( x , y ) {\displaystyle f(x,y)} with the two-dimensional Gaussian kernel g ( x , y ; t ) = 1 2 π t e − ( x 2 + y 2 ) / 2 t {\displaystyle g(x,y;t)={\frac {1}{2\pi t}}e^{-(x^{2}+y^{2})/2t}\,} such that L ( ⋅ , ⋅ ; t ) = g ( ⋅ , ⋅ ; t ) ∗ f ( ⋅ , ⋅ ) , {\displaystyle L(\cdot ,\cdot ;t)\ =g(\cdot ,\cdot ;t)f(\cdot ,\cdot ),} where the semicolon in the argument of L {\displaystyle L} implies that the convolution is performed only over the variables x , y {\displaystyle x,y} , while the scale parameter t {\displaystyle t} after the semicolon just indicates which scale level is being defined. This definition of L {\displaystyle L} works for a continuum of scales t ≥ 0 {\displaystyle t\geq 0} , but typically only a finite discrete set of levels in the scale-space representation would be actually considered. The scale parameter t = σ 2 {\displaystyle t=\sigma ^{2}} is the variance of the Gaussian filter and as a limit for t = 0 {\displaystyle t=0} the filter g {\displaystyle g} becomes an impulse function such that L ( x , y ; 0 ) = f ( x , y ) , {\displaystyle L(x,y;0)=f(x,y),} that is, the scale-space representation at scale level t = 0 {\displaystyle t=0} is the image f {\displaystyle f} itself. As t {\displaystyle t} increases, L {\displaystyle L} is the result of smoothing f {\displaystyle f} with a larger and larger filter, thereby removing more and more of the details that the image contains. Since the standard deviation of the filter is σ = t {\displaystyle \sigma ={\sqrt {t}}} , details that are significantly smaller than this value are to a large extent removed from the image at scale parameter t {\displaystyle t} , see the following figures and for graphical illustrations. === Why a Gaussian filter? === When faced with the task of generating a multi-scale representation one may ask: could any filter g of low-pass type and with a parameter t which determines its width be used to generate a scale space? The answer is no, as it is of crucial importance that the smoothing filter does not introduce new spurious structures at coarse scales that do not correspond to simplifications of corresponding structures at finer scales. In the scale-space literature, a number of different ways have been expressed to formulate this criterion in precise mathematical terms. The conclusion from several different axiomatic derivations that have been presented is that the Gaussian scale space constitutes the canonical way to generate a linear scale space, based on the essential requirement that new structures must not be created when going from a fine scale to any coarser scale. Conditions, referred to as scale-space axioms, that have been used for deriving the uniqueness of the Gaussian kernel include linearity, shift invariance, semi-group structure, non-enhancement of local extrema, scale invariance and rotational invariance. In the works, the uniqueness claimed in the arguments based on scale invariance has been criticized, and alternative self-similar scale-space kernels have been proposed. The Gaussian kernel is, however, a unique choice according to the scale-space axiomatics based on causality or non-enhancement of local extrema. === Alternative definition === Equivalently, the scale-space family can be defined as the solution of the diffusion equation (for example in terms of the heat equation), ∂ t L = 1 2 ∇ 2 L , {\displaystyle \partial _{t}L={\frac {1}{2}}\nabla ^{2}L,} with initial condition L ( x , y ; 0 ) = f ( x , y ) {\displaystyle L(x,y;0)=f(x,y)} . This formulation of the scale-space representation L means that it is possible to interpret the intensity values of the image f as a "temperature distribution" in the image plane and that the process that generates the scale-space representation as a function of t corresponds to heat diffusion in the image plane over time t (assuming the thermal conductivity of the material equal to the arbitrarily chosen constant 1/2). Although this connection may appear superficial for a reader not familiar with differential equations, it is indeed the case that the main scale-space formulation in terms of non-enhancement of local extrema is expressed in terms of a sign condition on partial derivatives in the 2+1-D volume generated by the scale space, thus within the framework of partial differential equations. Furthermore, a detailed analysis of the discrete case shows that the diffusion equation provides a unifying link between continuous and discrete scale spaces, which also generalizes to nonlinear scale spaces, for example, using anisotropic diffusion. Hence, one may say that the primary way to generate a scale space is by the diffusion equation, and that the Gaussian kernel arises as the Green's function of this specific partial differential equation. == Motivations == The motivation for generating a scale-space representation of a given data set originates from the basic observation that real-world objects are composed of different structures at different scales. This implies that real-world objects, in contrast to idealized mathematical entities such as points or lines, may appear in different ways depending on the scale of observation. For example, the concept of a "tree" is appropriate at the scale of meters, while concepts such as leaves and molecules are more appropriate at finer scales. For a computer vision system analysing an unknown scene, there is no way to know a priori what scales are appropriate for describing the interesting structures in the image data. Hence, the only reasonable approach is to consider descriptions at multiple scales in order to be able to capture the unknown scale variations that may occur. Taken to the limit, a scale-space representation considers representations at all scales. Another motivation to the scale-space concept originates from the process of performing a physical measurement on real-world data. In order to extract any information from a measurement process, one has to apply operators of non-infinitesimal size to the data. In many branches of computer science and applied mathematics, the size of the measurement operator is disregarded in the theoretical modelling of a problem. The scale-space theory on the other hand explicitly incorporates the need for a non-infinitesimal size of the image operators as an integral part of any measurement as well as any other operation that depends on a real-world measurement. There is a close link between scale-space theory and biological vision. Many scale-space operations show a high degree of similarity with receptive field profiles recorded from the mammalian retina and the first stages in the visual cortex. In these respects, the scale-space framework can be seen as a theoretically well-founded paradigm for early vision, which in addition has been thoroughly tested by algorithms and experiments. == Gaussian derivatives == At any scale in scale space, we c
Living lab
The concept of the living lab has been defined in multiple ways. A definition from the European Network of Living Labs (ENoLL) is used most widely, describing them as "user-centred open innovation ecosystems” that integrate research and innovation through co-creation in real-world environments.[1] Emerging at the intersection of ambient intelligence research and user experience methodologies in the late 1990s, the concept was pioneered at the Massachusetts Institute of Technology (MIT) as a way to study human interaction with new technologies in natural settings. Over time, living labs have evolved beyond their origins as controlled research environments, becoming dynamic platforms for participatory design, collaborative experimentation, and iterative innovation across various domains, including urban development, healthcare, sustainability, and digital technology. Characterized by principles such as real-world experimentation, active user involvement, and multi-stakeholder collaboration, living labs enable the continuous adaptation and validation of solutions in everyday contexts. Today, they are implemented globally, supported by networks like the European Network of Living Labs (ENoLL), and increasingly recognized as vital tools for addressing local and global transformation agendas. == Background == The term "living lab" has emerged in parallel from the ambient intelligence (AmI) research communities context and from the discussion on experience and application research (EAR). The emergence of the term is based on the concept of user experience and ambient intelligence. The term dates back to the late 1990s when Professor William J. Mitchell, Kent Larson, and Alex (Sandy) Pentland at the Massachusetts Institute of Technology were credited with first exploring the concept of a living laboratory. It was first associated with MIT's Media Lab as a concept for studying real-life contexts, where they described a living lab as a controlled environment designed to test new information and communication technology (ICT) innovations in a simulated home setting. This was also when some of the key characteristics often assigned to living labs today began to take shape. They argued that a living lab represents a user-centric research methodology for sensing, prototyping, validating and refining complex solutions in multiple and evolving real-life contexts. Research on living labs has expanded since the 1990s, especially in the 2010s, with growing interest in co-creation and participatory design. Particularly in Europe, the living lab evolved into a model that focused on studying user interactions with technology in real-world environments. This shift was influenced by earlier experiences in participatory design and social experiments with ICT. As interest grew, the term began to encompass a broader array of initiatives and projects, leading to variations in its interpretation and implementation. Today, living labs are used in various fields, such as technology, healthcare, and urban sustainability, showing a transition from a narrow focus on their role as controlled environments to a more wide-ranging understanding of collaborative innovation addressing real societal challenges, while also being referred to with various descriptions and definitions available from different sources. == Description == The ENoLL definition that refers to living labs as "user-centred open innovation ecosystems” that integrate research and innovation through co-creation in real-world environments is the most widely accepted description of living labs in academic literature. In simple terms, living labs can be described as an organization or experimental space, that can be both virtually or physically located, bringing different stakeholders from research, business, government, and citizens together to design and test solutions to be implemented in a real world environment. A common definition for the living lab term still does not exist to this day, which is due to the fact that living labs are interpreted and implemented across different contexts and can cover a wide range of activities and organizations, leading to different understandings of how living labs should function. Living labs also often operate in various territorial contexts (e.g. city, agglomeration, region, campus), and can vary in their methodological approach integrating concurrent research and innovation processes within a public-private-people partnership. Despite these variations, common characteristics include user-centricity, real-world experimentation, multi-stakeholder collaboration, and iterative innovation processes. The systematic user co-creation approach refers to integrating research and innovation processes through the co-creation, exploration, experimentation and evaluation of innovative ideas, scenarios, concepts and related technological artefacts in real life use cases. Such use cases involve user communities, not only as observed subjects but also as a source of creation. This approach allows all involved stakeholders to concurrently consider both the global performance of a product or service and its potential adoption by users. This consideration may be made at the earlier stage of research and development and through all elements of the product life-cycle, from design up to recycling. User-centred research methods, such as action research, community informatics, contextual design, user-centered design, participatory design, empathic design, emotional design, and other usability methods, already exist but fail to sufficiently empower users for co-creating into open development environments. More recently, the Web 2.0 has demonstrated the positive impact of involving user communities in new product development (NPD) such as mass collaboration projects (e.g. crowdsourcing, Wisdom of Crowds) in collectively creating new contents and applications. Real-world experimentation emphasizes conducting activities in real-life settings to ensure that the results of the projects and solutions are applicable to actual market conditions. Multi-stakeholder collaboration refers to an approach that involved various stakeholders, such as users, businesses, researchers, and government entities, working together towards a common goal. This is an important characteristics of living lab because collaboration of these diverse groups allows for exchange of ideas and perspectives, which are thought to enhance innovation processes. Iterative innovation processes involve a cyclical method of developing products or services, where stages such as research, development, testing, and implementation are revisited multiple times based on feedback and evaluation. This process allows for continuous improvement of the innovation, product, or service being developed. In particular, the ongoing involvement of the user creates feedback mechanisms that are ultimately key to successful development and implementation of products and services. A living lab is not similar to a testbed as its philosophy is to turn users, from being traditionally considered as observed subjects for testing modules against requirements, into value creation in contributing to the co-creation and exploration of emerging ideas, breakthrough scenarios, innovative concepts and related artefacts. Hence, a living lab rather constitutes an experiential environment, which could be compared to the concept of experiential learning, where users are immersed in a creative social space for designing and experiencing their own future. Living labs could also be used by policy makers and users/citizens for designing, exploring, experiencing and refining new policies and regulations in real-life scenarios for evaluating their potential impacts before their implementations. == European Network of Living Labs (ENoLL) == The European Network of Living Labs (ENoLL) is an international, non-profit, independent association of certified living labs, which popularized the living lab concept in the aim to increase user involvement in innovation. Formed in November 2006 under the guidance of the Finnish European Presidency, ENoLL is composed of a variety of stakeholders, including municipalities and research institutes, businesses, and users. Its primary role is to support the collaboration among living labs across Europe and includes many living labs focused on user-driven innovation across sectors. ENoLL focuses on facilitating knowledge exchange, joint actions and project partnerships among its historically labelled +/- 500 members, influencing EU policies, promoting living labs and enabling their implementation worldwide. ENoLL serves as a platform for linking living labs around the globe, which enables knowledge sharing and collaborative learning among diverse cultural environments. Membership to the platform is open to organizations worldwide, and ENoLL has expanded beyond Europe to include global members. ENoLL follows an application and accreditation pro
Pointer jumping
Pointer jumping or path doubling is a design technique for parallel algorithms that operate on pointer structures, such as linked lists and directed graphs. Pointer jumping allows an algorithm to follow paths with a time complexity that is logarithmic with respect to the length of the longest path. It does this by "jumping" to the end of the path computed by neighbors. The basic operation of pointer jumping is to replace each neighbor in a pointer structure with its neighbor's neighbor. In each step of the algorithm, this replacement is done for all nodes in the data structure, which can be done independently in parallel. In the next step when a neighbor's neighbor is followed, the neighbor's path already followed in the previous step is added to the node's followed path in a single step. Thus, each step effectively doubles the distance traversed by the explored paths. Pointer jumping is best understood by looking at simple examples such as list ranking and root finding. == List ranking == One of the simpler tasks that can be solved by a pointer jumping algorithm is the list ranking problem. This problem is defined as follows: given a linked list of N nodes, find the distance (measured in the number of nodes) of each node to the end of the list. The distance d(n) is defined as follows, for nodes n that point to their successor by a pointer called next: If n.next is nil, then d(n) = 0. For any other node, d(n) = d(n.next) + 1. This problem can easily be solved in linear time on a sequential machine, but a parallel algorithm can do better: given n processors, the problem can be solved in logarithmic time, O(log N), by the following pointer jumping algorithm: The pointer jumping occurs in the last line of the algorithm, where each node's next pointer is reset to skip the node's direct successor. It is assumed, as in common in the PRAM model of computation, that memory access are performed in lock-step, so that each n.next.next memory fetch is performed before each n.next memory store; otherwise, processors may clobber each other's data, producing inconsistencies. The following diagram follows how the parallel list ranking algorithm uses pointer jumping for a linked list with 11 elements. As the algorithm describes, the first iteration starts initialized with all ranks set to 1 except those with a null pointer for next. The first iteration looks at immediate neighbors. Each subsequent iteration jumps twice as far as the previous. Analyzing the algorithm yields a logarithmic running time. The initialization loop takes constant time, because each of the N processors performs a constant amount of work, all in parallel. The inner loop of the main loop also takes constant time, as does (by assumption) the termination check for the loop, so the running time is determined by how often this inner loop is executed. Since the pointer jumping in each iteration splits the list into two parts, one consisting of the "odd" elements and one of the "even" elements, the length of the list pointed to by each processor's n is halved in each iteration, which can be done at most O(log N) time before each list has a length of at most one. == Root finding == Following a path in a graph is an inherently serial operation, but pointer jumping reduces the total amount of work by following all paths simultaneously and sharing results among dependent operations. Pointer jumping iterates and finds a successor — a vertex closer to the tree root — each time. By following successors computed for other vertices, the traversal down each path can be doubled every iteration, which means that the tree roots can be found in logarithmic time. Pointer doubling operates on an array successor with an entry for every vertex in the graph. Each successor[i] is initialized with the parent index of vertex i if that vertex is not a root or to i itself if that vertex is a root. At each iteration, each successor is updated to its successor's successor. The root is found when the successor's successor points to itself. The following pseudocode demonstrates the algorithm. algorithm Input: An array parent representing a forest of trees. parent[i] is the parent of vertex i or itself for a root Output: An array containing the root ancestor for every vertex for i ← 1 to length(parent) do in parallel successor[i] ← parent[i] while true for i ← 1 to length(successor) do in parallel successor_next[i] ← successor[successor[i]] if successor_next = successor then break for i ← 1 to length(successor) do in parallel successor[i] ← successor_next[i] return successor The following image provides an example of using pointer jumping on a small forest. On each iteration the successor points to the vertex following one more successor. After two iterations, every vertex points to its root node. == History and examples == Although the name pointer jumping would come later, JáJá attributes the first uses of the technique in early parallel graph algorithms and list ranking. The technique has been described with other names such as shortcutting, but by the 1990s textbooks on parallel algorithms consistently used the term pointer jumping. Today, pointer jumping is considered a software design pattern for operating on recursive data types in parallel. As a technique for following linked paths, graph algorithms are a natural fit for pointer jumping. Consequently, several parallel graph algorithms utilizing pointer jumping have been designed. These include algorithms for finding the roots of a forest of rooted trees, connected components, minimum spanning trees, and biconnected components. However, pointer jumping has also shown to be useful in a variety of other problems including computer vision, image compression, and Bayesian inference.
Synthetic data
Synthetic data are artificially generated data not produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models. Data generated by a computer simulation can be seen as synthetic data. This encompasses most applications of physical modeling, such as music synthesizers or flight simulators. The output of such systems approximates the real thing, but is fully algorithmically generated. Synthetic data is used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. In many sensitive applications, datasets theoretically exist but cannot be released to the general public; synthetic data sidesteps the privacy issues that arise from using real consumer information without permission or compensation. == Usefulness == Synthetic data is generated to meet specific needs or certain conditions that may not be found in the original, real data. One of the hurdles in applying up-to-date machine learning approaches for complex scientific tasks is the scarcity of labeled data, a gap effectively bridged by the use of synthetic data, which closely replicates real experimental data. This can be useful when designing many systems, from simulations based on theoretical value, to database processors, etc. This helps detect and solve unexpected issues such as information processing limitations. Synthetic data are often generated to represent the authentic data and allows a baseline to be set. Another benefit of synthetic data is to protect the privacy and confidentiality of authentic data, while still allowing for use in testing systems. Computer security experts claim generated synthetic data "... enables us to create realistic behavior profiles for users and attackers. The data is used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment." In defense and military contexts, synthetic data is seen as a potentially valuable tool to develop and improve complex AI systems, particularly in contexts where high-quality real-world data is scarce. At the same time, synthetic data together with the testing approach can give the ability to model real-world scenarios. == History == Scientific modelling of physical systems has a long history that runs concurrent with the history of physics. For example, research into synthesis of audio and voice can be traced back to the 1930s and before, driven forward by the developments of the telephone and audio recording technologies. Digitization gave rise to software synthesizers from the 1970s onwards. In the context of privacy-preserving statistical analysis, in 1993, the idea of original fully synthetic data was created by Donald Rubin. Rubin originally designed this to synthesize the Decennial Census long form responses for the short form households. He then released samples that did not include any actual long form records - in this he preserved anonymity of the household. Later that year, the idea of original partially synthetic data was created by Little. Little used this idea to synthesize the sensitive values on the public use file. A 1993 work fitted a statistical model to 60,000 MNIST digits, then it was used to generate over 1 million examples. Those were used to train a LeNet-4 to reach state of the art performance. In 1994, Stephen Fienberg introduced 'critical refinement', in which a parametric posterior predictive distribution (instead of a Bayes bootstrap) is used to do the sampling. Later, other important contributors to the development of synthetic data generation were Trivellore Raghunathan, Jerry Reiter, Donald Rubin, John M. Abowd, and Jim Woodcock. Collectively they came up with a solution for how to treat partially synthetic data with missing data. Similarly, they developed the technique of Sequential Regression Multivariate Imputation. == Calculations == Researchers test the framework on synthetic data, which is "the only source of ground truth on which they can objectively assess the performance of their algorithms". Synthetic data can be generated through the use of random lines, having different orientations and starting positions. Datasets can get fairly complicated. A more complicated dataset can be generated by using a synthesizer build. To create a synthesizer build, first use the original data to create a model or equation that fits the data the best. This model or equation will be called a synthesizer build. This build can be used to generate more data. Constructing a synthesizer build involves constructing a statistical model. In a linear regression line example, the original data can be plotted, and a best fit linear line can be created from the data. This line is a synthesizer created from the original data. The next step will be generating more synthetic data from the synthesizer build or from this linear line equation. In this way, the new data can be used for studies and research, and it protects the confidentiality of the original data. David Jensen from the Knowledge Discovery Laboratory explains how to generate synthetic data: "Researchers frequently need to explore the effects of certain data characteristics on their data model." To help construct datasets exhibiting specific properties, such as auto-correlation or degree disparity, proximity can generate synthetic data having one of several types of graph structure: random graphs that are generated by some random process; lattice graphs having a ring structure; lattice graphs having a grid structure, etc. In all cases, the data generation process follows the same process: Generate the empty graph structure. Generate attribute values based on user-supplied prior probabilities. Since the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively. == Applications == === Fraud detection and confidentiality systems === Testing and training fraud detection and confidentiality systems are devised using synthetic data. Specific algorithms and generators are designed to create realistic data, which then assists in teaching a system how to react to certain situations or criteria. For example, intrusion detection software is tested using synthetic data. This data is a representation of the authentic data and may include intrusion instances that are not found in the authentic data. The synthetic data allows the software to recognize these situations and react accordingly. If synthetic data was not used, the software would only be trained to react to the situations provided by the authentic data and it may not recognize another type of intrusion. === Scientific research === Researchers doing clinical trials or any other research may generate synthetic data to aid in creating a baseline for future studies and testing. Real data can contain information that researchers may not want released, so synthetic data is sometimes used to protect the privacy and confidentiality of a dataset. Using synthetic data reduces confidentiality and privacy issues since it holds no personal information and cannot be traced back to any individual. Beyond privacy protection, synthetic data is also being explored for methodological innovation in drug development. For instance, synthetic data may be used to construct synthetic control arms as an alternative to conventional external control arms based on real-world data (RWD) or randomized controlled trials (RCTs). Collectively, regulatory agencies such as the FDA and EMA appear to be at various stages of recognizing and integrating AI-generated synthetic data into their methodologies. While there is growing consensus on the potential of such data to support model development and the broader lifecycle of medicinal products, to date no drug or medical device has been approved using solely or predominantly synthetic data—particularly not as a comparator arm generated entirely via data-driven algorithms. The quality and statistical handling of synthetic data are expected to become more prominent in future regulatory discussions, particularly in contexts such as predictive modeling (e.g., digital twins), where innovative approaches have already been referenced. === Machine learning === Synthetic data is increasingly being used for machine learning applications: a model is trained on a synthetically generated dataset with the intention of transfer learning to real data. Efforts have been made to enable more data science experiments via the construction of general-purpose synthetic data generators, such as the Synthetic Data Vault. In general, synthetic data has several natural advantages: once the synthetic environment is ready, it is fast and cheap to produce as much data as needed; synthetic data can have perfectly accurate labels, including labeling that may be very expensive or impo
Database virtualization
Database virtualization is the decoupling of the database layer, which lies between the storage and application layers within the application stack. Virtualization of the database layer enables a shift away from the physical, toward the logical or virtual. Virtualization enables compute and storage resources to be pooled and allocated on demand. This enables both the sharing of single server resources for multi-tenancy, as well as the pooling of server resources into a single logical database or cluster. In both cases, database virtualization provides increased flexibility, more granular and efficient allocation of pooled resources, and more scalable computing. == Virtual data partitioning == The act of partitioning data stores as a database grows has been in use for several decades. There are two primary ways that data has been partitioned inside legacy data management systems: Shared-data databases: an architecture that assumes all database cluster nodes share a single partition. Inter-node communications are used to synchronize update activities performed by different nodes on the cluster. Shared-data data management systems are limited to single-digit node clusters. Shared-nothing databases: an architecture in which all data is segregated to internally managed partitions with clear, well-defined data location boundaries. Shared-nothing databases require manual partition management. In virtual partitioning, logical data is abstracted from physical data by autonomously creating and managing large numbers of data partitions (100s to 1000s). Because they are autonomously maintained, the resources required to manage the partitions are minimal. This kind of massive partitioning results in: Partitions that are small, efficiently managed, and load-balanced. Systems that do not require re-partitioning events to define additional partitions, even when the hardware is changed. “Shared-data” and “shared-nothing” architectures allow scalability through multiple data partitions and cross-partition querying and transaction processing without full partition scanning. == Horizontal data partitioning == Partitioning database sources from consumers is a fundamental concept. With greater numbers of database sources, inserting a horizontal data virtualization layer between the sources and consumers helps address this complexity. Rick van der Lans, the author of multiple books on SQL and relational databases, has defined data virtualization as "the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology." == Advantages == Added flexibility and agility for existing computing infrastructure. Enhanced database performance. Pooling and sharing computing resources, either splitting them (multi-tenancy) or combining them (clustering). Simplification of administration and management. Increased fault tolerance.
Algorithmic transparency
Algorithmic transparency is the principle that the factors that influence the decisions made by algorithms should be visible, or transparent, to the people who use, regulate, and are affected by systems that employ those algorithms. Although the phrase was coined in 2016 by Nicholas Diakopoulos and Michael Koliska about the role of algorithms in deciding the content of digital journalism services, the underlying principle dates back to the 1970s and the rise of automated systems for scoring consumer credit. The phrases "algorithmic transparency" and "algorithmic accountability" are sometimes used interchangeably – especially since they were coined by the same people – but they have subtly different meanings. Specifically, "algorithmic transparency" states that the inputs to the algorithm and the algorithm's use itself must be known, but they need not be fair. "Algorithmic accountability" implies that the organizations that use algorithms must be accountable for the decisions made by those algorithms, even though the decisions are being made by a machine, and not by a human being. Current research around algorithmic transparency interested in both societal effects of accessing remote services running algorithms, as well as mathematical and computer science approaches that can be used to achieve algorithmic transparency. In the United States, the Federal Trade Commission's Bureau of Consumer Protection studies how algorithms are used by consumers by conducting its own research on algorithmic transparency and by funding external research. In the European Union, the data protection laws that came into effect in May 2018 include a "right to explanation" of decisions made by algorithms, though it is unclear what this means. Furthermore, the European Union founded The European Center for Algorithmic Transparency (ECAT).