AI Generator Zootopia

AI Generator Zootopia — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • AI Security Institute

    AI Security Institute

    The AI Security Institute (AISI) is a research organisation under the Department for Science, Innovation and Technology, UK, that aims "to equip governments with a scientific understanding of the risks posed by advanced AI". It conducts research and develop and test mitigations. Previously, it was known as the AI Safety Institute. Its creation followed world's first major AI Safety Summit that was held in Bletchley Park in 2023. The institute's professed goal is "building the world's leading understanding of advanced AI risks and solutions, to inform governments so they can keep the public safe". It is designed like a startup in the government "combining the authority of government with the expertise and agility of the private sector". AISI has made access agreements with Anthropic, Google and OpenAI to test their models before release. It has an open source platform called Inspect that permits companies, governments and academics to run standardised safety tests for AI usage. Among the works AISI has done is the reported detection of multiple serious vulnerabilities that could enable development of biological weapons; the vulnerabilities were fixed before the model was launched. It conducts research on diverse fields of AI application. One study by AISI found that LLMs post-trained for political persuasiveness became systematically less accurate and up to 51% more persuasive on political issues. AISI has also worked on the usage of AI for emotional needs. It found that nearly 10 percent of UK citizens used systems like chatbots for emotional purposes on a weekly basis. It found that "systems are now outperforming PhD-level researchers on scientific knowledge tests and helping non-experts succeed at lab work that would previously have been out of reach" in a report published in December 2025. Former chief AI officer of GCHQ Adam Beaumont is the institution's interim director. UK prime minister's AI advisor Jade Leung is the chief technology officer.

    Read more →
  • Information access

    Information access

    Information access is the freedom or ability to identify, obtain and make use of database or information effectively. There are various research efforts in information access for which the objective is to simplify and make it more effective for human users to access and further process large and unwieldy amounts of data and information. == Technology == Several technologies applicable to the general area are Information Retrieval, Text Mining, Machine Translation, and Text Categorisation. During discussions on free access to information as well as on information policy, information access is understood as concerning the insurance of free and closed access to information. Information access covers many issues including copyright, open source, privacy, and security. == Groups == Groups such as the American Library Association, the American Association of Law Libraries, Ralph Nader's Taxpayers Assets Project have advocated for free access to legal information. The vendor neutral citation movement in the legal field is working to ensure that courts will accept citations from cases on the web which do not have the traditional (copyrighted) page numbers from the West Publishing company. There is a worldwide Free Access to Law Movement which advocates free access to legal information. The Wired article "Who Owns The Law" is an introduction to the access to legal information issue. Postsecondary organizations such as K-12 work to share information. They feel it is a legal and moral obligation to provide access (including to people with disabilities or impairments) to information through the services and programs they offer. Some effects of charging for information access, such as literature searches for physicians, is studied in the article "Fee or Free: The Effect of Charging on Information Demand". In this study, a $5 charge resulted in a 77% decrease in searches.

    Read more →
  • Sieve of Eratosthenes

    Sieve of Eratosthenes

    In mathematics, the sieve of Eratosthenes is an ancient algorithm for finding all prime numbers up to any given limit. It does so by iteratively marking as composite (i.e., not prime) the multiples of each prime, starting with the first prime number, 2. The multiples of a given prime are generated as a sequence of numbers starting from that prime, with constant difference between them that is equal to that prime. This is the sieve's key distinction from using trial division to sequentially test each candidate number for divisibility by each prime. Once all the multiples of each discovered prime have been marked as composites, the remaining unmarked numbers are primes. The earliest known reference to the sieve (Ancient Greek: κόσκινον Ἐρατοσθένους, kóskinon Eratosthénous) is in Nicomachus of Gerasa's Introduction to Arithmetic, an early 2nd-century CE book which attributes it to Eratosthenes of Cyrene, a 3rd-century BCE Greek mathematician, though describing the sieving by odd numbers instead of by primes. One of a number of prime number sieves, it is one of the most efficient ways to find all of the smaller primes. It may be used to find primes in arithmetic progressions. == Overview == A prime number is a natural number that has exactly two distinct natural number divisors: the number 1 and itself. To find all the prime numbers less than or equal to a given integer n by Eratosthenes's method: Create a list of consecutive integers from 2 through n: (2, 3, 4, ..., n). Initially, let p equal 2, the smallest prime number. Enumerate the multiples of p by counting in increments of p from 2p to n, and mark them in the list (these will be 2p, 3p, 4p, ...; the p itself should not be marked). Find the smallest number in the list greater than p that is not marked. If there was no such number, stop. Otherwise, let p now equal this new number (which is the next prime), and repeat from step 3. When the algorithm terminates, the numbers remaining not marked in the list are all the primes below n. The main idea here is that every value given to p will be prime, because if it were composite it would be marked as a multiple of some other, smaller prime. Note that some of the numbers may be marked more than once (e.g., 15 will be marked both for 3 and 5). The key property of the sieve is that only additions are needed, no multiplications or divisions are used. As a refinement, it is sufficient to mark the numbers in step 3 starting from p2, as all the smaller multiples of p will have already been marked at that point. This means that the algorithm is allowed to terminate in step 4 when p2 is greater than n. Another refinement is to initially list odd numbers only, (3, 5, ..., n), and count in increments of 2p in step 3, thus marking only odd multiples of p. This actually appears in the original algorithm. This can be generalized with wheel factorization, forming the initial list only from numbers coprime with the first few primes and not just from odds (i.e., numbers coprime with 2), and counting in the correspondingly adjusted increments so that only such multiples of p are generated that are coprime with those small primes, in the first place. === Example === To find all the prime numbers less than or equal to 30, proceed as follows. First, generate a list of natural numbers from 2 to 30: 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 The first number in the list is 2; cross out every 2nd number in the list after 2 by counting up from 2 in increments of 2 (these will be all the multiples of 2 in the list): 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 The next number in the list after 2 is 3; cross out every 3rd number in the list after 3 by counting up from 3 in increments of 3 (these will be all the multiples of 3 in the list): 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 The next number not yet crossed out in the list after 3 is 5; cross out every 5th number in the list after 5 by counting up from 5 in increments of 5 (i.e. all the multiples of 5): 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 The next number not yet crossed out in the list after 5 is 7; the next step would be to cross out every 7th number in the list after 7, but they are all already crossed out at this point, as these numbers (14, 21, 28) are also multiples of smaller primes because 7 × 7 is greater than 30. The numbers not crossed out at this point in the list are all the prime numbers below 30: 2 3 5 7 11 13 17 19 23 29 == Algorithm and variants == === Pseudocode === The sieve of Eratosthenes can be expressed in pseudocode, as follows: algorithm Sieve of Eratosthenes is input: an integer n > 1. output: all prime numbers from 2 through n. let A be an array of Boolean values, indexed by integers 2 to n, initially all set to true. for i = 2, 3, 4, ..., not exceeding √n do if A[i] is true for j = i2, i2+i, i2+2i, i2+3i, ..., not exceeding n do set A[j] := false return all i such that A[i] is true. This algorithm produces all primes not greater than n. It includes a common optimization, which is to start enumerating the multiples of each prime i from i2. The time complexity of this algorithm is O(n log log n), provided the array update is an O(1) operation, as is usually the case. === Segmented sieve === As Sorenson notes, the problem with the sieve of Eratosthenes is not the number of operations it performs but rather its memory requirements. For large n, the range of primes may not fit in memory; worse, even for moderate n, its cache use is highly suboptimal. The algorithm walks through the entire array A, exhibiting almost no locality of reference. A solution to these problems is offered by segmented sieves, where only portions of the range are sieved at a time. These have been known since the 1970s, and work as follows: Divide the range 2 through n into segments of some size Δ ≥ √n. Find the primes in the first (i.e. the lowest) segment, using the regular sieve. For each of the following segments, in increasing order, with m being the segment's topmost value, find the primes in it as follows: Set up a Boolean array of size Δ. Mark as non-prime the positions in the array corresponding to the multiples of each prime p ≤ √m found so far, by enumerating its multiples in steps of p starting from the lowest multiple of p between m - Δ and m. The remaining non-marked positions in the array correspond to the primes in the segment. It is not necessary to mark any multiples of these primes, because all of these primes are larger than √m, as for k ≥ 1, one has ( k Δ + 1 ) 2 > ( k + 1 ) Δ {\displaystyle (k\Delta +1)^{2}>(k+1)\Delta } . If Δ is chosen to be √n, the space complexity of the algorithm is O(√n), while the time complexity is the same as that of the regular sieve. For ranges with upper limit n so large that the sieving primes below √n as required by the page segmented sieve of Eratosthenes cannot fit in memory, a slower but much more space-efficient sieve like the pseudosquares prime sieve, developed by Jonathan P. Sorenson, can be used instead. === Incremental sieve === An incremental formulation of the sieve generates primes indefinitely (i.e., without an upper bound) by interleaving the generation of primes with the generation of their multiples (so that primes can be found in gaps between the multiples), where the multiples of each prime p are generated directly by counting up from the square of the prime in increments of p (or 2p for odd primes). The generation must be initiated only when the prime's square is reached, to avoid adverse effects on efficiency. It can be expressed symbolically under the dataflow paradigm as primes = [2, 3, ...] \ [[p², p²+p, ...] for p in primes], using list comprehension notation with \ denoting set subtraction of arithmetic progressions of numbers. Primes can also be produced by iteratively sieving out the composites through divisibility testing by sequential primes, one prime at a time. It is not the sieve of Eratosthenes but is often confused with it, even though the sieve of Eratosthenes directly generates the composites instead of testing for them. Trial division has worse theoretical complexity than that of the sieve of Eratosthenes in generating ranges of primes. When testing each prime, the optimal trial division algorithm uses all prime numbers not exceeding its square root, whereas the sieve of Eratosthenes produces each composite from its prime factors only, and gets the primes "for free", between the composites. The widely known 1975 functional sieve code by David Turner is often presented as an example of the sieve of Eratosthenes but is actually a sub-optimal trial division sieve. == Algorithmic complexity == The sieve of Eratosthenes is a popular way to benchmark computer performance. The time complexity of calculating all primes below n in the random access machine model is O(n log log n) ope

    Read more →
  • Adaptive algorithm

    Adaptive algorithm

    An adaptive algorithm is an algorithm that changes its behavior at the time it is run, based on information available and on a priori defined reward mechanism (or criterion). Such information could be the story of recently received data, information on the available computational resources, or other run-time acquired (or a priori known) information related to the environment in which it operates. Among the most used adaptive algorithms is the Widrow-Hoff’s least mean squares (LMS), which represents a class of stochastic gradient-descent algorithms used in adaptive filtering and machine learning. In adaptive filtering the LMS is used to mimic a desired filter by finding the filter coefficients that relate to producing the least mean square of the error signal (difference between the desired and the actual signal). For example, stable partition, using no additional memory is O(n lg n) but given O(n) memory, it can be O(n) in time. As implemented by the C++ Standard Library, stable_partition is adaptive and so it acquires as much memory as it can get (up to what it would need at most) and applies the algorithm using that available memory. Another example is adaptive sort, whose behavior changes upon the presortedness of its input. An example of an adaptive algorithm in radar systems is the constant false alarm rate (CFAR) detector. In machine learning and optimization, many algorithms are adaptive or have adaptive variants, which usually means that the algorithm parameters such as learning rate are automatically adjusted according to statistics about the optimisation thus far (e.g. the rate of convergence). Examples include adaptive simulated annealing, adaptive coordinate descent, adaptive quadrature, AdaBoost, Adagrad, Adadelta, RMSprop, and Adam. In data compression, adaptive coding algorithms such as Adaptive Huffman coding or Prediction by partial matching can take a stream of data as input, and adapt their compression technique based on the symbols that they have already encountered. In signal processing, the Adaptive Transform Acoustic Coding (ATRAC) codec used in MiniDisc recorders is called "adaptive" because the window length (the size of an audio "chunk") can change according to the nature of the sound being compressed, to try to achieve the best-sounding compression strategy.

    Read more →
  • Optical granulometry

    Optical granulometry

    Optical granulometry is the process of measuring the different grain sizes in a granular material, based on a photograph. Technology has been created to analyze a photograph and create statistics based on what the picture portrays. This information is vital in maintaining machinery in various trades worldwide. Mining companies can use optical granulometry to analyze inactive or moving rock to quantify the size of these fragments. Forestry companies can zero in on wood chip sizes without stopping the production process, and minimize sizing errors. With more photoanalysis technologies being produced, mining companies have shown an increased interest in these types of systems because of their ability to maintain efficiency throughout the mining process. Companies are saving millions of dollars annually because of this new technology, and are cutting back on maintenance costs on equipment. In order for optical granulometry to be completely successful, an accurate photo must be taken – under sufficient lighting, and using proper technology – to obtain quantified results. If these requirements are met, an image analysis system can be implemented. == The process == Software uses four basic steps in determining the average size of material: See the Wikipedia article on Photoanalysis to see how mining, forestry and agricultural companies are using this technology to improve quality control techniques. == Smartphone-based, segmentation-free estimation of grain size distribution == Recently, a methodology has emerged by which soil grain size distribution can be inferred from optical images acquired with commodity smartphones by training convolutional neural networks to predict parameters of the distribution curve directly from the image, without explicit image segmentation . In this approach, a standardized image of a soil surface is captured under controlled conditions, preprocessed to reduce device-specific variability, and passed to a regression model that outputs the parameters of a cumulative distribution function e.g., a two-parameter Weibull curve. The resulting distribution can be used to derive geotechnical descriptors and class boundaries.

    Read more →
  • Leiden algorithm

    Leiden algorithm

    The Leiden algorithm is a community detection algorithm developed by Traag et al at Leiden University. It was developed as a modification of the Louvain method. Like the Louvain method, the Leiden algorithm attempts to optimize modularity in extracting communities from networks; however, it addresses key issues present in the Louvain method, namely poorly connected communities and the resolution limit of modularity. == Improvement over Louvain method == Broadly, the Leiden algorithm uses the same two primary phases as the Louvain algorithm: a local node moving step (though, the method by which nodes are considered in Leiden is more efficient) and a graph aggregation step. However, to address the issues with poorly-connected communities and the merging of smaller communities into larger communities (the resolution limit of modularity), the Leiden algorithm employs an intermediate refinement phase in which communities may be split to guarantee that all communities are well-connected. Consider, for example, the following graph: Three communities are present in this graph (each color represents a community). Additionally, the center "bridge" node (represented with an extra circle) is a member of the community represented by blue nodes. Now consider the result of a node-moving step which merges the communities denoted by red and green nodes into a single community (as the two communities are highly connected): Notably, the center "bridge" node is now a member of the larger red community after node moving occurs (due to the greedy nature of the local node moving algorithm). In the Louvain method, such a merging would be followed immediately by the graph aggregation phase. However, this causes a disconnection between two different sections of the community represented by blue nodes. In the Leiden algorithm, the graph is instead refined: The Leiden algorithm's refinement step ensures that the center "bridge" node is kept in the blue community to ensure that it remains intact and connected, despite the potential improvement in modularity from adding the center "bridge" node to the red community. == Graph components == Before defining the Leiden algorithm, it will be helpful to define some of the components of a graph. === Vertices and edges === A graph is composed of vertices (nodes) and edges. Each edge is connected to two vertices, and each vertex may be connected to zero or more edges. Edges are typically represented by straight lines, while nodes are represented by circles or points. In set notation, let V {\displaystyle V} be the set of vertices, and E {\displaystyle E} be the set of edges: V := { v 1 , v 2 , … , v n } E := { e i j , e i k , … , e k l } {\displaystyle {\begin{aligned}V&:=\{v_{1},v_{2},\dots ,v_{n}\}\\E&:=\{e_{ij},e_{ik},\dots ,e_{kl}\}\end{aligned}}} where e i j {\displaystyle e_{ij}} is the directed edge from vertex v i {\displaystyle v_{i}} to vertex v j {\displaystyle v_{j}} . We can also write this as an ordered pair: e i j := ( v i , v j ) {\displaystyle {\begin{aligned}e_{ij}&:=(v_{i},v_{j})\end{aligned}}} === Community === A community is a unique set of nodes: C i ⊆ V C i ⋂ C j = ∅ ∀ i ≠ j {\displaystyle {\begin{aligned}C_{i}&\subseteq V\\C_{i}&\bigcap C_{j}=\emptyset ~\forall ~i\neq j\end{aligned}}} and the union of all communities must be the total set of vertices: V = ⋃ i = 1 C i {\displaystyle {\begin{aligned}V&=\bigcup _{i=1}C_{i}\end{aligned}}} === Partition === A partition is the set of all communities: P = { C 1 , C 2 , … , C n } {\displaystyle {\begin{aligned}{\mathcal {P}}&=\{C_{1},C_{2},\dots ,C_{n}\}\end{aligned}}} == Partition quality == How communities are partitioned is an integral part on the Leiden algorithm. How partitions are decided can depend on how their quality is measured. Additionally, many of these metrics contain parameters of their own that can change the outcome of their communities. === Modularity === Modularity is a highly used quality metric for assessing how well a set of communities partition a graph. The equation for this metric is defined for an adjacency matrix, A, as: Q = 1 2 m ∑ i j ( A i j − k i k j 2 m ) δ ( c i , c j ) {\displaystyle Q={\frac {1}{2m}}\sum _{ij}(A_{ij}-{\frac {k_{i}k_{j}}{2m}})\delta (c_{i},c_{j})} where: A i j {\displaystyle A_{ij}} represents the edge weight between nodes i {\displaystyle i} and j {\displaystyle j} ; see Adjacency matrix; k i {\displaystyle k_{i}} and k j {\displaystyle k_{j}} are the sum of the weights of the edges attached to nodes i {\displaystyle i} and j {\displaystyle j} , respectively; m {\displaystyle m} is the sum of all of the edge weights in the graph; c i {\displaystyle c_{i}} and c j {\displaystyle c_{j}} are the communities to which the nodes i {\displaystyle i} and j {\displaystyle j} belong; and δ {\displaystyle \delta } is Kronecker delta function: δ ( c i , c j ) = { 1 if c i and c j are the same community 0 otherwise {\displaystyle {\begin{aligned}\delta (c_{i},c_{j})&={\begin{cases}1&{\text{if }}c_{i}{\text{ and }}c_{j}{\text{ are the same community}}\\0&{\text{otherwise}}\end{cases}}\end{aligned}}} === Reichardt Bornholdt Potts Model (RB) === One of the most well used metrics for the Leiden algorithm is the Reichardt Bornholdt Potts Model (RB). This model is used by default in most mainstream Leiden algorithm libraries under the name RBConfigurationVertexPartition. This model introduces a resolution parameter γ {\displaystyle \gamma } and is highly similar to the equation for modularity. This model is defined by the following quality function for an adjacency matrix, A, as: Q = ∑ i j ( A i j − γ k i k j 2 m ) δ ( c i , c j ) {\displaystyle Q=\sum _{ij}(A_{ij}-\gamma {\frac {k_{i}k_{j}}{2m}})\delta (c_{i},c_{j})} where: γ {\displaystyle \gamma } represents a linear resolution parameter === Constant Potts Model (CPM) === Another metric similar to RB is the Constant Potts Model (CPM). This metric also relies on a resolution parameter γ {\displaystyle \gamma } The quality function is defined as: H = − ∑ i j ( A i j w i j − γ ) δ ( c i , c j ) {\displaystyle H=-\sum _{ij}(A_{ij}w_{ij}-\gamma )\delta (c_{i},c_{j})} === Understanding Potts Model resolution parameters/Resolution limit === Typically Potts models such as RB or CPM include a resolution parameter in their calculation. Potts models are introduced as a response to the resolution limit problem that is present in modularity maximization based community detection. The resolution limit problem is that, for some graphs, maximizing modularity may cause substructures of a graph to merge and become a single community and thus smaller structures are lost. These resolution parameters allow modularity adjacent methods to be modified to suit the requirements of the user applying the Leiden algorithm to account for small substructures at a certain granularity. The figure on the right illustrates why resolution can be a helpful parameter when using modularity based quality metrics. In the first graph, modularity only captures the large scale structures of the graph; however, in the second example, a more granular quality metric could potentially detect all substructures in a graph. == Algorithm == The Leiden algorithm starts with a graph of disorganized nodes (a) and sorts it by partitioning them to maximize modularity (the difference in quality between the generated partition and a hypothetical randomized partition of communities). The method it uses is similar to the Louvain algorithm, except that after moving each node it also considers that node's neighbors that are not already in the community it was placed in. This process results in our first partition (b), also referred to as P {\displaystyle {\mathcal {P}}} . Then the algorithm refines this partition by first placing each node into its own individual community and then moving them from one community to another to maximize modularity. It does this iteratively until each node has been visited and moved, and each community has been refined - this creates partition (c), which is the initial partition of P refined {\displaystyle {\mathcal {P}}_{\text{refined}}} . Then an aggregate network (d) is created by turning each community into a node. P refined {\displaystyle {\mathcal {P}}_{\text{refined}}} is used as the basis for the aggregate network while P {\displaystyle {\mathcal {P}}} is used to create its initial partition. Because we use the original partition P {\displaystyle {\mathcal {P}}} in this step, we must retain it so that it can be used in future iterations. These steps together form the first iteration of the algorithm. In subsequent iterations, the nodes of the aggregate network (which each represent a community) are once again placed into their own individual communities and then sorted according to modularity to form a new P refined {\displaystyle {\mathcal {P}}_{\text{refined}}} , forming (e) in the above graphic. In the case depicted by the graph, the nodes were already sorted optimally, so no change too

    Read more →
  • Master data management

    Master data management

    Master data management (MDM) is a discipline in which business and information technology collaborate to ensure the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise's official shared master data assets. == Reasons for master data management == Data consistency and accuracy: MDM ensures that the organization's critical data is consistent and accurate across all systems, reducing discrepancies and errors caused by multiple, siloed copies of the same data. Improved decision-making: By providing a single version of the truth (SVOT), MDM enables organizations to deliver the right data to decision makers, allowing them to clearly understand business performance and make informed, data-driven decisions. Operational efficiency: With the consistent and accurate data provided by an MDM, operational processes such as reporting and inventory management can be automated to improve efficiency. Employee learning, onboarding, and customer service also become more efficient, as MDM data facilitates rapid, accurate, and thorough information retrieval, permitting more employee time to be spent on work. Regulatory compliance: MDM tries to help organizations comply with industry standards and regulations by ensuring that master data is accurately recorded, maintained, and audited. However, issues with data quality, classification, and reconciliation may require data transformation. As with other Extract, Transform, Load-based data movements, these processes are expensive and inefficient, reducing return on investment for a project. == Business unit and product line segmentation == As a result of business unit and product line segmentation, the same entity (whether a customer, supplier, or product) will be included in different product lines. This leads to data redundancy and even confusion. For example, a customer takes out a mortgage at a bank. If the marketing and customer service departments have separate databases, advertisements might still be sent to the customer, even though they've already signed up. The two parts of the bank are unaware, and the customer is sent irrelevant communications. Record linkage can associate different records corresponding to the same entity, mitigating this issue. == Mergers and acquisitions == One of the most common problems for master data management is company growth through mergers or acquisitions. Reconciling these separate master data systems can present difficulties, as existing applications have dependencies on the master databases. Ideally, database administrators resolve this problem through deduplication of the master data as part of the merger. Over time, as further mergers and acquisitions occur, the problem can multiply. Data reconciliation processes can become extremely complex or even unreliable. Some organizations end up with 10, 15, or even 100 separate and poorly integrated master databases. This can cause serious problems in customer satisfaction, operational efficiency, decision support, and regulatory compliance. Another problem involves determining the proper degrees of detail and normalization to include in the master data schema. For example, in a federated Human Resources environment, the enterprise software may focus on storing people's data as current status, adding a few fields to identify the date of hire, date of last promotion, etc. However, this simplification can introduce business-impacting errors into dependent systems for planning and forecasting. The stakeholders of such systems may be forced to build a parallel network of new interfaces to track the onboarding of new hires, planned retirements, and divestment, which works against one of the aims of master data management. == People, processes and technology == Master data management is enabled by technology, but is more than the technologies that enable it. An organization's master data management capability will also include people and processes in its definition. === People === Several roles should be staffed within MDM. Most prominently, the Data Owner and the Data Steward. Several people would likely be allocated to each role and each person responsible for a subset of Master Data (e.g. one data owner for employee master data, another for customer master data). The Data Owner is responsible for the requirements for data definition, data quality, data security, etc. as well as for compliance with data governance and data management procedures. The Data Owner should also be funding improvement projects in case of deviations from the requirements. The Data Steward is running the master data management on behalf of the data owner and probably also being an advisor to the Data Owner. === Processes === Master data management can be viewed as a "discipline for specialized quality improvement" defined by the policies and procedures put in place by a data governance organization. It has the objective of providing processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing master data throughout an organization to ensure a common understanding, consistency, accuracy and control, in the ongoing maintenance and application use of that data. Processes commonly seen in master data management include source identification, data collection, data transformation, normalization, rule administration, error detection and correction, data consolidation, data storage, data distribution, data classification, taxonomy services, item master creation, schema mapping, product codification, data enrichment, hierarchy management, business semantics management and data governance. === Technology === A master data management tool can be used to support master data management by removing duplicates, standardizing data (mass maintaining), and incorporating rules to eliminate incorrect data from entering the system to create an authoritative source of master data. Master data are the products, accounts, and parties for which the business transactions are completed. Where the technology approach produces a "golden record" or relies on a "source of record" or "system of record", it is common to talk of where the data is "mastered". This is accepted terminology in the information technology industry, but care should be taken, both with specialists and with the wider stakeholder community, to avoid confusing the concept of "master data" with that of "mastering data". ==== Implementation models ==== There are several models for implementing a technology solution for master data management. These depend on an organization's core business, its corporate structure, and its goals. These include: Source of record Registry Consolidation Coexistence Transaction/centralized ===== Source of record ===== This model identifies a single application, database, or simpler source (e.g. a spreadsheet) as being the "source of record" (or "system of record" where solely application databases are relied on). The benefit of this model is its conceptual simplicity, but it may not fit with the realities of complex master data distribution in large organizations. The source of record can be federated, for example by groups of attributes (so that different attributes of a master data entity may have different sources of record) or geographically (so that different parts of an organization may have different master sources). Federation is only applicable in certain use cases, where there is a clear delineation of which subsets of records will be found in which sources. The source of record model can be applied more widely than simply to master data, for example to reference data. ==== Transmission of master data ==== There are several ways in which master data may be collated and distributed to other systems. This includes: Data consolidation – The process of capturing master data from multiple sources and integrating it into a single hub (operational data store) for replication to other destination systems. Data federation – The process of providing a single virtual view of master data from one or more sources to one or more destination systems. Data propagation – The process of copying master data from one system to another, typically through point-to-point interfaces in legacy systems. == Change management in implementation == Challenges in adopting master data management within large organizations often arise when stakeholders disagree on a "single version of the truth" concept is not affirmed by stakeholders, who believe that their local definition of the master data is necessary. For example, the product hierarchy used to manage inventory may be entirely different from the product hierarchies used to support marketing efforts or pay sales representatives. It is above all necessary to identify if different master data is genuinely required. If it is required, then the solution implemented (technology and process) must be able to allow multiple versions of the truth to exist but will prov

    Read more →
  • Virtual facility

    Virtual facility

    A Virtual Facility (VF) is a highly realistic digital representation of a data center, used to model all relevant aspects of a physical data center with a high degree of precision. The term "virtual" in Virtual Facility refers to its use of virtual reality, rather than the abstraction of computer resources as seen in platform virtualization. The VF mirrors the characteristics of a physical facility over time and allows for detailed analysis and modeling. == VF Model features == A standard VF model includes: Three-dimensional physical facility layout Network connectivity of facility equipment Full inventory of facility equipment, including electronics and electrical systems such as power distribution units (PDUs) and uninterruptible power supplies (UPSs) Full air conditioning system (ACUs) and controls within the room The term Virtual Facility was introduced to address the emerging environmental problems facing modern Mission Critical Facilities (MCFs). This concept combines virtual reality (VR), computer simulation, and expert systems applied to the domain of facilities. The VF type of computer simulation allows for detailed analysis and prototyping of airflow in the data center using computational fluid dynamics (CFD) techniques. This enables the visualization and numerical analysis of airflow and temperatures within the facility, helping to predict real-world outcomes. == VF applications == The VF model can be used to assist with the following: Greenfield design Asset management Troubleshooting existing data centers Making existing data centers more resilient Making existing data centers more energy efficient Cost prediction Staff training Capacity planning Load growth management Many organizations use VF models to virtually assess scenarios before committing resources to physical changes. This allows for better decision-making regarding the addition or modification of equipment, helping to avoid logistical or thermal problems.

    Read more →
  • Wadhwani Institute for Artificial Intelligence

    Wadhwani Institute for Artificial Intelligence

    Wadhwani AI, based in Mumbai, Maharashtra, is an independent, non-profit institute. Founded in 2018, it is dedicated to developing Artificial intelligence solutions for social good. Their mission is to build AI-based innovations and solutions for underserved communities in developing countries, for a wide range of domains including agriculture, education, financial inclusion, healthcare, and infrastructure. == History and funding == The institute was founded with a $30 million philanthropic effort by the Wadhwani brothers, Romesh Wadhwani and Sunil Wadhwani. The institute was inaugurated and dedicated to the nation by Narendra Modi, the 14th Prime Minister of India. In 2019, the institute received a $2 million grant from Google.org to create technologies to help reduce crop losses in cotton farming, through integrated pest management. The United States Agency for International Development awarded $2 million to the institute in 2020 to develop tools, using mathematical modeling techniques and digital technologies such as artificial intelligence and machine learning, to forecast COVID-19 disease patterns, estimate resources needed, and plan interventions. == Collaboration == With assistance from Google, the Ministry of Agriculture and Farmers' Welfare and the Wadhwani AI developed Krishi 24/7, the first AI-powered automated agricultural news monitoring and analysis tool. Through better decision-making, Krishi 24/7 will support the identification of valuable news, provide timely notifications, and respond quickly to safeguard farmers' interests and advance sustainable agricultural growth. The application converts news articles into English after scanning them in several languages. It ensures that the ministry is informed in a timely manner about pertinent occurrences that are published online by extracting key information from news items, including the headline, crop name, event type, date, location, severity, summary, and source link. The National Center for Disease Control has effectively implemented a comparable automated surveillance and analysis tool for disease outbreaks.

    Read more →
  • Data quality

    Data quality

    Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is "fit for [its] intended uses in operations, decision making and planning". Data is deemed of high quality if it correctly represents the real-world construct to which it refers. Apart from these definitions, as the number of data sources increases, the question of internal data consistency becomes significant, regardless of fitness for use for any particular external purpose. People's views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. When this is the case, businesses may adopt recognised international standards for data quality (See #International Standards for Data Quality below). Data governance can also be used to form agreed upon definitions and standards, including international standards, for data quality. In such cases, data cleansing, including standardization, may be required in order to ensure data quality. == Definitions == Defining data quality is difficult due to the many contexts data are used in, as well as the varying perspectives among end users, producers, and custodians of data. From a consumer perspective, data quality is: "data that are fit for use by data consumers" data "meeting or exceeding consumer expectations" data that "satisfies the requirements of its intended use" From a business perspective, data quality is: data that are "'fit for use' in their intended operational, decision-making and other roles" or that exhibits "'conformance to standards' that have been set, so that fitness for use is achieved" data that "are fit for their intended uses in operations, decision making and planning" "the capability of data to satisfy the stated business, system, and technical requirements of an enterprise" From a standards-based perspective, data quality is: the "degree to which a set of inherent characteristics (quality dimensions) of an object (data) fulfills requirements" "the usefulness, accuracy, and correctness of data for its application" Arguably, in all these cases, "data quality" is a comparison of the actual state of a particular set of data to a desired state, with the desired state being typically referred to as "fit for use," "to specification," "meeting consumer expectations," "free of defect," or "meeting requirements." These expectations, specifications, and requirements are usually defined by one or more individuals or groups, standards organizations, laws and regulations, business policies, or software development policies. == Dimensions of data quality == Drilling down further, those expectations, specifications, and requirements are stated in terms of characteristics or dimensions of the data, such as: accessibility or availability accuracy or correctness comparability completeness or comprehensiveness consistency, coherence, or clarity credibility, reliability, or reputation flexibility plausibility relevance, pertinence, or usefulness timeliness or latency uniqueness validity or reasonableness A systematic scoping review of the literature suggests that data quality dimensions and methods with real world data are not consistent in the literature, and as a result quality assessments are challenging due to the complex and heterogeneous nature of these data. == International standards for data quality == ISO 8000 is an international standard for data quality. Managed by the International Organization for Standardization, the ISO 8000 standards address and describe general aspects of data quality including principles, vocabulary and measurement data governance data quality management data quality assessment quality of master data, including exchange of characteristic data and identifiers quality of industrial data == History == Before the rise of the inexpensive computer data storage, massive mainframe computers were used to maintain name and address data for delivery services. This was so that mail could be properly routed to its destination. The mainframes used business rules to correct common misspellings and typographical errors in name and address data, as well as to track customers who had moved, died, gone to prison, married, divorced, or experienced other life-changing events. Government agencies began to make postal data available to a few service companies to cross-reference customer data with the National Change of Address registry (NCOA). This technology saved large companies millions of dollars in comparison to manual correction of customer data. Large companies saved on postage, as bills and direct marketing materials made their way to the intended customer more accurately. Initially sold as a service, data quality moved inside the walls of corporations, as low-cost and powerful server technology became available. Companies with an emphasis on marketing often focused their quality efforts on name and address information, but data quality is recognized as an important property of all types of data. Principles of data quality can be applied to supply chain data, transactional data, and nearly every other category of data found. For example, making supply chain data conform to a certain standard has value to an organization by: 1) avoiding overstocking of similar but slightly different stock; 2) avoiding false stock-out; 3) improving the understanding of vendor purchases to negotiate volume discounts; and 4) avoiding logistics costs in stocking and shipping parts across a large organization. For companies with significant research efforts, data quality can include developing protocols for research methods, reducing measurement error, bounds checking of data, cross tabulation, modeling and outlier detection, verifying data integrity, etc. == Overview == There are a number of theoretical frameworks for understanding data quality. A systems-theoretical approach influenced by American pragmatism expands the definition of data quality to include information quality, and emphasizes the inclusiveness of the fundamental dimensions of accuracy and precision on the basis of the theory of science (Ivanov, 1972). One framework, dubbed "Zero Defect Data" (Hansen, 1991) adapts the principles of statistical process control to data quality. Another framework seeks to integrate the product perspective (conformance to specifications) and the service perspective (meeting consumers' expectations) (Kahn et al. 2002). Another framework is based in semiotics to evaluate the quality of the form, meaning and use of the data (Price and Shanks, 2004). One highly theoretical approach analyzes the ontological nature of information systems to define data quality rigorously (Wand and Wang, 1996). A considerable amount of data quality research involves investigating and describing various categories of desirable attributes (or dimensions) of data. Nearly 200 such terms have been identified and there is little agreement in their nature (are these concepts, goals or criteria?), their definitions or measures (Wang et al., 1993). Software engineers may recognize this as a similar problem to "ilities". MIT has an Information Quality (MITIQ) Program, led by Professor Richard Wang, which produces a large number of publications and hosts a significant international conference in this field (International Conference on Information Quality, ICIQ). This program grew out of the work done by Hansen on the "Zero Defect Data" framework (Hansen, 1991). In practice, data quality is a concern for professionals involved with a wide range of information systems, ranging from data warehousing and business intelligence to customer relationship management and supply chain management. One industry study estimated the total cost to the U.S. economy of data quality problems at over U.S. $600 billion per annum (Eckerson, 2002). Incorrect data – which includes invalid and outdated information – can originate from different data sources – through data entry, or data migration and conversion projects. In 2002, the USPS and PricewaterhouseCoopers released a report stating that 23.6 percent of all U.S. mail sent is incorrectly addressed. One reason contact data becomes stale very quickly in the average database – more than 45 million Americans change their address every year. In fact, the problem is such a concern that companies are beginning to set up a data governance team whose sole role in the corporation is to be responsible for data quality. In some organizations, this data governance function has been established as part of a larger Regulatory Compliance function - a recognition of the importance of Data/Information Quality to organizations. Problems with data quality don't only arise from incorrect data; inconsistent data is a problem as well. Eliminating data shadow systems and centralizing data in a warehouse is one of the initiatives a company can take to ensure data consistency. En

    Read more →
  • Best arm identification

    Best arm identification

    Best arm identification (BAI) is a sequential one-player game where the player has to find the best action (arm) among a list of actions (arms) by collecting information in the most efficient way. It is a multi-armed bandit game as a player only gets information about an arm by playing it. The most common objective in multi-armed bandit games is to minimize the regret (i.e., play the best action as much as possible), but in BAI, the goal is to find the best arm as efficiently as possible. This problem naturally arises in scenarios such as adaptive clinical trials where the number of patients is limited and the quantification of the confidence in a treatment is important. It also arises in hyperparameter optimization where the goal is to find the optimal choice of hyperparameters for an algorithm with the smallest possible number of experiments, as it can be costly in terms of time, energy, or money. == Stochastic multi-armed bandit == The stochastic multi-armed bandit (MAB) is a sequential game with one player and K {\displaystyle K} actions (arms). Each arm has an unknown probability distribution associated with it. At each turn, the player has to choose one action and receive an observation from the probability distribution associated with the arm. The more you play an arm, the more you get information on its probability distribution. === Best arm identification === In BAI the goal is to find the arm that has the probability distribution with the highest mean. BAI may be either fixed confidence or fixed horizon. In a fixed-confidence game, a confidence level δ {\displaystyle \delta } is fixed at the beginning of the game and the goal is to find the best arm with this confidence level in as few turns as possible. In a fixed horizon game, the number of turns T {\displaystyle T} is fixed, and the goal is to find the best arm with the highest possible confidence in T {\displaystyle T} turns. === Math formalisation === We have one player and K {\displaystyle K} actions (arms). Behind each arm k ∈ { 1 , … , K } {\displaystyle k\in \{1,\ldots ,K\}} lies an unknown distribution ν k {\displaystyle \nu _{k}} with mean μ k {\displaystyle \mu _{k}} . Each distribution ν k {\displaystyle \nu _{k}} belongs to a known family D {\displaystyle {\mathcal {D}}} (such as the set of Gaussian distributions or Bernoulli distributions). At each time step t {\displaystyle t} , the player selects an arm a t {\displaystyle a_{t}} and observes an independent sample X t ∼ ν a t {\displaystyle X_{t}\sim \nu _{a_{t}}} from the corresponding distribution. We will note μ ∗ := max μ a {\displaystyle \mu ^{}:=\max \mu _{a}} the highest mean. An arm a {\displaystyle a} that satisfies μ a = μ ∗ {\displaystyle \mu _{a}=\mu ^{}} is called an optimal arm; otherwise it is called suboptimal arm. In best arm identification (BAI) the objective is to identify an optimal arm. Two main settings for BAI appear in the literature: Fixed confidence: In this setting, one typically assumes that there exists a unique optimal arm. A confidence level δ ∈ ( 0 , 1 ) {\displaystyle \delta \in (0,1)} is specified at the beginning. The algorithm must stop at some finite stopping time τ δ < + ∞ {\displaystyle \tau _{\delta }<+\infty } and return an arm a ^ τ δ {\displaystyle {\hat {a}}_{\tau _{\delta }}} such that the probability of error is bounded: P ( a ^ τ δ ≠ a ∗ ) ≤ δ {\displaystyle \mathbb {P} ({\hat {a}}_{\tau _{\delta }}\neq a^{})\leq \delta } . The objective is to minimize the expected sample complexity E [ τ δ ] {\displaystyle \mathbb {E} [\tau _{\delta }]} . Such a setting appears, for example, when a constraint on the confidence is required (for example, if we require a confidence level of 95%, so δ = 1 − 0.95 = 0.05 {\displaystyle \delta =1-0.95=0.05} ). Fixed horizon: In this setting, the number of samples T {\displaystyle T} is fixed in advance. The goal is to design an algorithm that minimizes the probability of misidentifying the optimal arm: P ( a ^ T ≠ a ∗ ) {\displaystyle \mathbb {P} ({\hat {a}}_{T}\neq a^{})} . This setting appears when the number of experiments is limited (for drug tests, the number of patients can be fixed in advance). === Example of simple modelling === In the case where we have K {\displaystyle K} treatments and we want to be sure with a confidence level of 95% which treatment is the best to heal a specific disease. Each treatment heals or does not heal the disease with a probability μ k {\displaystyle \mu _{k}} , which means that each distribution is a Bernoulli distribution, so D {\displaystyle {\mathcal {D}}} is the set of Bernoulli distributions. We can use a BAI algorithm to minimize E [ τ 0.05 ] {\displaystyle \mathbb {E} [\tau _{0.05}]} , the number of patients required to find the best treatment with probability 95%. == Applications == Best arm identification naturally arises in several practical domains: Adaptive clinical trials: The objective is to identify the most effective treatment based on sequentially collected patient data. Each treatment can be modeled as having an underlying distribution of outcomes. The goal is to identify the treatment with the highest expected outcome with high confidence (fixed confidence setting δ {\displaystyle \delta } ) while minimizing the number of drug test patients (minimise E [ τ δ ] {\displaystyle \mathbb {E} [\tau _{\delta }]} ), as it costs to pay patients for this and we would like to use as little as possible less effective drugs. Hyperparameter tuning: Selecting the best configuration for machine learning models efficiently by treating each hyperparameter setting as an arm. The goal is to find the best hyperparameter with as few experiments possible as experiments are costly in time and in energy == Fixed confidence level == In the fixed-confidence setting, the goal is to design an algorithm that identifies the best arm with a prescribed confidence level δ {\displaystyle \delta } while minimizing the expected number of samples. Any such algorithm requires two key components: Stopping rule: A decision criterion that determines when to stop sampling. Formally, this defines a stopping time τ δ {\displaystyle \tau _{\delta }} and returns an arm a ^ τ δ {\displaystyle {\hat {a}}_{\tau _{\delta }}} such that P ( a ^ τ δ ≠ a ⋆ ) ≤ δ {\displaystyle \mathbb {P} ({\hat {a}}_{\tau _{\delta }}\neq a^{\star })\leq \delta } and P ( τ δ < + ∞ ) = 1 {\displaystyle \mathbb {P} (\tau _{\delta }<+\infty )=1} . Sampling rule: A policy π {\displaystyle \pi } that, at each round t {\displaystyle t} , selects the next arm to sample a t {\displaystyle a_{t}} based on all previous observations ( a s , X s ) s < t {\displaystyle (a_{s},X_{s})_{s Read more →

  • Energy informatics

    Energy informatics

    Energy informatics is a research field covering the use of information and communication technology to address energy utilization and management challenges. Methods used for "smart" implementations often combine IoT sensors with artificial intelligence and machine learning. Energy Informatics is founded on flow networks that are the major suppliers and consumers of energy. Their efficiency can be improved by collecting and analyzing information. == Application areas == The field among other consider application areas within: Smart Buildings by developing ICT-centred solutions for improving the energy-efficiency of buildings. Smart Cities by investigating the synergies between demand patterns and supply availability of energy flows in cities and communities to improve energy efficiency, increase integration of renewable sources, and provide resilience towards system faults caused by extreme situations, like hurricanes and flooding. Smart Industries including the development of ICT-centred solutions for improving the energy efficiency and predictability of energy intensive industrial processes, without compromising process and product quality. Smart Energy Networks by developing ICT-centred solutions for coordinating the supply and demand in environmentally sustainable energy networks.

    Read more →
  • Medical data breach

    Medical data breach

    Medical data, including patients' identity information, health status, disease diagnosis and treatment, and biogenetic information, not only involve patients' privacy but also have a special sensitivity and important value, which may bring physical and mental distress and property loss to patients and even negatively affect social stability and national security once leaked. However, the development and application of medical AI must rely on a large amount of medical data for algorithm training, and the larger and more diverse the amount of data, the more accurate the results of its analysis and prediction will be. However, the application of big data technologies such as data collection, analysis and processing, cloud storage, and information sharing has increased the risk of data leakage. In the United States, the rate of such breaches has increased over time, with 176 million records breached by the end of 2017. By 2024, the U.S. Department of Health and Human Services reported 725 large healthcare data breaches affecting approximately 275 million individual records in a single year, marking a significant escalation in both the frequency and scale of incidents. == Black market for health data == In February 2015 an NPR report claimed that organized crime networks had ways of selling health data in the black market. In 2015 a Beazley employee estimated that medical records could sell on the black market for US$40-50. == How data is lost == Theft, data loss, hacking, and unauthorized account access are ways in which medical data breaches happen. Among reported breaches of medical information in the United States networked information systems accounted for the largest number of records breached. There are many data breaches happening in the US health care system, among business associates of the health care providers that continuously gain access to patients' data. == List of data breaches == In February 2024, a ransomware attack on Change Healthcare, a subsidiary of UnitedHealth Group, compromised the protected health information of approximately 100 million individuals, making it the largest healthcare data breach in United States history. The attack disrupted claims processing for healthcare providers nationwide for several weeks. In May 2024, MediSecure suffered a cyberattack involving ransomware in Australia. In May 2021, the Health Service Executive in the Republic of Ireland was the victim of a cyberattack involving ransomware, in the Health Service Executive cyberattack, with admission records and test results present in a sample of the data reviewed by the Financial Times. In October 2018, the Centers for Medicare and Medicaid Services in the US reported that around 75,000 individual records had been affected by a data breach that took place through the ACA Agent and Broker Portal. In 2018, Social Indicators Research published the scientific evidence of 173,398,820 (over 173 million) individuals affected in USA from October 2008 (when the data were collected) to September 2017 (when the statistical analysis took place). In 2015, Anthem Inc. lost data for 37 million people in the Anthem medical data breach In 2014 4.5 million people using Complete Health Systems had their data stolen In 2013-14 1 million people using Montana Department of Public Health and Human Services had their data stolen In 2013 4 million people using Advocate Health and Hospitals Corporation had their data stolen In 2011 4.9 million users of Tricare services had their data stolen due to an employee error by Science Applications International Corporation In 2011 1.9 million people using Health Net had their data stolen In 2011 1 million people using Nemours Foundation had their data stolen In 2010 6800 people using New York-Presbyterian Hospital and Columbia University Medical Center had their data breached. In response, those organizations agreed to pay the United States Department of Health and Human Services a US$4.8 million dollar fine. In 2009 1 million people using BlueCross BlueShield of Tennessee had their data stolen == Regulation == In the United States, the Health Insurance Portability and Accountability Act and Health Information Technology for Economic and Clinical Health Act require companies to report data breaches to affected individuals and the federal government. Under the HIPAA Breach Notification Rule, covered entities must notify affected individuals without unreasonable delay and no later than 60 days after discovering a breach of unsecured protected health information. Breaches affecting 500 or more individuals must also be reported to the HHS Secretary and to prominent media outlets serving the affected state or jurisdiction within the same timeframe; HHS publicly lists these larger breaches on its breach portal, commonly known as the "wall of shame." Breaches affecting fewer than 500 individuals are reported to HHS annually, no later than 60 days after the end of the calendar year in which they were discovered. Health Information Privacy Health Insurance Portability and Accountability Act of 1996 (HIPAA). - 45 CFR Parts 160 and 164, Standards for Privacy of Individually Identifiable Health Information and Security Standards for the Protection of Electronic Protected Health Information. HIPAA includes provisions designed to save health care businesses money by encouraging electronic transactions, as well as regulations to protect the security and confidentiality of patient information. The Privacy Rule became effective April 14, 2001, and most covered entities (health plans, health care clearinghouses, and health care providers that conduct certain financial and administrative transactions electronically) had until April 2003 to comply. This security provision became effective April 21, 2003. The Health Insurance Portability and Accountability Act (HIPAA) is the baseline set of federal regulations governing medical information. It does three things: i. i. i.Establish a structure for how personal health information is disclosed and establish the rights of individuals with respect to health information; ii.Specify security standards for the retention and transmission of electronic patient information; iii.Need a common format and data structure for the electronic exchange of health information. California-Specific Laws California’s medical privacy laws, primarily the Confidentiality of Medical Information Act (CMIA), the data breach sections of the Civil Code, and sections of the Health and Safety Code, provide HIPAA-like protections, although the terminology is different. HIPAA establishes a federal "minimum standard" that applies where there are gaps in California law, and HIPAA also specifies that stricter state laws will override or supersede HIPAA. California's health care privacy laws apply to providers who provide personal health records (PHR), while HIPAA only applies when the provider providing the PHR is a business associate of a covered entity. Federal law does not grant individuals the right to file a lawsuit in the event of a data breach (only the Attorney General can file a lawsuit), but California law does. This means that California law sets a higher standard for medical privacy, and that individuals in California enjoy stronger legal protections and more ways to hold entities that violate their medical privacy accountable. In the UK, the legal framework for how patient data is cared for and processed is the Data Protection Act 2018 (DPA), which incorporates the EU General Data Protection Regulation (GDPR) into law, and the common law duty of confidentiality (CLDC). The data protection legislation requires that the collection and processing of personal data be fair, lawful and transparent. This means that the collection and processing of data as defined by data protection legislation must always have a valid lawful basis and must also meet the requirements of the CLDC. In the China, Article 18 of the "National Health Care Big Data Standards, Security and Services Management Measures (for Trial Implementation)" (National Health Planning and Development (2018) No. 23) promulgated by the National Health Care Commission in 2018 states, "The responsible unit shall adopt measures such as data classification, important data backup, and encryption authentication to guarantee the security of health care big data." However, the scope and definition of important data are not covered. Although the "Information Security Technology-Healthcare Data Security Guide" (the "Guide") issued by the National Standardization Committee also proposes that important data should be evaluated and approved in accordance with the regulations, there is likewise no definition of the connotation and definition of important data.

    Read more →
  • Semantic query

    Semantic query

    Semantic queries allow for queries and analytics of associative and contextual nature. Semantic queries enable the retrieval of both explicitly and implicitly derived information based on syntactic, semantic and structural information contained in data. They are designed to deliver precise results (possibly the distinctive selection of one single piece of information) or to answer more fuzzy and wide open questions through pattern matching and digital reasoning. Semantic queries work on named graphs, linked data or triples. This enables the query to process the actual relationships between information and infer the answers from the network of data. This is in contrast to semantic search, which uses semantics (meaning of language constructs) in unstructured text to produce a better search result. (See natural language processing.) From a technical point of view, semantic queries are precise relational-type operations much like a database query. They work on structured data and therefore have the possibility to utilize comprehensive features like operators (e.g. >, < and =), namespaces, pattern matching, subclassing, transitive relations, semantic rules and contextual full text search. The semantic web technology stack of the W3C is offering SPARQL to formulate semantic queries in a syntax similar to SQL. Semantic queries are used in triplestores, graph databases, semantic wikis, natural language and artificial intelligence systems. == Background == Relational databases represent all relationships between data in an implicit manner only. For example, the relationships between customers and products (stored in two content-tables and connected with an additional link-table) only come into existence in a query statement (SQL in the case of relational databases) written by a developer. Writing the query demands exact knowledge of the database schema. Linked-Data represent all relationships between data in an explicit manner. In the above example, no query code needs to be written. The correct product for each customer can be fetched automatically. Whereas this simple example is trivial, the real power of linked-data comes into play when a network of information is created (customers with their geo-spatial information like city, state and country; products with their categories within sub- and super-categories). Now the system can automatically answer more complex queries and analytics that look for the connection of a particular location with a product category. The development effort for this query is omitted. Executing a semantic query is conducted by walking the network of information and finding matches (also called Data Graph Traversal). Another important aspect of semantic queries is that the type of the relationship can be used to incorporate intelligence into the system. The relationship between a customer and a product has a fundamentally different nature than the relationship between a neighbourhood and its city. The latter enables the semantic query engine to infer that a customer living in Manhattan is also living in New York City whereas other relationships might have more complicated patterns and "contextual analytics". This process is called inference or reasoning and is the ability of the software to derive new information based on given facts. == Articles == Velez, Golda (2008). "Semantics Help Wall Street Cope With Data Overload". Wall Street & Technology. wallstreetandtech.com. Zhifeng, Xiao (2009). "Spatial information semantic query based on SPARQL". In Liu, Yaolin; Tang, Xinming (eds.). International Symposium on Spatial Analysis, Spatial-Temporal Data Modeling, and Data Mining. Vol. 7492. SPIE. pp. 74921P. Bibcode:2009SPIE.7492E..60X. doi:10.1117/12.838556. S2CID 62191842. Aquin, Mathieu (2010). "Watson, more than a Semantic Web search engine" (PDF). Semantic Web Journal. Dworetzky, Tom (2011). "How Siri Works: iPhone's 'Brain' Comes from Natural Language Processing". International Business Times. Horwitt, Elisabeth (2011). "The semantic Web gets down to business". computerworld.com. Rodriguez, Marko (2011). "Graph Pattern Matching with Gremlin". Marko A. Rodriguez. markorodriguez.com on Graph Computing. Sequeda, Juan (2011). "SPARQL Nuts & Bolts". Cambridge Semantics. Freitas, Andre (2012). "Querying Heterogeneous Datasets on the Linked Data Web" (PDF). IEEE Internet Computing. Kauppinen, Tomi (2012). "Using the SPARQL Package in R to handle Spatial Linked Data". linkedscience.org. Lorentz, Alissa (2013). "With Big Data, Context is a Big Issue". Wired.

    Read more →
  • Predictor–corrector method

    Predictor–corrector method

    In numerical analysis, predictor–corrector methods belong to a class of algorithms designed to integrate ordinary differential equations – to find an unknown function that satisfies a given differential equation. All such algorithms proceed in two steps: The initial, "prediction" step, starts from a function fitted to the function-values and derivative-values at a preceding set of points to extrapolate ("anticipate") this function's value at a subsequent, new point. The next, "corrector" step refines the initial approximation by using the predicted value of the function and another method to interpolate that unknown function's value at the same subsequent point. == Predictor–corrector methods for solving ODEs == When considering the numerical solution of ordinary differential equations (ODEs), a predictor–corrector method typically uses an explicit method for the predictor step and an implicit method for the corrector step. === Example: Euler method with the trapezoidal rule === A simple predictor–corrector method (known as Heun's method) can be constructed from the Euler method (an explicit method) and the trapezoidal rule (an implicit method). Consider the differential equation y ′ = f ( t , y ) , y ( t 0 ) = y 0 , {\displaystyle y'=f(t,y),\quad y(t_{0})=y_{0},} and denote the step size by h {\displaystyle h} . First, the predictor step: starting from the current value y i {\displaystyle y_{i}} , calculate an initial guess value y ~ i + 1 {\displaystyle {\tilde {y}}_{i+1}} via the Euler method, y ~ i + 1 = y i + h f ( t i , y i ) . {\displaystyle {\tilde {y}}_{i+1}=y_{i}+hf(t_{i},y_{i}).} Next, the corrector step: improve the initial guess using trapezoidal rule, y i + 1 = y i + 1 2 h ( f ( t i , y i ) + f ( t i + 1 , y ~ i + 1 ) ) . {\displaystyle y_{i+1}=y_{i}+{\tfrac {1}{2}}h{\bigl (}f(t_{i},y_{i})+f(t_{i+1},{\tilde {y}}_{i+1}){\bigr )}.} That value is used as the next step. === PEC mode and PECE mode === There are different variants of a predictor–corrector method, depending on how often the corrector method is applied. The Predict–Evaluate–Correct–Evaluate (PECE) mode refers to the variant in the above example: y ~ i + 1 = y i + h f ( t i , y i ) , y i + 1 = y i + 1 2 h ( f ( t i , y i ) + f ( t i + 1 , y ~ i + 1 ) ) . {\displaystyle {\begin{aligned}{\tilde {y}}_{i+1}&=y_{i}+hf(t_{i},y_{i}),\\y_{i+1}&=y_{i}+{\tfrac {1}{2}}h{\bigl (}f(t_{i},y_{i})+f(t_{i+1},{\tilde {y}}_{i+1}){\bigr )}.\end{aligned}}} It is also possible to evaluate the function f only once per step by using the method in Predict–Evaluate–Correct (PEC) mode: y ~ i + 1 = y i + h f ( t i , y ~ i ) , y i + 1 = y i + 1 2 h ( f ( t i , y ~ i ) + f ( t i + 1 , y ~ i + 1 ) ) . {\displaystyle {\begin{aligned}{\tilde {y}}_{i+1}&=y_{i}+hf(t_{i},{\tilde {y}}_{i}),\\y_{i+1}&=y_{i}+{\tfrac {1}{2}}h{\bigl (}f(t_{i},{\tilde {y}}_{i})+f(t_{i+1},{\tilde {y}}_{i+1}){\bigr )}.\end{aligned}}} Additionally, the corrector step can be repeated in the hope that this achieves an even better approximation to the true solution. If the corrector method is run twice, this yields the PECECE mode: y ~ i + 1 = y i + h f ( t i , y i ) , y ^ i + 1 = y i + 1 2 h ( f ( t i , y i ) + f ( t i + 1 , y ~ i + 1 ) ) , y i + 1 = y i + 1 2 h ( f ( t i , y i ) + f ( t i + 1 , y ^ i + 1 ) ) . {\displaystyle {\begin{aligned}{\tilde {y}}_{i+1}&=y_{i}+hf(t_{i},y_{i}),\\{\hat {y}}_{i+1}&=y_{i}+{\tfrac {1}{2}}h{\bigl (}f(t_{i},y_{i})+f(t_{i+1},{\tilde {y}}_{i+1}){\bigr )},\\y_{i+1}&=y_{i}+{\tfrac {1}{2}}h{\bigl (}f(t_{i},y_{i})+f(t_{i+1},{\hat {y}}_{i+1}){\bigr )}.\end{aligned}}} The PECEC mode has one fewer function evaluation than PECECE mode. More generally, if the corrector is run k times, the method is in P(EC)k or P(EC)kE mode. If the corrector method is iterated until it converges, this could be called PE(CE)∞.

    Read more →