AI Content Editor

AI Content Editor — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Qlone

    Qlone

    Qlone is a 3D scanning app based on photogrammetry for creation of 3D models on mobile devices. The resultant 3D models can be exported for external use. Qlone was featured at the Apple Worldwide Developers Conference in 2021. It was also featured on BBC Click. == Qlone features == === 3D scanning === 3D scanning with Qlone requires the use of an included mat design. The user prints the mat onto a sheet of paper, then places the object to be scanned in the centre of the mat. An augmented reality dome within the Qlone app guides the user through the subsequent scanning process. The iOS version of Qlone allows scanning without the mat. === 3D editing === Qlone's editing features allow users to adjust 3D scanned models using texture mapping, polygon mesh size simplification, digital sculpting, cleaning and smoothing, and artistic effects. === File export === Qlone exports directly to multiple 3D platforms including SketchFab, i.materialise, Lens Studio for Snapchat, Shapeways and CGTrader. Models can also be exported in different 3D formats for use in other 3D tools – OBJ, STL, FBX, USDZ, GLB (Binary gLTF), PLY, and X3D. == Use in Science, Education and Academia == Due to its inexpensive, simple and accessible nature for creating 3D models, Qlone was used in many academically educational and scientific research projects. The European Space Agency used Qlone to scan rocks in a Tele-Robotic rock collection experiment. Neurosurgeons from the University of Southern California and surgeons from Tulane University School of Medicine used Qlone to create 3D models of cadaveric specimens and anatomical models with the aim of increasing access to such components for enhancing anatomy training and allowing realistic surgical simulations for neurosurgeons and practitioners worldwide. Archaeologists from Texas A&M University used Qlone to create 3D replicas of artifacts and models and students from Vancouver iTech Preparatory Middle School used Qlone to create 3D scans of more than 100 artifacts from Fort Vancouver National Historic Site.

    Read more →
  • SQL programming tool

    SQL programming tool

    In the field of software, SQL programming tools provide platforms for database administrators (DBAs) and application developers to perform daily tasks efficiently and accurately. Database administrators and application developers often face constantly changing environments which they rarely completely control. Many changes result from new development projects or from modifications to existing code, which, when deployed to production, do not always produce the expected result. For organizations to better manage development projects and the teams that develop code, suppliers of SQL programming tools normally provide more than facility to the database administrator or application developer to aid in database management and in quality code-deployment practices. == Features == SQL programming tools may include the following features: === SQL editing === SQL editors allow users to edit and execute SQL statements. They may support the following features: cut, copy, paste, undo, redo, find (and replace), bookmarks block indent, print, save file, uppercase/lowercase keyword highlighting auto-completion access to frequently used files output of query result editing query-results committing and rolling-back transactions inside cut paper === Object browsing === Tools may display information about database objects relevant to developers or to database administrators. Users may: view object descriptions view object definitions (DDL) create database objects enable and disable triggers and constraints recompile valid or invalid objects query or edit tables and views Some tools also provide features to display dependencies among objects, and allow users to expand these dependent objects recursively (for example: packages may reference views, views generally reference tables, super/subtypes, and so on). === Session browsing === Database administrators and application developers can use session browsing tools to view the current activities of each user in the database. They can check the resource-usage of individual users, statistics information, locked objects and the current running SQL of each individual session. === User-security management === DBAs can create, edit, delete, disable or enable user-accounts in the database using security-management tools. DBAs can also assign roles, system privileges, object privileges, and storage-quotas to users. === Debugging === Some tools offer features for the debugging of stored procedures: step in, step over, step out, run until exception, breakpoints, view & set variables, view call stack, and so on. Users can debug any program-unit without making any modification to it, including triggers and object types. === Performance monitoring === Monitoring tools may show the database resources — usage summary, service time summary, recent activities, top sessions, session history or top SQL — in easy-to-read graphs. Database administrators can easily monitor the health of various components in the monitoring instance. Application developers may also make use of such tools to diagnose and correct application-performance problems as well as improve SQL server performance. === Test data === Test data generation tools can populate the database by realistic test data for server or client side testing purposes. Also, this kind of software can upload sample blob files to database.

    Read more →
  • Bibliographic database

    Bibliographic database

    A bibliographic database is a database of bibliographic records. This is an organised online collection of references to published written works like journal and newspaper articles, conference proceedings, reports, government and legal publications, patents and books. In contrast to library catalogue entries, a majority of the records in bibliographic databases describe articles and conference papers rather than complete monographs, and they generally contain very rich subject descriptions in the form of keywords, subject classification terms, or abstracts. A bibliographic database may cover a wide range of topics or one academic field like computer science. A significant number of bibliographic databases are marketed under a trade name by licensing agreement from vendors, or directly from their makers: the indexing and abstracting services. Many bibliographic databases have evolved into digital libraries, providing the full text of the organised contents:for instance CORE also organises and mirrors scholarly articles and OurResearch develops a search engine for open access content in Unpaywall. Others merge with non-bibliographic and scholarly databases to create more complete disciplinary search engine systems, such as Chemical Abstracts or Entrez. == History == Prior to the mid-20th century, individuals searching for published literature had to rely on printed bibliographic indexes, generated manually from index cards. During the early 1960s computers were used to digitize text for the first time; the purpose was to reduce the cost and time required to publish two American abstracting journals, the Index Medicus of the National Library of Medicine and the Scientific and Technical Aerospace Reports of the National Aeronautics and Space Administration (NASA). By the late 1960s, such bodies of digitized alphanumeric information, known as bibliographic and numeric databases, constituted a new type of information resource. Online interactive retrieval became commercially viable in the early 1970s over private telecommunications networks. The first services offered a few databases of indexes and abstracts of scholarly literature. These databases contained bibliographic descriptions of journal articles that were searchable by keywords in author and title, and sometimes by journal name or subject heading. The user interfaces were crude, the access was expensive, and searching was done by librarians on behalf of "end users".

    Read more →
  • Higuchi dimension

    Higuchi dimension

    In fractal geometry, the Higuchi dimension (or Higuchi fractal dimension (HFD)) is an approximate value for the box-counting dimension of the graph of a real-valued function or time series. This value is obtained via an algorithmic approximation so one also talks about the Higuchi method. It has many applications in science and engineering and has been applied to subjects like characterizing primary waves in seismograms, clinical neurophysiology and analyzing changes in the electroencephalogram in Alzheimer's disease. == Formulation of the method == The original formulation of the method is due to T. Higuchi. Given a time series X : { 1 , … , N } → R {\displaystyle X:\{1,\dots ,N\}\to \mathbb {R} } consisting of N {\displaystyle N} data points and a parameter k m a x ≥ 2 {\displaystyle k_{\mathrm {max} }\geq 2} the Higuchi Fractal dimension (HFD) of X {\displaystyle X} is calculated in the following way: For each k ∈ { 1 , … , k m a x } {\displaystyle k\in \{1,\dots ,k_{\mathrm {max} }}\} and m ∈ { 1 , … , k } {\displaystyle m\in \{1,\dots ,k}\} define the length L m ( k ) {\displaystyle L_{m}(k)} by L m ( k ) = N − 1 ⌊ N − m k ⌋ k 2 ∑ i = 1 ⌊ N − m k ⌋ | X N ( m + i k ) − X N ( m + ( i − 1 ) k ) | . {\displaystyle L_{m}(k)={\frac {N-1}{\lfloor {\frac {N-m}{k}}\rfloor k^{2}}}\sum _{i=1}^{\lfloor {\frac {N-m}{k}}\rfloor }|X_{N}(m+ik)-X_{N}(m+(i-1)k)|.} The length L ( k ) {\displaystyle L(k)} is defined by the average value of the k {\displaystyle k} lengths L 1 ( k ) , … , L k ( k ) {\displaystyle L_{1}(k),\dots ,L_{k}(k)} , L ( k ) = 1 k ∑ m = 1 k L m ( k ) . {\displaystyle L(k)={\frac {1}{k}}\sum _{m=1}^{k}L_{m}(k).} The slope of the best-fitting linear function through the data points { ( log ⁡ 1 k , log ⁡ L ( k ) ) } {\displaystyle \left\{\left(\log {\frac {1}{k}},\log L(k)\right)\right\}} is defined to be the Higuchi fractal dimension of the time-series X {\displaystyle X} . == Application to functions == For a real-valued function f : [ 0 , 1 ] → R {\displaystyle f:[0,1]\to \mathbb {R} } one can partition the unit interval [ 0 , 1 ] {\displaystyle [0,1]} into N {\displaystyle N} equidistantly intervals [ t j , t j + 1 ) {\displaystyle [t_{j},t_{j+1})} and apply the Higuchi algorithm to the times series X ( j ) = f ( t j ) {\displaystyle X(j)=f(t_{j})} . This results into the Higuchi fractal dimension of the function f {\displaystyle f} . It was shown that in this case the Higuchi method yields an approximation for the box-counting dimension of the graph of f {\displaystyle f} as it follows a geometrical approach (see Liehr & Massopust 2020). == Robustness and stability == Applications to fractional Brownian functions and the Weierstrass function reveal that the Higuchi fractal dimension can be close to the box-dimension. On the other hand, the method can be unstable in the case where the data X ( 1 ) , … , X ( N ) {\displaystyle X(1),\dots ,X(N)} are periodic or if subsets of it lie on a horizontal line (see Liehr & Massopust 2020).

    Read more →
  • Connected-component labeling

    Connected-component labeling

    Connected-component labeling (CCL), connected-component analysis (CCA), blob extraction, region labeling, blob discovery, or region extraction is an algorithmic application of graph theory, where subsets of connected components are uniquely labeled based on a given heuristic. Connected-component labeling is not to be confused with segmentation. Connected-component labeling is used in computer vision to detect connected regions in binary digital images, although color images and data with higher dimensionality can also be processed. When integrated into an image recognition system or human-computer interaction interface, connected component labeling can operate on a variety of information. Blob extraction is generally performed on the resulting binary image from a thresholding step, but it can be applicable to gray-scale and color images as well. Blobs may be counted, filtered, and tracked. Blob extraction is related to but distinct from blob detection. == Overview == A graph, containing vertices and connecting edges, is constructed from relevant input data. The vertices contain information required by the comparison heuristic, while the edges indicate connected 'neighbors'. An algorithm traverses the graph, labeling the vertices based on the connectivity and relative values of their neighbors. Connectivity is determined by the medium; image graphs, for example, can be 4-connected neighborhood or 8-connected neighborhood. Following the labeling stage, the graph may be partitioned into subsets, after which the original information can be recovered and processed . == Definition == The usage of the term connected-component labeling (CCL) and its definition is quite consistent in the academic literature, whereas connected-component analysis (CCA) varies both in terminology and in its definition of the problem. Rosenfeld et al. define connected components labeling as the “[c]reation of a labeled image in which the positions associated with the same connected component of the binary input image have a unique label.” Shapiro et al. define CCL as an operator whose “input is a binary image and [...] output is a symbolic image in which the label assigned to each pixel is an integer uniquely identifying the connected component to which that pixel belongs.” There is no consensus on the definition of CCA in the academic literature. It is often used interchangeably with CCL. A more extensive definition is given by Shapiro et al.: “Connected component analysis consists of connected component labeling of the black pixels followed by property measurement of the component regions and decision making.” The definition for connected-component analysis presented here is more general, taking the thoughts expressed in into account. == Algorithms == The algorithms discussed can be generalised to arbitrary dimensions, albeit with increased time and space complexity. === One component at a time === This is a fast and very simple method to implement and understand. It is based on graph traversal methods in graph theory. In short, once the first pixel of a connected component is found, all the connected pixels of that connected component are labelled before going onto the next pixel in the image. This algorithm is part of Vincent and Soille's watershed segmentation algorithm, other implementations also exist. In order to do that a linked list is formed that will keep the indexes of the pixels that are connected to each other, steps (2) and (3) below. The method of defining the linked list specifies the use of a depth or a breadth first search. For this particular application, there is no difference which strategy to use. The simplest kind of a last in first out queue implemented as a singly linked list will result in a depth first search strategy. It is assumed that the input image is a binary image, with pixels being either background or foreground and that the connected components in the foreground pixels are desired. The algorithm steps can be written as: Start from the first pixel in the image. Set current label to 1. Go to (2). If this pixel is a foreground pixel and it is not already labelled, give it the current label and add it as the first element in a queue, then go to (3). If it is a background pixel or it was already labelled, then repeat (2) for the next pixel in the image. Pop out an element from the queue, and look at its neighbours (based on any type of connectivity). If a neighbour is a foreground pixel and is not already labelled, give it the current label and add it to the queue. Repeat (3) until there are no more elements in the queue. Go to (2) for the next pixel in the image and increment current label by 1. Note that the pixels are labelled before being put into the queue. The queue will only keep a pixel to check its neighbours and add them to the queue if necessary. This algorithm only needs to check the neighbours of each foreground pixel once and doesn't check the neighbours of background pixels. The pseudocode is: algorithm OneComponentAtATime(data) input : imageData[xDim][yDim] initialization : label = 0, labelArray[xDim][yDim] = 0, statusArray[xDim][yDim] = false, queue1, queue2; for i = 0 to xDim do for j = 0 to yDim do if imageData[i][j] has not been processed do if imageData[i][j] is a foreground pixel do check its four neighbors(north, south, east, west) : if neighbor is not processed do if neighbor is a foreground pixel do add it to queue1 else update its status to processed end if labelArray[i][j] = label (give label) statusArray[i][j] = true (update status) while queue1 is not empty do For each pixel in the queue do : check its four neighbors if neighbor is not processed do if neighbor is a foreground pixel do add it to queue2 else update its status to processed end if give it the current label update its status to processed remove the current element from queue1 copy queue2 into queue1 end While increase the label end if else update its status to processed end if end if end if end for end for === Two-pass === Relatively simple to implement and understand, the two-pass algorithm, (also known as the Hoshen–Kopelman algorithm) iterates through 2-dimensional binary data. The algorithm makes two passes over the image: the first pass to assign temporary labels and record equivalences, and the second pass to replace each temporary label by the smallest label of its equivalence class. The input data can be modified in situ (which carries the risk of data corruption), or labeling information can be maintained in an additional data structure. Connectivity checks are carried out by checking neighbor pixels' labels (neighbor elements whose labels are not assigned yet are ignored), or say, the north-east, the north, the north-west and the west of the current pixel (assuming 8-connectivity). 4-connectivity uses only north and west neighbors of the current pixel. The following conditions are checked to determine the value of the label to be assigned to the current pixel (4-connectivity is assumed) Conditions to check: Does the pixel to the left (west) have the same value as the current pixel? Yes – We are in the same region. Assign the same label to the current pixel No – Check next condition Do both pixels to the north and west of the current pixel have the same value as the current pixel but not the same label? Yes – We know that the north and west pixels belong to the same region and must be merged. Assign the current pixel the minimum of the north and west labels, and record their equivalence relationship No – Check next condition Does the pixel to the left (west) have a different value and the one to the north the same value as the current pixel? Yes – Assign the label of the north pixel to the current pixel No – Check next condition Do the pixel's north and west neighbors have different pixel values than current pixel? Yes – Create a new label id and assign it to the current pixel The algorithm continues this way, and creates new region labels whenever necessary. The key to a fast algorithm, however, is how this merging is done. This algorithm uses the union-find data structure which provides excellent performance for keeping track of equivalence relationships. Union-find essentially stores labels which correspond to the same blob in a disjoint-set data structure, making it easy to remember the equivalence of two labels by the use of an interface method E.g.: findSet(l). findSet(l) returns the minimum label value that is equivalent to the function argument 'l'. Once the initial labeling and equivalence recording is completed, the second pass merely replaces each pixel label with its equivalent disjoint-set representative element. A faster-scanning algorithm for connected-region extraction is presented below. On the first pass: Iterate through each element of the data by column, then by row (Raster Scanning) If the element is not the background Get the neighboring elements of the current element If there are no neighbors, uniquely

    Read more →
  • Sardinas–Patterson algorithm

    Sardinas–Patterson algorithm

    In coding theory, the Sardinas–Patterson algorithm is a classical algorithm for determining in polynomial time whether a given variable-length code is uniquely decodable, named after August Albert Sardinas and George W. Patterson, who published it in 1953. The algorithm carries out a systematic search for a string which admits two different decompositions into codewords. As Knuth reports, the algorithm was rediscovered about ten years later in 1963 by Floyd, despite the fact that it was at the time already well known in coding theory. == Idea of the algorithm == Consider the code { a ↦ 1 , b ↦ 011 , c ↦ 01110 , d ↦ 1110 , e ↦ 10011 } {\displaystyle \{\,{\texttt {a}}\mapsto {\texttt {1}},{\texttt {b}}\mapsto {\texttt {011}},{\texttt {c}}\mapsto {\texttt {01110}},{\texttt {d}}\mapsto {\texttt {1110}},{\texttt {e}}\mapsto {\texttt {10011}}\,\}} . This code, which is based on an example by Berstel, is an example of a code which is not uniquely decodable, since the string 011101110011 can be interpreted as the sequence of codewords 01110 – 1110 – 011, but also as the sequence of codewords 011 – 1 – 011 – 10011. Two possible decodings of this encoded string are thus given by cdb and babe. In general, a codeword can be found by the following idea: In the first round, we choose two codewords x 1 {\displaystyle x_{1}} and y 1 {\displaystyle y_{1}} such that x 1 {\displaystyle x_{1}} is a prefix of y 1 {\displaystyle y_{1}} , that is, x 1 w = y 1 {\displaystyle x_{1}w=y_{1}} for some "dangling suffix" w {\displaystyle w} . If one tries first x 1 = 011 {\displaystyle x_{1}={\texttt {011}}} and y 1 = 01110 {\displaystyle y_{1}={\texttt {01110}}} , the dangling suffix is w = 10 {\displaystyle {\texttt {w}}={\texttt {10}}} . If we manage to find two sequences x 2 , … , x p {\displaystyle x_{2},\ldots ,x_{p}} and y 2 , … , y q {\displaystyle y_{2},\ldots ,y_{q}} of codewords such that x 2 ⋯ x p = w y 2 ⋯ y q {\displaystyle x_{2}\cdots x_{p}=wy_{2}\cdots y_{q}} , then we are finished: For then the string x = x 1 x 2 ⋯ x p {\displaystyle x=x_{1}x_{2}\cdots x_{p}} can alternatively be decomposed as y 1 y 2 ⋯ y q {\displaystyle y_{1}y_{2}\cdots y_{q}} , and we have found the desired string having at least two different decompositions into codewords. In the second round, we try out two different approaches: the first trial is to look for a codeword that has w as prefix. Then we obtain a new dangling suffix w, with which we can continue our search. If we eventually encounter a dangling suffix that is itself a codeword (or the empty word), then the search will terminate, as we know there exists a string with two decompositions. The second trial is to seek for a codeword that is itself a prefix of w. In our example, we have w = 10 {\displaystyle w={\texttt {10}}} , and the sequence 1 is a codeword. We can thus also continue with w = 0 {\displaystyle w={\texttt {0}}} as the new dangling suffix. == Precise description of the algorithm == The algorithm is described most conveniently using quotients of formal languages. In general, for two sets of strings D and N, the (left) quotient N − 1 D {\displaystyle N^{-1}D} is defined as the residual words obtained from D by removing some prefix in N. Formally, N − 1 D = { y ∣ x y ∈ D and x ∈ N } {\displaystyle N^{-1}D=\{\,y\mid xy\in D~{\textrm {and}}~x\in N\,\}} . Now let C {\displaystyle C} denote the (finite) set of codewords in the given code. The algorithm proceeds in rounds, where we maintain in each round not only one dangling suffix as described above, but the (finite) set of all potential dangling suffixes. Starting with round i = 1 {\displaystyle i=1} , the set of potential dangling suffixes will be denoted by S i {\displaystyle S_{i}} . The sets S i {\displaystyle S_{i}} are defined inductively as follows: S 1 = C − 1 C ∖ { ε } {\displaystyle S_{1}=C^{-1}C\setminus \{\varepsilon \}} . Here, the symbol ε {\displaystyle \varepsilon } denotes the empty word. S i + 1 = C − 1 S i ∪ S i − 1 C {\displaystyle S_{i+1}=C^{-1}S_{i}\cup S_{i}^{-1}C} , for all i ≥ 1 {\displaystyle i\geq 1} . The algorithm computes the sets S i {\displaystyle S_{i}} in increasing order of i {\displaystyle i} . As soon as one of the S i {\displaystyle S_{i}} contains a word from C or the empty word, then the algorithm terminates and answers that the given code is not uniquely decodable. Otherwise, once a set S i {\displaystyle S_{i}} equals a previously encountered set S j {\displaystyle S_{j}} with j < i {\displaystyle j Read more →

  • Rendezvous hashing

    Rendezvous hashing

    Rendezvous or highest random weight (HRW) hashing is an algorithm that allows clients to achieve distributed agreement on a set of k {\displaystyle k} options out of a possible set of n {\displaystyle n} options. A typical application is when clients need to agree on which sites (or proxies) objects are assigned to. Consistent hashing addresses the special case k = 1 {\displaystyle k=1} using a different method. Rendezvous hashing is both much simpler and more general than consistent hashing (see below). == History == Rendezvous hashing was invented by David Thaler and Chinya Ravishankar at the University of Michigan in 1996. Consistent hashing appeared a year later in the literature. Given its simplicity and generality, rendezvous hashing is now being preferred to consistent hashing in real-world applications. Rendezvous hashing was used very early on in many applications including mobile caching, router design, secure key establishment, and sharding and distributed databases. Other examples of real-world systems that use Rendezvous Hashing include the GitHub load balancer, the Apache Ignite distributed database, the Tahoe-LAFS file store, the CoBlitz large-file distribution service, Apache Druid, IBM's Cloud Object Store, the Arvados Data Management System, Apache Kafka, and the Twitter EventBus pub/sub platform. One of the first applications of rendezvous hashing was to enable multicast clients on the Internet (in contexts such as the MBONE) to identify multicast rendezvous points in a distributed fashion. It was used in 1998 by Microsoft's Cache Array Routing Protocol (CARP) for distributed cache coordination and routing. Some Protocol Independent Multicast routing protocols use rendezvous hashing to pick a rendezvous point. == Problem definition and approach == === Algorithm === Rendezvous hashing solves a general version of the distributed hash table problem: We are given a set of n {\displaystyle n} sites (servers or proxies, say). How can any set of clients, given an object O {\displaystyle O} , agree on a k-subset of sites to assign to O {\displaystyle O} ? The standard version of the problem uses k = 1. Each client is to make its selection independently, but all clients must end up picking the same subset of sites. This is non-trivial if we add a minimal disruption constraint, and require that when a site fails or is removed, only objects mapping to that site need be reassigned to other sites. The basic idea is to give each site S j {\displaystyle S_{j}} a score (a weight) for each object O i {\displaystyle O_{i}} , and assign the object to the highest scoring site. All clients first agree on a hash function h ( ⋅ ) {\displaystyle h(\cdot )} . For object O i {\displaystyle O_{i}} , the site S j {\displaystyle S_{j}} is defined to have weight w i , j = h ( O i , S j ) {\displaystyle w_{i,j}=h(O_{i},S_{j})} . Each client independently computes these weights w i , 1 , w i , 2 … w i , n {\displaystyle w_{i,1},w_{i,2}\dots w_{i,n}} and picks the k sites that yield the k largest hash values. The clients have thereby achieved distributed k {\displaystyle k} -agreement. If a site S {\displaystyle S} is added or removed, only the objects mapping to S {\displaystyle S} are remapped to different sites, satisfying the minimal disruption constraint above. The HRW assignment can be computed independently by any client, since it depends only on the identifiers for the set of sites S 1 , S 2 … S n {\displaystyle S_{1},S_{2}\dots S_{n}} and the object being assigned. HRW easily accommodates different capacities among sites. If site S k {\displaystyle S_{k}} has twice the capacity of the other sites, we simply represent S k {\displaystyle S_{k}} twice in the list, say, as S k , 1 , S k , 2 {\displaystyle S_{k,1},S_{k,2}} . Clearly, twice as many objects will now map to S k {\displaystyle S_{k}} as to the other sites. === Properties === Consider the simple version of the problem, with k = 1, where all clients are to agree on a single site for an object O. Approaching the problem naively, it might appear sufficient to treat the n sites as buckets in a hash table and hash the object name O into this table. Unfortunately, if any of the sites fails or is unreachable, the hash table size changes, forcing all objects to be remapped. This massive disruption makes such direct hashing unworkable. Under rendezvous hashing, however, clients handle site failures by picking the site that yields the next largest weight. Remapping is required only for objects currently mapped to the failed site, and disruption is minimal. Rendezvous hashing has the following properties: Low overhead: The hash function used is efficient, so overhead at the clients is very low. Load balancing: Since the hash function is randomizing, each of the n sites is equally likely to receive the object O. Loads are uniform across the sites. Site capacity: Sites with different capacities can be represented in the site list with multiplicity in proportion to capacity. A site with twice the capacity of the other sites will be represented twice in the list, while every other site is represented once. High hit rate: Since all clients agree on placing an object O into the same site SO, each fetch or placement of O into SO yields the maximum utility in terms of hit rate. The object O will always be found unless it is evicted by some replacement algorithm at SO. Minimal disruption: When a site fails, only the objects mapped to that site need to be remapped. Disruption is at the minimal possible level. Distributed k-agreement: Clients can reach distributed agreement on k sites simply by selecting the top k sites in the ordering. == O(log n) running time via skeleton-based hierarchical rendezvous hashing == The standard version of Rendezvous Hashing described above works quite well for moderate n, but when n {\displaystyle n} is extremely large, the hierarchical use of Rendezvous Hashing achieves O ( log ⁡ n ) {\displaystyle O(\log n)} running time. This approach creates a virtual hierarchical structure (called a "skeleton"), and achieves O ( log ⁡ n ) {\displaystyle O(\log n)} running time by applying HRW at each level while descending the hierarchy. The idea is to first choose some constant m {\displaystyle m} and organize the n {\displaystyle n} sites into c = ⌈ n / m ⌉ {\displaystyle c=\lceil n/m\rceil } clusters C 1 = { S 1 , S 2 … S m } , C 2 = { S m + 1 , S m + 2 … S 2 m } … {\displaystyle C_{1}=\left\{S_{1},S_{2}\dots S_{m}\right\},C_{2}=\left\{S_{m+1},S_{m+2}\dots S_{2m}\right\}\dots } Next, build a virtual hierarchy by choosing a constant f {\displaystyle f} and imagining these c {\displaystyle c} clusters placed at the leaves of a tree T {\displaystyle T} of virtual nodes, each with fanout f {\displaystyle f} . In the accompanying diagram, the cluster size is m = 4 {\displaystyle m=4} , and the skeleton fanout is f = 3 {\displaystyle f=3} . Assuming 108 sites (real nodes) for convenience, we get a three-tier virtual hierarchy. Since f = 3 {\displaystyle f=3} , each virtual node has a natural numbering in octal. Thus, the 27 virtual nodes at the lowest tier would be numbered 000 , 001 , 002 , . . . , 221 , 222 {\displaystyle 000,001,002,...,221,222} in octal (we can, of course, vary the fanout at each level - in that case, each node will be identified with the corresponding mixed-radix number). The easiest way to understand the virtual hierarchy is by starting at the top, and descending the virtual hierarchy. We successively apply Rendezvous Hashing to the set of virtual nodes at each level of the hierarchy, and descend the branch defined by the winning virtual node. We can in fact start at any level in the virtual hierarchy. Starting lower in the hierarchy requires more hashes, but may improve load distribution in the case of failures. For example, instead of applying HRW to all 108 real nodes in the diagram, we can first apply HRW to the 27 lowest-tier virtual nodes, selecting one. We then apply HRW to the four real nodes in its cluster, and choose the winning site. We only need 27 + 4 = 31 {\displaystyle 27+4=31} hashes, rather than 108. If we apply this method starting one level higher in the hierarchy, we would need 9 + 3 + 4 = 16 {\displaystyle 9+3+4=16} hashes to get to the winning site. The figure shows how, if we proceed starting from the root of the skeleton, we may successively choose the virtual nodes ( 2 ) 3 {\displaystyle (2)_{3}} , ( 20 ) 3 {\displaystyle (20)_{3}} , and ( 200 ) 3 {\displaystyle (200)_{3}} , and finally end up with site 74. The virtual hierarchy need not be stored, but can be created on demand, since the virtual nodes names are simply prefixes of base- f {\displaystyle f} (or mixed-radix) representations. We can easily create appropriately sorted strings from the digits, as required. In the example, we would be working with the strings 0 , 1 , 2 {\displaystyle 0,1,2} (at tier 1), 20 , 21 , 22 {\displaystyle 20,21,22} (at tier 2), and 200 , 201 , 202

    Read more →
  • Bibliometrician

    Bibliometrician

    A bibliometrician is a researcher or a specialist in bibliometrics. It is near-synonymous with an informetrican (who studies informetrics), a scientometrican (who study scientometrics) and a webometrician, who study webometrics. == Notable bibliometricians == Christine L. Borgman Samuel C. Bradford Blaise Cronin Margaret Elizabeth Egan Eugene Garfield (developer of the Science Citation Index and the Impact factor) Jorge E. Hirsch (developer of the h-index) Alfred J. Lotka Vasily Nalimov Derek J. de Solla Price Ronald Rousseau George Kingsley Zipf

    Read more →
  • Belief–desire–intention model

    Belief–desire–intention model

    For popular psychology, the belief–desire–intention (BDI) model of human practical reasoning was developed by Michael Bratman as a way of explaining future-directed intention. BDI is fundamentally reliant on folk psychology (the 'theory theory'), which is the notion that our mental models of the world are theories. It was used as a basis for developing the belief–desire–intention software model. == Applications == BDI was part of the inspiration behind the BDI software architecture, which Bratman was also involved in developing. Here, the notion of intention was seen as a way of limiting time spent on deliberating about what to do, by eliminating choices inconsistent with current intentions. BDI has also aroused some interest in psychology. BDI formed the basis for a computational model of childlike reasoning CRIBB.

    Read more →
  • Recommender system

    Recommender system

    A recommender system, also called a recommendation algorithm, recommendation engine, or recommendation platform, is a type of information filtering system that suggests items most relevant to a particular user. The value of these systems becomes particularly evident in scenarios where users must select from a large number of options, such as products, media, or content. Major social media platforms and streaming services rely on recommender systems that employ machine learning to analyze user behavior and preferences, thereby enabling personalized content feeds. Typically, the suggestions refer to a variety decision-making processes, including the selection of a product, musical selection, or online news source to read. The implementation of recommender systems is pervasive, with commonly recognised examples including the generation of playlist for video and music services, the provision of product recommendations for e-commerce platforms, and the recommendation of content on social media platforms and the open web. These systems can operate using a single type of input, such as music, or multiple inputs from diverse platforms, including news, books and search queries. Additionally, popular recommender systems have been developed for specific topics, such as restaurants and online dating services. Recommender systems have also been developed to explore research articles and experts, collaborators, and financial services. A content discovery platform is a software recommendation platform that employs recommender system tools. It utilizes user metadata in order to identify and suggest relevant content, whilst reducing ongoing maintenance and development costs. A content discovery platform delivers personalized content to websites, mobile devices, and set-top boxes. A large range of content discovery platforms currently exist for various forms of content ranging from news articles and academic journal articles to television. As operators compete to serve as the gateway to home entertainment, personalized television emerges as a key service differentiator. Academic content discovery has recently become another area of interest, the emergence of numerous companies dedicated to assisting academic researchers in keeping up to date with relevant academic content and facilitating serendipitous discovery of new content. == Overview == Recommender systems usually make use of either or both collaborative filtering and content-based filtering, as well as other systems such as knowledge-based systems. Collaborative filtering approaches build a model from a user's past behavior (e.g., items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete, pre-tagged characteristics of an item in order to recommend additional items with similar properties. === Example === The differences between collaborative and content-based filtering can be demonstrated by comparing two early music recommender systems, Last.fm and Pandora Radio. We can also look at how these methods are applied in e-commerce, for example, on platforms like Amazon. Last.fm creates a "station" of recommended songs by observing what bands and individual tracks the user has listened to on a regular basis and comparing those against the listening behavior of other users. Last.fm will play tracks that do not appear in the user's library, but are often played by other users with similar interests. As this approach leverages the behavior of users, it is an example of a collaborative filtering technique. Pandora uses the properties of a song or artist (a subset of the 450 attributes provided by the Music Genome Project) to seed a "station" that plays music with similar properties. User feedback is used to refine the station's results, deemphasizing certain attributes when a user "dislikes" a particular song and emphasizing other attributes when a user "likes" a song. This is an example of a content-based approach. In e-commerce, Amazon's well-known "customers who bought X also bought Y" feature is a prime example of collaborative filtering. It also uses content-based filtering when it recommends a book by the same author you've previously read or a pair of shoes in a similar style to ones you've viewed. Each type of system has its strengths and weaknesses. In the above example, Last.fm requires a large amount of information about a user to make accurate recommendations. This is an example of the cold start problem, and is common in collaborative filtering systems. Whereas Pandora needs very little information to start, it is far more limited in scope (for example, it can only make recommendations that are similar to the original seed). === Alternative implementations === Recommender systems are a useful alternative to search algorithms since they help users discover items they might not have found otherwise. Of note, recommender systems are often implemented using search engines indexing non-traditional data. In some cases, like in the Gonzalez v. Google Supreme Court case, may argue that search and recommendation algorithms are different technologies. Recommender systems have been the focus of several granted patents, and there are more than 50 software libraries that support the development of recommender systems including LensKit, RecBole, ReChorus and RecPack. == History == Elaine Rich created the first recommender system in 1979, called Grundy. She looked for a way to recommend users books they might like. Her idea was to create a system that asks users specific questions and classifies them into classes of preferences, or "stereotypes", depending on their answers. Depending on users' stereotype membership, they would then get recommendations for books they might like. Another early recommender system, called a "digital bookshelf", was described in a 1990 technical report by Jussi Karlgren at Columbia University, and implemented at scale and worked through in technical reports and publications from 1994 onwards by Jussi Karlgren, then at SICS, and research groups led by Pattie Maes at MIT, Will Hill at Bellcore, and Paul Resnick, also at MIT, whose work with GroupLens was awarded the 2010 ACM Software Systems Award. Montaner provided the first overview of recommender systems from an intelligent agent perspective. Adomavicius provided a new, alternate overview of recommender systems. Herlocker provides an additional overview of evaluation techniques for recommender systems, and Beel et al. discussed the problems of offline evaluations. Beel et al. have also provided literature surveys on available research paper recommender systems and existing challenges. == Approaches == === Collaborative filtering === One approach to the design of recommender systems that has wide use is collaborative filtering. Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past. The system generates recommendations using only information about rating profiles for different users or items. By locating peer users/items with a rating history similar to the current user or item, they generate recommendations using this neighborhood. This approach is a cornerstone for e-commerce sites that analyze the purchasing patterns of thousands of users to suggest what you might like. Collaborative filtering methods are classified as memory-based and model-based. A well-known example of memory-based approaches is the user-based algorithm, while that of model-based approaches is matrix factorization (recommender systems). A key advantage of the collaborative filtering approach is that it does not rely on machine analyzable content and therefore it is capable of accurately recommending complex items such as movies without requiring an "understanding" of the item itself. Many algorithms have been used in measuring user similarity or item similarity in recommender systems. For example, the k-nearest neighbor (k-NN) approach and the Pearson Correlation as first implemented by Allen. When building a model from a user's behavior, a distinction is often made between explicit and implicit forms of data collection. Examples of explicit data collection include the following: Asking a user to rate an item on a sliding scale. Asking a user to search. Asking a user to rank a collection of items from favorite to least favorite. Presenting two items to a user and asking him/her to choose the better one of them. Asking a user to create a list of items that he/she likes (see Rocchio classification or other similar techniques). Examples of implicit data collection include the following: Observing the items that a user views in an online store, media library, or other repository of med

    Read more →
  • Information

    Information

    Information is an abstract concept that refers to something which has the power to inform. At the most fundamental level, it pertains to the interpretation (perhaps formally) of that which may be sensed, or their abstractions. Any natural process that is not completely random and any observable pattern in any medium can be said to convey some amount of information. Whereas digital signals and other data use discrete signs to convey information, other phenomena and artifacts such as analogue signals, poems, pictures, music or other sounds, and currents convey information in a more continuous form. Information is not knowledge itself, but the meaning that may be derived from a representation through interpretation. The concept of information is relevant to and connected with various concepts, including constraint, communication, control, data, form, education, knowledge, meaning, understanding, mental stimuli, pattern, perception, proposition, representation, and entropy. Information is often processed iteratively: Data available at one step are processed into information to be interpreted and processed at the next step. For example, in written text each symbol or letter conveys information relevant to the word it is part of, each word conveys information relevant to the phrase it is part of, each phrase conveys information relevant to the sentence it is part of, and so on until at the final step information is interpreted and becomes knowledge in a given domain. In a digital signal, bits may be interpreted into the symbols, letters, numbers, or structures that convey the information available at the next level up. The key characteristic of information is that it is subject to interpretation and processing. The derivation of information from a signal or message may be thought of as the resolution of ambiguity or uncertainty that arises during the interpretation of patterns within the signal or message. Information may be structured as data. Redundant data can be compressed up to an optimal size, which is the theoretical limit of compression. The information available through a collection of data may be derived by analysis. For example, a restaurant collects data from every customer order. That information may be analyzed to produce knowledge that is put to use when the business subsequently wants to identify the most popular or least popular dish. Information can be transmitted in time, via data storage, and space, via communication and telecommunication. Information is expressed either as the content of a message or through direct or indirect observation. That which is perceived can be construed as a message in its own right, and in that sense, all information is always conveyed as the content of a message. Information can be encoded into various forms for transmission and interpretation (for example, information may be encoded into a sequence of signs, or transmitted via a signal). It can also be encrypted for safe storage and communication. The uncertainty of an event is measured by its probability of occurrence. Uncertainty is proportional to the negative logarithm of the probability of occurrence. Information theory takes advantage of this by concluding that more uncertain events require more information to resolve their uncertainty. The bit is the standard unit of information. It is 'that which reduces uncertainty by half'. Other units such as the nat may be used. For example, the information encoded in one "fair" coin flip is log2(2/1) = 1 bit, and in two fair coin flips is log2(4/1) = 2 bits. A 2011 Science article estimates that 97% of technologically stored information was already in digital bits in 2007 and that the year 2002 was the beginning of the digital age for information storage (with digital storage capacity bypassing analogue for the first time). == Etymology and history of the concept == The English word "information" comes from Middle French enformacion/informacion/information 'a criminal investigation' and its etymon, Latin informatiō(n) 'conception, teaching, creation'. In English, "information" is an uncountable mass noun. References on "formation or molding of the mind or character, training, instruction, teaching" date from the 14th century in both English (according to Oxford English Dictionary) and other European languages. In the transition from Middle Ages to Modernity the use of the concept of information reflected a fundamental turn in epistemological basis – from "giving a (substantial) form to matter" to "communicating something to someone". Peters (1988, pp. 12–13) concludes: Information was readily deployed in empiricist psychology (though it played a less important role than other words such as impression or idea) because it seemed to describe the mechanics of sensation: objects in the world inform the senses. But sensation is entirely different from "form" – the one is sensual, the other intellectual; the one is subjective, the other objective. My sensation of things is fleeting, elusive, and idiosyncratic. For Hume, especially, sensory experience is a swirl of impressions cut off from any sure link to the real world... In any case, the empiricist problematic was how the mind is informed by sensations of the world. At first informed meant shaped by; later it came to mean received reports from. As its site of action drifted from cosmos to consciousness, the term's sense shifted from unities (Aristotle's forms) to units (of sensation). Information came less and less to refer to internal ordering or formation, since empiricism allowed for no preexisting intellectual forms outside of sensation itself. Instead, information came to refer to the fragmentary, fluctuating, haphazard stuff of sense. Information, like the early modern worldview in general, shifted from a divinely ordered cosmos to a system governed by the motion of corpuscles. Under the tutelage of empiricism, information gradually moved from structure to stuff, from form to substance, from intellectual order to sensory impulses. In the modern era, the most important influence on the concept of information is derived from the Information theory developed by Claude Shannon and others. This theory, however, reflects a fundamental contradiction. Northrup (1993) wrote: Thus, actually two conflicting metaphors are being used: The well-known metaphor of information as a quantity, like water in the water-pipe, is at work, but so is a second metaphor, that of information as a choice, a choice made by :an information provider, and a forced choice made by an :information receiver. Actually, the second metaphor implies that the information sent isn't necessarily equal to the information received, because any choice implies a comparison with a list of possibilities, i.e., a list of possible meanings. Here, meaning is involved, thus spoiling the idea of information as a pure "Ding an sich." Thus, much of the confusion regarding the concept of information seems to be related to the basic confusion of metaphors in Shannon's theory: is information an autonomous quantity, or is information always per SE information to an observer? Actually, I don't think that Shannon himself chose one of the two definitions. Logically speaking, his theory implied information as a subjective phenomenon. But this had so wide-ranging epistemological impacts that Shannon didn't seem to fully realize this logical fact. Consequently, he continued to use metaphors about information as if it were an objective substance. This is the basic, inherent contradiction in Shannon's information theory." (Northrup, 1993, p. 5). In their seminal book The Study of Information: Interdisciplinary Messages, Almach and Mansfield (1983) collected key views on the interdisciplinary controversy in computer science, artificial intelligence, library and information science, linguistics, psychology, and physics, as well as in the social sciences. Almach (1983, p. 660) himself disagrees with the use of the concept of information in the context of signal transmission, the basic senses of information in his view all referring "to telling something or to the something that is being told. Information is addressed to human minds and is received by human minds." All other senses, including its use with regard to nonhuman organisms as well to society as a whole, are, according to Machlup, metaphoric and, as in the case of cybernetics, anthropomorphic. Hjørland (2007) describes the fundamental difference between objective and subjective views of information and argues that the subjective view has been supported by, among others, Bateson, Yovits, Span-Hansen, Brier, Buckland, Goguen, and Hjørland. Hjørland provided the following example: A stone on a field could contain different information for different people (or from one situation to another). It is not possible for information systems to map all the stone's possible information for every individual. Nor is any one mapping the one "true" mapping. But peop

    Read more →
  • Bibliometrician

    Bibliometrician

    A bibliometrician is a researcher or a specialist in bibliometrics. It is near-synonymous with an informetrican (who studies informetrics), a scientometrican (who study scientometrics) and a webometrician, who study webometrics. == Notable bibliometricians == Christine L. Borgman Samuel C. Bradford Blaise Cronin Margaret Elizabeth Egan Eugene Garfield (developer of the Science Citation Index and the Impact factor) Jorge E. Hirsch (developer of the h-index) Alfred J. Lotka Vasily Nalimov Derek J. de Solla Price Ronald Rousseau George Kingsley Zipf

    Read more →
  • Corpus of Linguistic Acceptability

    Corpus of Linguistic Acceptability

    Corpus of Linguistic Acceptability (CoLA) is a dataset the primary purpose of which is to serve as a benchmark for evaluating the ability of artificial neural networks, including large language models, to judge the grammatical correctness of sentences. It consists of 10,657 English sentences from published linguistics literature that were manually labeled either as grammatical or ungrammatical. == Public version == The publicly available version of CoLA contains 9,594 sentences that belong to training and development sets. It excludes 1,063 sentences reserved for a held-out test set.

    Read more →
  • Reverse data management

    Reverse data management

    Reverse data management describes a branch and set of research questions in relational database theory that aim to reverse the common focus of standard data management. Instead of focusing on the "forward" transformation of an input databases (a set of relational tables) to an output table, which is the main focus of standard query evaluation, reverse data management reverses that focus and studies the possible input database transformations that would achieve a desired output. Usually the objective is to find an intervention (a deletion, addition, or change of tuples) of minimal size, in order to achieve a particular change in the output. The problem has been studied at least since the 1980s, but has received renewed attention due to an influential paper in the early 2000s that made a connection between provenance and view propagation. The term was coined in a VLDB 2011 vision paper. The problem has been receiving significant attention in recent years due to its connection to computational fairness. == Topics in reverse data management problems == Example topics in reverse data management include: Deletion propagation with source side-effects: Find a minimal number of tuples to delete in the database in order to delete a particular tuple in the output. Deletion propagation with view side-effects: Find a set of tuples to delete in the database in order to delete a particular tuple in the output, while removing the minimal number of other output tuples. Causal responsibility: Find a minimal number of tuples to delete in the database in order to make a particular input tuple counterfactual. This notion is inspired by the notions of actual cause and causal responsibility from the work of Halpern and Pearl. Resilience: Find a minimal number of tuples to delete in the database in order to make a Boolean query false. The complexity of this problem is identical to the problem of deletion propagation with source-side effects over a different database. Smallest witness problem: Find a minimal number of tuples to keep in the a database (or equivalently, delete a maximal number of tuples) while keeping a particular tuple in the output. Minimum repair: Given a database that violates certain integrity constraints, find a minimal number of tuples to delete in the database in order to fulfill all constraints (also called to "repair" the database).

    Read more →
  • Bisection (software engineering)

    Bisection (software engineering)

    Bisection is a method used in software development to identify change sets that result in a specific behavior change. It is mostly employed for finding the patch that introduced a bug. Another application area is finding the patch that indirectly fixed a bug. == Overview == The process of locating the changeset that introduced a specific regression was described as "source change isolation" in 1997 by Brian Ness and Viet Ngo of Cray Research. Regression testing was performed on Cray's compilers in editions comprising one or more changesets. Editions with known regressions could not be validated until developers addressed the problem. Source change isolation narrowed the cause to a single changeset that could then be excluded from editions, unblocking them with respect to this problem, while the author of the change worked on a fix. Ness and Ngo outlined linear search and binary search methods of performing this isolation. Code bisection has the goal of minimizing the effort to find a specific change set. It employs a divide and conquer algorithm that depends on having access to the code history which is usually preserved by revision control in a code repository. == Bisection method == === Code bisection algorithm === Code history has the structure of a directed acyclic graph which can be topologically sorted. This makes it possible to use a divide and conquer search algorithm which: splits up the search space of candidate revisions tests for the behavior in question reduces the search space depending on the test result re-iterates the steps above until a range with at most one bisectable patch candidate remains === Algorithmic complexity === Bisection is in LSPACE having an algorithmic complexity of O ( log ⁡ N ) {\displaystyle O(\log N)} with N {\displaystyle N} denoting the number of revisions in the search space, and is similar to a binary search. === Desirable repository properties === For code bisection it is desirable that each revision in the search space can be built and tested independently. === Monotonicity === For the bisection algorithm to identify a single changeset which caused the behavior being tested to change, the behavior must change monotonically across the search space. For a Boolean function such as a pass/fail test, this means that it only changes once across all changesets between the start and end of the search space. If there are multiple changesets across the search space where the behavior being tested changes between false and true, then the bisection algorithm will find one of them, but it will not necessarily be the root cause of the change in behavior between the start and the end of the search space. The root cause could be a different changeset, or a combination of two or more changesets across the search space. To help deal with this problem, automated tools allow specific changesets to be ignored during a bisection search. == Automation support == Although the bisection method can be completed manually, one of its main advantages is that it can be easily automated. It can thus fit into existing test automation processes: failures in exhaustive automated regression tests can trigger automated bisection to localize faults. Ness and Ngo focused on its potential in Cray's continuous delivery-style environment in which the automatically isolated bad changeset could be automatically excluded from builds. The revision control systems Fossil, Git and Mercurial have built-in functionality for code bisection. The user can start a bisection session with a specified range of revisions from which the revision control system proposes a revision to test, the user tells the system whether the revision tested as "good" or "bad", and the process repeats until the specific "bad" revision has been identified. Other revision control systems, such as Bazaar or Subversion, support bisection through plugins or external scripts. Phoronix Test Suite can do bisection automatically to find performance regressions.

    Read more →