AI Data Farms

AI Data Farms — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Teaspiller

    Teaspiller

    Teaspiller was a US-based web application for customers to find accountants and hire them to do their taxes and accounting online. In 2013 the company was acquired by Intuit, Inc and added to its TurboTax product line. The Teaspiller employees and code were all acquired and the product was renamed as "TurboTax CPA select". It enabled accountants to work remotely with clients (share files, send secure messages, schedule appointments), as well as find new clients looking for their specific skills through a complex search algorithm. This was done through extended profiles containing licensing information, professional histories, user ratings, peer endorsements, association memberships, and practice areas. The service had been called an H&R Block killer by Business Insider as it helped customers find accountants to prepare tax returns online. As of 2011 it had 20,000 US accountants listed on the site. The application was built using the Django framework. == History == Teaspiller was built by Vemdara, LLC, a web company based in New York and founded in 2009 by Amit Vemuri (a former VP at Travelocity). The web application was launched in 2010. In 2013 the company was acquired by Intuit as part of their TurboTax product line and renamed as "TurboTax CPA select".

    Read more →
  • Archival bond

    Archival bond

    The archival bond is a concept in archival theory referring to the relationship that each archival record has with the other records produced as part of the same transaction or activity and located within the same grouping. These bonds are a core component of each individual record and are necessary for transforming a document into a record, as a document will only acquire meaning (and become a record) through its interrelationships with other records. == Description == The concept of the archival bond is primarily associated with the work of Luciana Duranti along with Heather MacNeil, as part of research into the integrity of electronic records. Duranti resumed and extended the concept of vincolo archivistico (archival bond), first expressed in 1937 by archivist Giorgio Cencetti of the Italian archival school. This bond emerges from the fact that electronic records are not physically arranged like traditional records. For traditional, analog records, their bond is implicit in their arrangement. But for electronic records, this bond must be made explicit due to the lack of a single sequential order of records in a digital environment. The archival bond was one of the core concepts of the subsequent International Research on Permanent Authentic Records in Electronic Systems (InterPARES) project and can be found in the InterPARES glossary. As Duranti notes, the archival bond is not to be confused with the broader term "context" as context exists independently of a record, while "the archival bond is an essential part of the record, which would not exist without it."

    Read more →
  • TurboQuant

    TurboQuant

    TurboQuant is an online vector quantization algorithm for compressing high-dimensional Euclidean vectors while preserving their geometric structure. It was proposed in 2025 by Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni in the paper TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. The paper lists Zandieh and Mirrokni as affiliated with Google Research, Daliri with New York University, and Hadian with Google DeepMind. The method was developed for applications including large language model (LLM) inference, key–value (KV) cache compression, vector databases, and nearest neighbor search. TurboQuant consists of two related algorithms: TurboQuantmse, which is optimized for mean squared error (MSE), and TurboQuantprod, which is optimized for unbiased inner product estimation. The algorithm uses a random rotation of input vectors, applies scalar quantizers to the rotated coordinates, and, for inner-product estimation, applies a one-bit Quantized Johnson–Lindenstrauss (QJL) transform to the residual error. == Background == Vector quantization is a compression method that maps high-dimensional vectors to a finite set of codewords. The problem has roots in Shannon's source coding theory and rate–distortion theory. In machine learning and information retrieval, vector quantization is used to reduce the memory required to store embeddings, activation vectors, and other numerical representations. In Transformer-based large language models, the KV cache stores key and value vectors from previous tokens during autoregressive decoding. The size of this cache grows with context length, the number of attention heads, and the number of concurrent requests, making it a major memory bottleneck in LLM serving. Similar compression problems appear in vector search, where large collections of embedding vectors must be stored and searched efficiently. Earlier approaches to vector quantization include product quantization, scalar quantization, and data-dependent k-means codebook construction. The TurboQuant paper argues that many existing methods either require offline preprocessing and calibration or suffer from suboptimal distortion guarantees in online settings. == Algorithm == === TurboQuantmse === TurboQuantmse is the version of the algorithm optimized for mean-squared error. For a unit vector x ∈ S d − 1 {\displaystyle x\in S^{d-1}} , the algorithm first applies a random rotation matrix Π ∈ R d × d {\displaystyle \Pi \in \mathbb {R} ^{d\times d}} and sets z = Π x {\displaystyle z=\Pi x} . Each coordinate of the rotated vector follows a shifted and scaled beta distribution, which converges to a normal distribution in high dimensions. In high dimensions, distinct coordinates also become nearly independent, allowing the algorithm to apply scalar quantizers independently to each coordinate. The scalar quantizer is constructed by solving a one-dimensional continuous k-means or Lloyd–Max quantization problem. If the centroids are c 1 , c 2 , … , c 2 b {\displaystyle c_{1},c_{2},\ldots ,c_{2^{b}}} , the quantization step stores, for each coordinate, i d x j = ⁡ a r g m i n k ∈ [ 2 b ] | z j − c k | . {\displaystyle \mathrm {idx} _{j}=\operatorname {} {arg\,min}_{k\in [2^{b}]}|z_{j}-c_{k}|.} During dequantization, the stored index for each coordinate is replaced by the corresponding centroid, giving a reconstructed rotated vector z ~ {\displaystyle {\tilde {z}}} . The algorithm then rotates back: x ~ = Π ⊤ z ~ . {\displaystyle {\tilde {x}}=\Pi ^{\top }{\tilde {z}}.} The paper gives the following bound for TurboQuantmse: D m s e ≤ 3 π 2 ⋅ 1 4 b . {\displaystyle D_{\mathrm {mse} }\leq {\frac {\sqrt {3\pi }}{2}}\cdot {\frac {1}{4^{b}}}.} It also reports finer-grained MSE values of approximately 0.36, 0.117, 0.03, and 0.009 for bit-widths b = 1 , 2 , 3 , 4 {\displaystyle b=1,2,3,4} , respectively. === TurboQuantprod === TurboQuantprod is optimized for unbiased inner-product estimation. The authors note that an MSE-optimized quantizer may introduce bias when used to estimate inner products. To address this, TurboQuantprod first applies TurboQuantmse with bit-width b − 1 {\displaystyle b-1} , then applies a one-bit Quantized Johnson–Lindenstrauss transform to the remaining residual vector. Let r = x − Q m s e − 1 ( Q m s e ( x ) ) {\displaystyle r=x-Q_{\mathrm {mse} }^{-1}(Q_{\mathrm {mse} }(x))} be the residual after MSE quantization, and let γ = ‖ r ‖ 2 {\displaystyle \gamma =\|r\|_{2}} . The QJL step stores a sign vector for the residual. For γ ≠ 0 {\displaystyle \gamma \neq 0} , this can be written using the normalized residual u = r / γ {\displaystyle u=r/\gamma } : q j l = sign ⁡ ( S u ) , {\displaystyle qjl=\operatorname {sign} (Su),} where S ∈ R d × d {\displaystyle S\in \mathbb {R} ^{d\times d}} is a random projection matrix. Since the sign function is invariant under positive rescaling, this is equivalent to sign ⁡ ( S r ) {\displaystyle \operatorname {sign} (Sr)} when r ≠ 0 {\displaystyle r\neq 0} . If γ = 0 {\displaystyle \gamma =0} , the residual correction is zero. TurboQuantprod stores the MSE quantization, the QJL sign vector, and the residual norm: Q p r o d ( x ) = [ Q m s e ( x ) , q j l , γ ] . {\displaystyle Q_{\mathrm {prod} }(x)=\left[Q_{\mathrm {mse} }(x),qjl,\gamma \right].} The dequantized vector is reconstructed as x ~ = x ~ m s e + π / 2 d γ S ⊤ q j l . {\displaystyle {\tilde {x}}={\tilde {x}}_{\mathrm {mse} }+{\frac {\sqrt {\pi /2}}{d}}\,\gamma S^{\top }qjl.} The paper proves that TurboQuantprod is unbiased for inner-product estimation: E x ~ [ ⟨ y , x ~ ⟩ ] = ⟨ y , x ⟩ . {\displaystyle \mathbb {E} _{\tilde {x}}\left[\langle y,{\tilde {x}}\rangle \right]=\langle y,x\rangle .} It also gives the distortion bound D p r o d ≤ 3 π 2 ⋅ ‖ y ‖ 2 2 d ⋅ 1 4 b . {\displaystyle D_{\mathrm {prod} }\leq {\frac {\sqrt {3\pi }}{2}}\cdot {\frac {\|y\|_{2}^{2}}{d}}\cdot {\frac {1}{4^{b}}}.} == Performance and applications == The TurboQuant paper reports that the algorithm achieves near-optimal distortion rates within a small constant factor of information-theoretic lower bounds. The authors report that, for KV cache quantization, TurboQuant achieved quality neutrality at 3.5 bits per channel and marginal degradation at 2.5 bits per channel. In long-context LLM experiments using Llama 3.1 8B Instruct, the paper evaluated the method on a "needle-in-a-haystack" retrieval task with document lengths from 4,000 to 104,000 tokens. It reported that TurboQuant matched the uncompressed full-precision baseline while using more than 4× compression, and compared the method against PolarQuant, SnapKV, PyramidKV, and KIVI. Google Research stated that TurboQuant was evaluated on long-context benchmarks including LongBench, Needle in a Haystack, ZeroSCROLLS, RULER, and L-Eval using open-source models including Gemma and Mistral. According to a report in Tom's Hardware, Google described the method as reducing KV-cache memory by at least six times and achieving up to an eightfold improvement in attention-logit computation on Nvidia H100 GPUs compared with unquantized 32-bit keys. TurboQuant has also been applied to nearest-neighbor vector search. The original paper reports experiments on DBpedia entity embeddings and GloVe embeddings, comparing TurboQuant with product quantization and other vector-search quantization baselines. == Relationship to other methods == TurboQuant is related to several methods for efficient large language model inference and high-dimensional search: Product quantization – a vector quantization technique widely used for approximate nearest-neighbor search Quantization (machine learning) – reducing the numerical precision of weights, activations, or cached tensors in machine learning models PagedAttention – a memory-management algorithm for LLM serving that reduces fragmentation in the KV cache Johnson–Lindenstrauss lemma – a result in high-dimensional geometry used in random projection methods Lloyd's algorithm – an algorithm for scalar and vector quantization, including k-means-style codebook construction Unlike PagedAttention, which focuses on memory allocation and cache layout, TurboQuant reduces the numerical storage cost of the vectors themselves. Unlike many product-quantization methods, TurboQuant is designed to be data-oblivious and online, avoiding dataset-specific codebook training. == Limitations == The strongest performance claims for TurboQuant come from the original paper and Google Research's own publication. Coverage in technology media has noted that the broader impact of the method will depend on real-world implementation details, workloads, and hardware architectures.

    Read more →
  • Vector-field consistency

    Vector-field consistency

    Vector-Field Consistency is a consistency model for replicated data (for example, objects), initially described in a paper which was awarded the best-paper prize in the ACM/IFIP/Usenix Middleware Conference 2007. It has since been enhanced for increased scalability and fault-tolerance in a recent paper. == Description == This consistency model was initially designed for replicated data management in ad hoc gaming in order to minimize bandwidth usage without sacrificing playability. Intuitively, it captures the notion that although players require, wish, and take advantage of information regarding the whole of the game world (as opposed to a restricted view to rooms, arenas, etc. of limited size employed in many multiplayer video games), they need to know information with greater freshness, frequency, and accuracy as other game entities are located closer and closer to the player's position. It prescribes a multidimensional divergence bounding scheme, based on a vector field that employs consistency vectors k=(θ,σ,ν), standing for maximum allowed time - or replica staleness, sequence - or missing updates, and value - or user-defined measured replica divergence, applied to all space coordinates in game scenario or world. The consistency vector-fields emanate from field-generators designated as pivots (for example, players) and field intensity attenuates as distance grows from these pivots in concentric or square-like regions. This consistency model unifies locality-awareness techniques employed in message routing and consistency enforcement for multiplayer games, with divergence bounding techniques traditionally employed in replicated database and web scenarios.

    Read more →
  • IgHome

    IgHome

    igHome is a customizable start page introduced in 2012 as an alternative to iGoogle, the personal web portal launched by Google in May 2005. Just like iGoogle, igHome offers users the possibility to build a start page containing a central search box and a number of gadgets. igHome mimics the user interface of iGoogle. Registered igHome users can create multiple tabs and import RSS feeds.

    Read more →
  • Driver scheduling problem

    Driver scheduling problem

    The driver scheduling problem (DSP) is type of problem in operations research and theoretical computer science. The DSP consists of selecting a set of duties (assignments) for the drivers or pilots of vehicles (e.g., buses, trains, boats, or planes) involved in the transportation of passengers or goods, within the constraints of various legislative and logistical criteria. == Criteria and modelling == This very complex problem involves several constraints related to labour and company rules and also different evaluation criteria and objectives. Being able to solve this problem efficiently can have a great impact on costs and quality of service for public transportation companies. There is a large number of different rules that a feasible duty might be required to satisfy, such as Minimum and maximum stretch duration Minimum and maximum break duration Minimum and maximum work duration Minimum and maximum total duration Maximum extra work duration Maximum number of vehicle changes Minimum driving duration of a particular vehicle Operations research has provided optimization models and algorithms that lead to efficient solutions for this problem. Among the most common models proposed to solve the DSP are the Set Covering and Set Partitioning Models (SPP/SCP). In the SPP model, each work piece (task) is covered by only one duty. In the SCP model, it is possible to have more than one duty covering a given work piece. In both models, the set of work pieces that needs to be covered is laid out in rows, and the set of previously defined feasible duties available for covering specific work pieces is arranged in columns. The DSP resolution, based on either of these models, is the selection of the set of feasible duties that guarantees that there is one (SPP) or more (SCP) duties covering each work piece while minimizing the total cost of the final schedule.

    Read more →
  • Australian Geoscience Data Cube

    Australian Geoscience Data Cube

    The Australian Geoscience Data Cube (AGDC) is an approach to storing, processing and analyzing large collections of Earth observation data. The technology is designed to meet challenges of national interest by being agile and flexible with vast amounts of layered grid data. The AGDC reduces processing time of traditional image analysis by calibrating, pre-computing known extents, pixel alignment and storing metadata in a cell lattice structure. The temporal-pixel aligned data can often be analysed faster across space and time dimensions than previous scene based techniques. This allows the AGDC to be flexible in tackling future challenges and improve analysis times on every-increasing data repositories of earth observation. The AGDC has also been used internationally to allow countries to maintain ecologically sustainable programs and reduce the difficulty curve of utilizing Remote Sensing data. == Background == The AGDC was originally conceived by Geoscience Australia but is now maintained in a partnership between Geoscience Australia, Commonwealth Scientific and Industrial Research Organisation (CSIRO) and National Computational Infrastructure National Facility (Australia) (NCI). This is made possible by the funding from the partnership and a number of organisations such as National Collaborative Research Infrastructure Strategy (NCRIS). == Analysis ready data, ingestion and indexing == The data processed in the cube is made analysis ready before being ingested and indexed into the AGDC. Analysis ready data is pre-processed data that has applied corrections for instrument calibration (gains and offsets), geolocation (spatial alignment) and radiometry (solar illumination, incidence angle, topography, atmospheric interference). The ingestion process manages the translation of datasets into the storage units while maintaining a database index. The data within the storage and index can be accessed via API calls often compiled within code such as Python (programming language). Example: s2a_l1c = dc.load(product='s2a_level1c_granule',x=(147.36, 147.41), y=(-35.1, -35.15), measurements=['04','03','02'], output_crs='EPSG:4326', resolution=(-0.00025,0.00025)) === Datasets currently stored === Geoscience Australia Landsat Surface Reflectance (1987 to present) Landsat Pixel Quality Landsat Fractional Cover Landsat NDVI === Datasets that have been piloted === USGS Landsat Surface Reflectance SRTM DEM Himawari 8 MODIS Sentinel-2 L1C / S2A Australian Gridded Climate Data == Open source == The AGDC code base is situated in GitHub as an open repository. The core code base moved to the Open Data Cube in early 2017 as part of an international collaboration. Whilst the code base is the Open Data Cube, individual cubes exist as their own right such as the AGDC on the National Computational Infrastructure National Facility (Australia) (NCI) using the High-Performance Computing Cluster HPCC. The core code can be installed on personal computers or public computers (using git) and has many unit tests. Documentation for the code base exists on Read the Docs. == Challenges of the AGDC == The AGDC is designed to meet nationally significant challenges such as the following. Sustainability Environment Water resource management Disaster assist Policy development Community planning Forest preservation Carbon measurement == International awards == The AGDC won the 2016 Content Platform of the Year award from Geospatial World Forum.

    Read more →
  • Rendezvous hashing

    Rendezvous hashing

    Rendezvous or highest random weight (HRW) hashing is an algorithm that allows clients to achieve distributed agreement on a set of k {\displaystyle k} options out of a possible set of n {\displaystyle n} options. A typical application is when clients need to agree on which sites (or proxies) objects are assigned to. Consistent hashing addresses the special case k = 1 {\displaystyle k=1} using a different method. Rendezvous hashing is both much simpler and more general than consistent hashing (see below). == History == Rendezvous hashing was invented by David Thaler and Chinya Ravishankar at the University of Michigan in 1996. Consistent hashing appeared a year later in the literature. Given its simplicity and generality, rendezvous hashing is now being preferred to consistent hashing in real-world applications. Rendezvous hashing was used very early on in many applications including mobile caching, router design, secure key establishment, and sharding and distributed databases. Other examples of real-world systems that use Rendezvous Hashing include the GitHub load balancer, the Apache Ignite distributed database, the Tahoe-LAFS file store, the CoBlitz large-file distribution service, Apache Druid, IBM's Cloud Object Store, the Arvados Data Management System, Apache Kafka, and the Twitter EventBus pub/sub platform. One of the first applications of rendezvous hashing was to enable multicast clients on the Internet (in contexts such as the MBONE) to identify multicast rendezvous points in a distributed fashion. It was used in 1998 by Microsoft's Cache Array Routing Protocol (CARP) for distributed cache coordination and routing. Some Protocol Independent Multicast routing protocols use rendezvous hashing to pick a rendezvous point. == Problem definition and approach == === Algorithm === Rendezvous hashing solves a general version of the distributed hash table problem: We are given a set of n {\displaystyle n} sites (servers or proxies, say). How can any set of clients, given an object O {\displaystyle O} , agree on a k-subset of sites to assign to O {\displaystyle O} ? The standard version of the problem uses k = 1. Each client is to make its selection independently, but all clients must end up picking the same subset of sites. This is non-trivial if we add a minimal disruption constraint, and require that when a site fails or is removed, only objects mapping to that site need be reassigned to other sites. The basic idea is to give each site S j {\displaystyle S_{j}} a score (a weight) for each object O i {\displaystyle O_{i}} , and assign the object to the highest scoring site. All clients first agree on a hash function h ( ⋅ ) {\displaystyle h(\cdot )} . For object O i {\displaystyle O_{i}} , the site S j {\displaystyle S_{j}} is defined to have weight w i , j = h ( O i , S j ) {\displaystyle w_{i,j}=h(O_{i},S_{j})} . Each client independently computes these weights w i , 1 , w i , 2 … w i , n {\displaystyle w_{i,1},w_{i,2}\dots w_{i,n}} and picks the k sites that yield the k largest hash values. The clients have thereby achieved distributed k {\displaystyle k} -agreement. If a site S {\displaystyle S} is added or removed, only the objects mapping to S {\displaystyle S} are remapped to different sites, satisfying the minimal disruption constraint above. The HRW assignment can be computed independently by any client, since it depends only on the identifiers for the set of sites S 1 , S 2 … S n {\displaystyle S_{1},S_{2}\dots S_{n}} and the object being assigned. HRW easily accommodates different capacities among sites. If site S k {\displaystyle S_{k}} has twice the capacity of the other sites, we simply represent S k {\displaystyle S_{k}} twice in the list, say, as S k , 1 , S k , 2 {\displaystyle S_{k,1},S_{k,2}} . Clearly, twice as many objects will now map to S k {\displaystyle S_{k}} as to the other sites. === Properties === Consider the simple version of the problem, with k = 1, where all clients are to agree on a single site for an object O. Approaching the problem naively, it might appear sufficient to treat the n sites as buckets in a hash table and hash the object name O into this table. Unfortunately, if any of the sites fails or is unreachable, the hash table size changes, forcing all objects to be remapped. This massive disruption makes such direct hashing unworkable. Under rendezvous hashing, however, clients handle site failures by picking the site that yields the next largest weight. Remapping is required only for objects currently mapped to the failed site, and disruption is minimal. Rendezvous hashing has the following properties: Low overhead: The hash function used is efficient, so overhead at the clients is very low. Load balancing: Since the hash function is randomizing, each of the n sites is equally likely to receive the object O. Loads are uniform across the sites. Site capacity: Sites with different capacities can be represented in the site list with multiplicity in proportion to capacity. A site with twice the capacity of the other sites will be represented twice in the list, while every other site is represented once. High hit rate: Since all clients agree on placing an object O into the same site SO, each fetch or placement of O into SO yields the maximum utility in terms of hit rate. The object O will always be found unless it is evicted by some replacement algorithm at SO. Minimal disruption: When a site fails, only the objects mapped to that site need to be remapped. Disruption is at the minimal possible level. Distributed k-agreement: Clients can reach distributed agreement on k sites simply by selecting the top k sites in the ordering. == O(log n) running time via skeleton-based hierarchical rendezvous hashing == The standard version of Rendezvous Hashing described above works quite well for moderate n, but when n {\displaystyle n} is extremely large, the hierarchical use of Rendezvous Hashing achieves O ( log ⁡ n ) {\displaystyle O(\log n)} running time. This approach creates a virtual hierarchical structure (called a "skeleton"), and achieves O ( log ⁡ n ) {\displaystyle O(\log n)} running time by applying HRW at each level while descending the hierarchy. The idea is to first choose some constant m {\displaystyle m} and organize the n {\displaystyle n} sites into c = ⌈ n / m ⌉ {\displaystyle c=\lceil n/m\rceil } clusters C 1 = { S 1 , S 2 … S m } , C 2 = { S m + 1 , S m + 2 … S 2 m } … {\displaystyle C_{1}=\left\{S_{1},S_{2}\dots S_{m}\right\},C_{2}=\left\{S_{m+1},S_{m+2}\dots S_{2m}\right\}\dots } Next, build a virtual hierarchy by choosing a constant f {\displaystyle f} and imagining these c {\displaystyle c} clusters placed at the leaves of a tree T {\displaystyle T} of virtual nodes, each with fanout f {\displaystyle f} . In the accompanying diagram, the cluster size is m = 4 {\displaystyle m=4} , and the skeleton fanout is f = 3 {\displaystyle f=3} . Assuming 108 sites (real nodes) for convenience, we get a three-tier virtual hierarchy. Since f = 3 {\displaystyle f=3} , each virtual node has a natural numbering in octal. Thus, the 27 virtual nodes at the lowest tier would be numbered 000 , 001 , 002 , . . . , 221 , 222 {\displaystyle 000,001,002,...,221,222} in octal (we can, of course, vary the fanout at each level - in that case, each node will be identified with the corresponding mixed-radix number). The easiest way to understand the virtual hierarchy is by starting at the top, and descending the virtual hierarchy. We successively apply Rendezvous Hashing to the set of virtual nodes at each level of the hierarchy, and descend the branch defined by the winning virtual node. We can in fact start at any level in the virtual hierarchy. Starting lower in the hierarchy requires more hashes, but may improve load distribution in the case of failures. For example, instead of applying HRW to all 108 real nodes in the diagram, we can first apply HRW to the 27 lowest-tier virtual nodes, selecting one. We then apply HRW to the four real nodes in its cluster, and choose the winning site. We only need 27 + 4 = 31 {\displaystyle 27+4=31} hashes, rather than 108. If we apply this method starting one level higher in the hierarchy, we would need 9 + 3 + 4 = 16 {\displaystyle 9+3+4=16} hashes to get to the winning site. The figure shows how, if we proceed starting from the root of the skeleton, we may successively choose the virtual nodes ( 2 ) 3 {\displaystyle (2)_{3}} , ( 20 ) 3 {\displaystyle (20)_{3}} , and ( 200 ) 3 {\displaystyle (200)_{3}} , and finally end up with site 74. The virtual hierarchy need not be stored, but can be created on demand, since the virtual nodes names are simply prefixes of base- f {\displaystyle f} (or mixed-radix) representations. We can easily create appropriately sorted strings from the digits, as required. In the example, we would be working with the strings 0 , 1 , 2 {\displaystyle 0,1,2} (at tier 1), 20 , 21 , 22 {\displaystyle 20,21,22} (at tier 2), and 200 , 201 , 202

    Read more →
  • Language model benchmark

    Language model benchmark

    A language model benchmark is a standardized test designed to evaluate the performance of language models on various natural language processing tasks. These tests are intended for comparing different models' capabilities in areas such as language understanding, generation, and reasoning. Benchmarks generally consist of a dataset and corresponding evaluation metrics. The dataset provides text samples and annotations, while the metrics measure a model's performance on tasks like answering questions, text classification, and machine translation. These benchmarks are developed and maintained by academic institutions, research organizations, and industry players to track progress in the field. In addition to accuracy, the metrics can include throughput, energy efficiency, bias, trust, and sustainability. == Overview == === Types === Benchmarks may be described by the following adjectives, not mutually exclusive: Classical: These tasks are studied in natural language processing, even before the advent of deep learning. Examples include the Penn Treebank for testing syntactic and semantic parsing, as well as bilingual translation benchmarked by BLEU scores. Question answering: These tasks have a text question and a text answer, often multiple-choice. They can be open-book or closed-book. Open-book QA resembles reading comprehension questions, with relevant passages included as annotation in the question, in which the answer appears. Closed-book QA includes no relevant passages. Closed-book QA is also called open-domain question-answering. Before the era of large language models, open-book QA was more common, and understood as testing information retrieval methods. Closed-book QA became common since GPT-2 as a method to measure knowledge stored within model parameters. Omnibus: An omnibus benchmark combines many benchmarks, often previously published. It is intended as an all-in-one benchmarking solution. Reasoning: These tasks are usually in the question-answering format, but are intended to be more difficult than standard question answering. Multimodal: These tasks require processing not only text, but also other modalities, such as images and sound. Examples include OCR and transcription. Agency: These tasks are for a language-model–based software agent that operates a computer for a user, such as editing images, browsing the web, etc. Adversarial: A benchmark is "adversarial" if the items in the benchmark are picked specifically so that certain models do badly on them. Adversarial benchmarks are often constructed after state of the art (SOTA) models have saturated (achieved 100% performance) a benchmark, to renew the benchmark. A benchmark is "adversarial" only at a certain moment in time, since what is adversarial may cease to be adversarial as newer SOTA models appear. Public/Private: A benchmark might be partly or entirely private, meaning that some or all of the questions are not publicly available. The idea is that if a question is publicly available, then it might be used for training, which would be "training on the test set" and invalidate the result of the benchmark. Usually, only the guardians of the benchmark have access to the private subsets, and to score a model on such a benchmark, one must send the model weights, or provide API access, to the guardians. The boundary between a benchmark and a dataset is not sharp. Generally, a dataset contains three "splits": training, test, and validation. Both the test and validation splits are essentially benchmarks. In general, a benchmark is distinguished from a test/validation dataset in that a benchmark is typically intended to be used to measure the performance of many different models that are not trained specifically for doing well on the benchmark, while a test/validation set is intended to be used to measure the performance of models trained specifically on the corresponding training set. In other words, a benchmark may be thought of as a test/validation set without a corresponding training set. Conversely, certain benchmarks may be used as a training set, such as the English Gigaword or the One Billion Word Benchmark, which in modern language is just the negative log-likelihood loss on a pretraining set with 1 billion words. Indeed, the distinction between benchmark and dataset in language models became sharper after the rise of the pretraining paradigm, whereby a model is first trained on massive, unlabeled datasets to learn general language patterns, syntax, and knowledge (pretraining), and the base model is then adapted to specific, downstream tasks using smaller, labeled datasets (fine-tuning). === Lifecycle === Generally, the life cycle of a benchmark consists of the following steps: Inception: A benchmark is published. It can be simply given as a demonstration of the power of a new model (implicitly) that others then picked up as a benchmark, or as a benchmark that others are encouraged to use (explicitly). Growth: More papers and models use the benchmark, and the performance on the benchmark grows. Maturity, degeneration or deprecation: A benchmark may be saturated, after which researchers move on to other benchmarks. Progress on the benchmark may also be neglected as the field moves to focus on other benchmarks. Renewal: A saturated benchmark can be upgraded to make it no longer saturated, allowing further progress. === Construction === Like datasets, benchmarks are typically constructed by several methods, individually or in combination: Web scraping: Ready-made question-answer pairs may be scraped online, such as from websites that teach mathematics and programming. Conversion: Items may be constructed programmatically from scraped web content, such as by blanking out named entities from sentences, and asking the model to fill in the blank. This was used for making the CNN/Daily Mail Reading Comprehension Task. Crowd sourcing: Items may be constructed by paying people to write them, such as on Amazon Mechanical Turk. This was used for making the MCTest. === Evaluation === Generally, benchmarks are fully automated. This limits the questions that can be asked. For example, with mathematical questions, "proving a claim" would be difficult to automatically check, while "calculate an answer with a unique integer answer" would be automatically checkable. With programming tasks, the answer can generally be checked by running unit tests, with an upper limit on runtime. The benchmark scores are of the following kinds: For multiple choice or cloze questions, common scores are accuracy (frequency of correct answer), precision, recall, F1 score, etc. pass@n: The model is given n {\displaystyle n} attempts to solve each problem. If any attempt is correct, the model earns a point. The pass@n score is the model's average score over all problems. k@n: The model makes n {\displaystyle n} attempts to solve each problem, but only k {\displaystyle k} attempts out of them are selected for submission. If any submission is correct, the model earns a point. The k@n score is the model's average score over all problems. cons@n: The model is given n {\displaystyle n} attempts to solve each problem. If the most common answer is correct, the model earns a point. The cons@n score is the model's average score over all problems. Here "cons" stands for "consensus" or "majority voting". The pass@n score can be estimated more accurately by making N > n {\displaystyle N>n} attempts, and use the unbiased estimator 1 − ( N − c n ) ( N n ) {\displaystyle 1-{\frac {\binom {N-c}{n}}{\binom {N}{n}}}} , where c {\displaystyle c} is the number of correct attempts. For less well-formed tasks, where the output can be any sentence, there are the following commonly used scores including BLEU ROUGE, METEOR, NIST, word error rate, LEPOR, CIDEr, and SPICE. === Issues === error: Some benchmark answers may be wrong. ambiguity: Some benchmark questions may be ambiguously worded. subjective: Some benchmark questions may not have an objective answer at all. This problem generally prevents creative writing benchmarks. Similarly, this prevents benchmarking writing proofs in natural language, though benchmarking proofs in a formal language is possible. open-ended: Some benchmark questions may not have a single answer of a fixed size. This problem generally prevents programming benchmarks from using more natural tasks such as "write a program for X", and instead uses tasks such as "write a function that implements specification X". inter-annotator agreement: Some benchmark questions may be not fully objective, such that even people would not agree with 100% on what the answer should be. This is common in natural language processing tasks, such as syntactic annotation. shortcut: Some benchmark questions may be easily solved by an "unintended" shortcut. For example, in the SNLI benchmark, having a negative word like "not" in the second sentence is a strong signal for the "Contradiction" category, regardless of what the se

    Read more →
  • Super column

    Super column

    A super column is a tuple (a pair) with a binary super column name and a value that maps it to many columns. They consist of a key–value pairs, where the values are columns. Theoretically speaking, super columns are (sorted) associative array of columns. Similar to a regular column family where a row is a sorted map of column names and column values, a row in a super column family is a sorted map of super column names that maps to column names and column values. A super column is part of a keyspace together with other super columns and column families, and columns. == Code example == Written in the JSON-like syntax, a super column definition can be like this: Where: "databases" are keyspace; "Cassandra" and "HBase" are rowKeys; "name" and "address" are super column names; "firstName", "city", "age", etc. are column names.

    Read more →
  • Geospatial metadata

    Geospatial metadata

    Geospatial metadata (also geographic metadata) is a type of metadata applicable to geographic data and information. Such objects may be stored in a geographic information system (GIS) or may simply be documents, data-sets, images or other objects, services, or related items that exist in some other native environment but whose features may be appropriate to describe in a (geographic) metadata catalog (may also be known as a data directory or data inventory). == Definition == ISO 19115:2013 "Geographic Information – Metadata" from ISO/TC 211, the industry standard for geospatial metadata, describes its scope as follows: [This standard] provides information about the identification, the extent, the quality, the spatial and temporal aspects, the content, the spatial reference, the portrayal, distribution, and other properties of digital geographic data and services. ISO 19115:2013 also provides for non-digital mediums: Though this part of ISO 19115 is applicable to digital data and services, its principles can be extended to many other types of resources such as maps, charts, and textual documents as well as non-geographic data. The U.S. Federal Geographic Data Committee (FGDC) describes geospatial metadata as follows: A metadata record is a file of information, usually presented as an XML document, which captures the basic characteristics of a data or information resource. It represents the who, what, when, where, why and how of the resource. Geospatial metadata commonly document geographic digital data such as Geographic Information System (GIS) files, geospatial databases, and earth imagery but can also be used to document geospatial resources including data catalogs, mapping applications, data models and related websites. Metadata records include core library catalog elements such as Title, Abstract, and Publication Data; geographic elements such as Geographic Extent and Projection Information; and database elements such as Attribute Label Definitions and Attribute Domain Values. == History == The growing appreciation of the value of geospatial metadata through the 1980s and 1990s led to the development of a number of initiatives to collect metadata according to a variety of formats either within agencies, communities of practice, or countries/groups of countries. For example, NASA's "DIF" metadata format was developed during an Earth Science and Applications Data Systems Workshop in 1987, and formally approved for adoption in 1988. Similarly, the U.S. FGDC developed its geospatial metadata standard over the period 1992–1994. The Spatial Information Council of Australia and New Zealand (ANZLIC), a combined body representing spatial data interests in Australia and New Zealand, released version 1 of its "metadata guidelines" in 1996. ISO/TC 211 undertook the task of harmonizing the range of formal and de facto standards over the approximate period 1999–2002, resulting in the release of ISO 19115 "Geographic Information – Metadata" in 2003 and a subsequent revision in 2013. As of 2011 individual countries, communities of practice, agencies, etc. have started re-casting their previously used metadata standards as "profiles" or recommended subsets of ISO 19115, occasionally with the inclusion of additional metadata elements as formal extensions to the ISO standard. The growth in popularity of Internet technologies and data formats, such as Extensible Markup Language (XML) during the 1990s led to the development of mechanisms for exchanging geographic metadata on the web. In 2004, the Open Geospatial Consortium released the current version (3.1) of Geography Markup Language (GML), an XML grammar for expressing geospatial features and corresponding metadata. With the growth of the Semantic Web in the 2000s, the geospatial community has begun to develop ontologies for representing semantic geospatial metadata. Some examples include the Hydrology and Administrative ontologies developed by the Ordnance Survey in the United Kingdom. == ISO 19115: Geographic information – Metadata == ISO 19115 is a standard of the International Organization for Standardization (ISO). The standard is part of the ISO geographic information suite of standards (19100 series). ISO 19115 and its parts define how to describe geographical information and associated services, including contents, spatial-temporal purchases, data quality, access and rights to use. The objective of this International Standard is to provide a clear procedure for the description of digital geographic data-sets so that users will be able to determine whether the data in a holding will be of use to them and how to access the data. By establishing a common set of metadata terminology, definitions and extension procedures, this standard promotes the proper use and effective retrieval of geographic data. ISO 19115 was revised in 2013 to accommodate growing use of the internet for metadata management, as well as add many new categories of metadata elements (referred to as codelists) and the ability to limit the extent of metadata use temporally or by user. == ISO 19139 Geographic information Metadata XML schema implementation == ISO 19139:2012 provides the XML implementation schema for ISO 19115 specifying the metadata record format and may be used to describe, validate, and exchange geospatial metadata prepared in XML. The standard is part of the ISO geographic information suite of standards (19100 series), and provides a spatial metadata XML (spatial metadata eXtensible Mark-up Language (smXML)) encoding, an XML schema implementation derived from ISO 19115, Geographic information – Metadata. The metadata includes information about the identification, constraint, extent, quality, spatial and temporal reference, distribution, lineage, and maintenance of the digital geographic data-set. == Metadata directories == Also known as metadata catalogues or data directories. (need discussion of, and subsections on GCMD, FGDC metadata gateway, ASDD, European and Canadian initiatives, etc. etc.) GIS Inventory – National GIS Inventory System which is maintained by the US-based National States Geographic Information Council (NSGIC) as a tool for the entire US GIS Community. Its primary purpose is to track data availability and the status of geographic information system (GIS) implementation in state and local governments to aid the planning and building of statewide spatial data infrastructures (SSDI). The Random Access Metadata for Online Nationwide Assessment (RAMONA) database is a critical component of the GIS Inventory. RAMONA moves its FGDC-compliant metadata (CSDGM Standard) for each data layer to a web folder and a Catalog Service for the Web (CSW) that can be harvested by Federal programs and others. This provides far greater opportunities for discovery of user information. The GIS Inventory website was originally created in 2006 by NSGIC under award NA04NOS4730011 from the Coastal Services Center, National Oceanic and Atmospheric Administration, U.S. Department of Commerce. The Department of Homeland Security has been the principal funding source since 2008 and they supported the development of the Version 5 during 2011/2012 under Order Number HSHQDC-11-P-00177. The Federal Emergency Management Agency and National Oceanic and Atmospheric Administration have provided additional resources to maintain and improve the GIS Inventory. Some US Federal programs require submission of CSDGM-Compliant Metadata for data created under grants and contracts that they issue. The GIS Inventory provides a very simple interface to create the required Metadata. GCMD - Global Change Master Directory's goal is to enable users to locate and obtain access to Earth science data sets and services relevant to global change and Earth science research. The GCMD database holds more than 20,000 descriptions of Earth science data sets and services covering all aspects of Earth and environmental sciences. ECHO - The EOS Clearing House (ECHO) is a spatial and temporal metadata registry, service registry, and order broker. It allows users to more efficiently search and access data and services through the Reverb Client or Application Programmer Interfaces (APIs). ECHO stores metadata from a variety of science disciplines and domains, totalling over 3400 Earth science data sets and over 118 million granule records. GoGeo - GoGeo is a service run by EDINA (University of Edinburgh) and is supported by Jisc. GoGeo allows users to conduct geographically targeted searches to discover geospatial datasets. GoGeo searches many data portals from the HE and FE community and beyond. GoGeo also allows users to create standards compliant metadata through its Geodoc metadata editor. == Geospatial metadata tools == There are many proprietary GIS or geospatial products that support metadata viewing and editing on GIS resources. For example, ESRI's ArcGIS Desktop, SOCET GXP, Autodesk's AutoCAD Map 3D 2008, Arcitecta's Mediaflux and Intergraph's Geo

    Read more →
  • BuildingSMART Data Dictionary

    BuildingSMART Data Dictionary

    buildingSMART Data Dictionary (bSDD) is a service provided by buildingSMART which offers free data dictionaries for the international standardization of construction planning. The structure of bSDD was defined by the Nonprofit organization Buildingsmart and is used to describe objects and their attributes in a BIM process. == Aim == The aim of bSDD is to enable architects and planners to exchange and share building data across different specialists and language boundaries and thus avoid misunderstandings caused by different interpretations of terms. The bSDD standard extends the more general IFC. Software developers can access and use the dictionaries. In May 2025 over 300 dictionaries are available, including IFC, extensions to it such as Airport Domain IFC extension module or classification systems like Uniclass. == Structure == The main structural parts of bSDD are: Dictionary: A dictionary is a collection of classes: Class: A class describes the various object types, such as Bag drop or Baggage conveyor in airport planning. A class contains properties: Property: A property describes a part of a class, e.g. color or weight. Related properties are organized in a group: GroupOfProperties: A group organizes related properties, e.g. environmental properties or electrical properties. == Creating and managing a directory == Every dictionary in bSDD must be published in the name of a registered organization. As soon as the content is activated, it receives an unchangeable URI. This means that the content remains permanently in bSDD and cannot be deleted - this ensures stable use of the dictionary. It is only possible to change the status to inactive if it is no longer to be used - however, the dictionary remains permanently.

    Read more →
  • Non-separable wavelet

    Non-separable wavelet

    Non-separable wavelets are multi-dimensional wavelets that are not directly implemented as tensor products of wavelets on some lower-dimensional space. They have been studied since 1992. They offer a few important advantages. Notably, using non-separable filters leads to more parameters in design, and consequently better filters. The main difference, when compared to the one-dimensional wavelets, is that multi-dimensional sampling requires the use of lattices (e.g., the quincunx lattice). The wavelet filters themselves can be separable or non-separable regardless of the sampling lattice. Thus, in some cases, the non-separable wavelets can be implemented in a separable fashion. Unlike separable wavelet, the non-separable wavelets are capable of detecting structures that are not only horizontal, vertical or diagonal (show less anisotropy). == Examples == Red-black wavelets Contourlets Shearlets Directionlets Steerable pyramids Non-separable schemes for tensor-product wavelets

    Read more →
  • Metadirectory

    Metadirectory

    A metadirectory system provides for the flow of data between one or more directory services and databases in order to maintain synchronization of that data. It is an important part of identity management systems. The data being synchronized typically are collections of entries that contain user profiles and possibly authentication or policy information. Most metadirectory deployments synchronize data into at least one LDAP-based directory server, to ensure that LDAP-based applications such as single sign-on and portal servers have access to recent data, even if the data is mastered in a non-LDAP data source. Metadirectory products support filtering and transformation of data in transit. Most identity management suites from commercial vendors include a metadirectory product, or a user provisioning product.

    Read more →
  • DIKW pyramid

    DIKW pyramid

    The DIKW pyramid (also known as the knowledge pyramid or information hierarchy) is a model describing relationships between data, information, knowledge and wisdom sometimes also stylized as a chain, refer to models of possible structural and functional relationships between a set of components—often four, data, information, knowledge, and wisdom. The concept has roots predating the 1980s. In the latter years of that decade, interest in the models grew after explicit presentations and discussions, including from Milan Zeleny, Russell Ackoff, and Robert W. Lucky. Subsequent important discussions extended along theoretical and practical lines into the coming decades. While debate continues as to actual meaning of the component terms of DIKW-type models, and the actual nature of their relationships—including occasional doubt being cast over any simple, linear, unidirectional model—even so they have become very popular visual representations in use by business, the military, and others. Among the academic and popular, not all versions of the DIKW-type models include all four components (earlier ones excluding data, later ones excluding or downplaying wisdom, and several including additional components (for instance Ackoff inserting "understanding" before and Zeleny adding "enlightenment" after the wisdom component). In addition, DIKW-type models are no longer always presented as pyramids, instead also as a chart or framework (e.g., by Zeleny), as flow diagrams (e.g., by Liew, and by Chisholm et al.), and sometimes as a continuum (e.g., by Choo et al.). == Short description == As Rowley noted in 2007, the DIKW model "is often quoted, or used implicitly, in definitions of data, information and knowledge in the information management, information systems and knowledge management literatures, but [as of that date] there ha[d] been limited direct discussion of the hierarchy". Reviews of textbooks and a survey of scholars in relevant fields indicate that there was not a consensus as to definitions used in the model as of that date, and as reviewed by Liew in that year, even less "in the description of the processes that transform components lower in the hierarchy into those above them". Zins work, published in 2007—from studies in 2003-2005 that documented "130 definitions of data, information, and knowledge formulated by 45 scholars", published in 2007—to suggest that the data–information–knowledge components of DIKW refer to a class of no less than five models, as a function of whether data, information, and knowledge are each conceived of as subjective, objective (what Zins terms, "universal" or "collective") or both. In Zins' usage, subjective and objective "are not related to arbitrariness and truthfulness, which are usually attached to the concepts of subjective knowledge and objective knowledge". Information science, Zins argues, studies data and information, but not knowledge, as knowledge is an internal (subjective) rather than an external (universal–collective) phenomenon. == Representations == === Graphical representation === DIKW is a hierarchical model often depicted as a pyramid, sometimes as a chain, with data at its base and wisdom at its apex (or chain-beginning and -end). Both Zeleny and Ackoff have been credited with originating the pyramid representation, although neither used a pyramid to present their ideas. According to Wallace, Debons and colleagues may have been the first to "present the hierarchy graphically". Many variations of the DIKW-type pyramid have been produced. One, in use by knowledge managers in the United States Department of Defense, attempts to show the DIKW progression to enable effective decisions and consequent activities supporting shared understanding throughout defense organizations, as well as supporting management of risks associated with decisions. DIKW-type hierarchical information paradigms have also been represented as two-dimensional charts, and as flow diagrams, where relationships between the components may be presented less hierarchically, with defining aspects of the relationships, feedback loops, etc. === Computational representation === Intelligent decision support systems are trying to improve decision making by introducing new technologies and methods from the domain of modeling and simulation in general, and in particular from the domain of intelligent software agents in the contexts of agent-based modeling. The following example describes a military decision support system, but the architecture and underlying conceptual idea are transferable to other application domains: The value chain starts with data quality describing the information within the underlying command and control systems. Information quality tracks the completeness, correctness, currency, consistency and precision of the data items and information statements available. Knowledge quality deals with procedural knowledge and information embedded in the command and control system such as templates for adversary forces, assumptions about entities such as ranges and weapons, and doctrinal assumptions, often coded as rules. Awareness quality measures the degree of using the information and knowledge embedded within the command and control system. Awareness is explicitly placed in the cognitive domain. By the introduction of a common operational picture, data are put into context, which leads to information instead of data. The next step, which is enabled by service-oriented web-based infrastructures (but not yet operationally used), is the use of models and simulations for decision support. Simulation systems are the prototype for procedural knowledge, which is the basis for knowledge quality. Finally, using intelligent software agents to continually observe the battle sphere, apply models and simulations to analyze what is going on, to monitor the execution of a plan, and to do all the tasks necessary to make the decision maker aware of what is going on, command and control systems could even support situational awareness, the level in the value chain traditionally limited to pure cognitive methods. == History == Danny P. Wallace, a professor of library and information science, explained that the origin of the DIKW pyramid is uncertain: The presentation of the relationships among data, information, knowledge, and sometimes wisdom in a hierarchical arrangement has been part of the language of information science for many years. Although it is uncertain when and by whom those relationships were first presented, the ubiquity of the notion of a hierarchy is embedded in the use of the acronym DIKW as a shorthand representation for the data-to-information-to-knowledge-to-wisdom transformation.Many authors think that the idea of the DIKW relationship originated from two lines in the poem "Choruses", by T. S. Eliot, that appeared in the pageant play The Rock, in 1934: === Knowledge, intelligence, and wisdom === In 1927, Clarence W. Barron addressed his employees at Dow Jones & Company on the hierarchy: "Knowledge, Intelligence and Wisdom". === Data, information, knowledge === In 1955, English-American economist and educator Kenneth Boulding presented a variation on the hierarchy consisting of "signals, messages, information, and knowledge". However, "[t]he first author to distinguish among data, information, and knowledge and to also employ the term 'knowledge management' may have been American educator Nicholas L. Henry", in a 1974 journal article. === Data, information, knowledge, wisdom === Other early versions (prior to 1982) of the hierarchy that refer to a data tier include those of Chinese-American geographer Yi-Fu Tuan and sociologist-historian Daniel Bell.. In 1980, Irish-born engineer Mike Cooley invoked the same hierarchy in his critique of automation and computerization, in his book Architect or Bee?: The Human / Technology Relationship. Thereafter, in 1987, Czechoslovakia-born educator Milan Zeleny mapped the components of the hierarchy to knowledge forms: know-nothing, know-what, know-how, and know-why. Zeleny "has frequently been credited with proposing the [representation of DIKW as a pyramid ]... although he actually made no reference to any such graphical model." The hierarchy appears again in a 1988 address to the International Society for General Systems Research, by American organizational theorist Russell Ackoff, published in 1989. Subsequent authors and textbooks cite Ackoff's as the "original articulation" of the hierarchy or otherwise credit Ackoff with its proposal. Ackoff's version of the model includes an understanding tier (as Adler had, before him), interposed between knowledge and wisdom. Although Ackoff did not present the hierarchy graphically, he has also been credited with its representation as a pyramid. In 1989, Bell Labs veteran Robert W. Lucky wrote about the four-tier "information hierarchy" in the form of a pyramid in his book Silicon Dreams. In the same year as Ackoff presented his a

    Read more →