AI Content Humanizer

AI Content Humanizer — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Synthesia (company)

    Synthesia (company)

    Synthesia Limited is a British multinational artificial intelligence company based in London, United Kingdom. It is a synthetic media-generation software developer and creator of AI-generated video content, including audio-visual agents and cloned avatars. Britain's largest generative-AI firm, it is used by 70% of FTSE 100 and over 90% of Fortune 100 companies. == Overview == Synthesia is most often used by corporations for localized communication, orientation, employee training videos, advertising campaigns, reporting, product demonstrations, customer service, and to create chatbots. Its software algorithm mimics speech and facial movements based on video recordings of an individual’s speech and facial expressions. From this, a text-to-speech video is created to look and sound like the individual. Swiss bank UBS incorporated Synthesia AI-powered avatars of their human financial experts, for instance, in 2025. Users create content via the platform's pre-generated AI presenters or by creating digital representations of themselves, or personal avatars, using the platform's AI video editing tool. These avatars can be used to narrate videos generated from text. As of August 2021, Synthesia's voice database included multiple gender options in over 60 languages. Its free voice library doubled by 2025, to 140 languages and accents, and its Express-Voice technology can clone a user's own voice, or generate a synthetic one. === Deepfakes === The platform prohibits use of its software to create non-consensual clones, including of celebrities or political figures for satirical purposes. Explicit consent must be provided in addition to a strict pre-screening regimen for use of an individual's likeness to avoid “deepfaking”. While the company prohibits use of its technology for misinformation or "news-like content", an October 2023 Freedom House report stated that Synthesia tools had been used by governments in Venezuela, China, Burkina Faso, and Russia to create videos of fake TV news outlets with AI-generated avatars in order to spread propaganda. Actor Dan Dewhirst signed a contract with the company in 2021, becoming one of the first actors whose likeness would be made into an AI avatar, finding his likeness used in the Venezuelan generated-videos. The company stated, in February 2024, that it had improved its misuse detection systems, and, in April 2024, that new users of its technology are screened by the company, and content employing it is further vetted by Synthesia moderators. == History == Synthesia's software utilizes deep learning architecture developed by Lourdes Agapito and Matthias Niessner. The company was co-founded in 2017 by Agapito, Niessner, Victor Riparbelli, and Steffen Tjerrild. In 2018, the company first demonstrated the software's capabilities on the BBC programme Click when it presented a digitization of Matthew Amroliwala speaking Spanish, Mandarin, and Hindi. Through Synthesia's first two years of existence, it employed 10 people and struggled to make sales, leading to an expansion of the company's focus. It moved on from just targeting entertainment studios to a variety of businesses. In 2020, Synthesia users were reported to include Amazon, Tiffany & Co. and IHG Hotels & Resorts. In January 2024, the company introduced its AI video assistant, which turns text-to-video. That April, with a reported 55,000 customers, including half of the Fortune 100, Synthesia launched "expressive avatars". That September, an enhanced dubbing feature was launched, to translate video in 30 languages with naturalized lip-syncing. Peter Hill joined Synthesia as CTO in January 2025, following 25 years at Amazon, and two years as CEO and CPO of Wildfire Studios. That March, a million dollar base of shares was formed to furnish human actors, employed to generate digital avatars, with company stock, which all of its employees hold. By June of that year, 150,000 individuals from among Synthesia's 65,000 customers had created AI-generated avatars of themselves. In July 2025, the company's new global headquarters at Regent’s Place was opened by London mayor Sadiq Khan, who described Britain's largest generative-AI company, then valued at over $2 billion, as a "London success story". By that October, its technology was employed by 90% of the Fortune 100, and Synthesia 3.0 was launched, with hyper-realistic digital avatars equipped with AI-powered dubbing and translation, and a built-in video assistant. In January 2026, it reached a $4 billion valuation, with 70% of FTSE 100 companies noted among its customers. === Funding === The company raised $3.1 million in seed funding in 2019. In April 2021, the company raised $12.5 million in Series A funding. In December 2021, it raised $50 million in a Series B funding round led by Kleiner Perkins and GV (then Google Ventures). Synthesia gained a total valuation of $1 billion, and achieved unicorn status, when it raised $90 million from Accel and Nvidia partnership NVentures, in June 2023, during its Series C funding round. Counting 60,000 customers by January 2025, including over 60% of Fortune 100 companies; the company raised $180 million in a Series D round led by NEA, with new investors World Innovation Lab (WiL), Atlassian Ventures and PSP Growth, as well as existing investors GV, MMC Ventures and FirstMark, doubling Synthesia's valuation to $2.1 billion. Capital raised by 2025 had reached $330 million, with investments slated to further product innovation, talent growth, and company expansion in North America, Europe, Japan and Australia. In April 2025, Adobe Inc. invested £10 million in the company for a strategic partnership. Synthesia subsequently rejected a $3 billion acquisition offer from Adobe, choosing to remain independent. With a revenue stream then exceeding $100 million annually; GV led a Series E funding round in October 2025, resulting in Synthesia's $4 billion valuation, raising $200 million from GV, Nvidia and Accel to develop, in 2026, interactive audio-visual avatar "agents" that converse on topic, for automated sales training and corporate communications, such as recruiting. == Recognition == In 2021, Synthesia partnered with Lay's to create the Messi Messages campaign featuring Argentine footballer Lionel Messi. Users created personalized messages with Synthesia's software and sent custom artificial reality video messages from Messi based on their text input. The campaign received a Cannes Lion Award under the Bronze category. In February 2025, UK Science and Technology Minister Peter Kyle commended Synthesia's "pioneering generative AI innovations."

    Read more →
  • Scriptella

    Scriptella

    Scriptella is an open source extract transform load (ETL) and script execution tool written in Java. It allows the use of SQL or another scripting language suitable for the data source to perform required transformations. Scriptella does not offer any graphical user interface. == Typical use == Database migration. Database creation/update scripts. Cross-database ETL operations, import/export. Alternative for Ant task. Automated database schema upgrade. == Features == Simple XML syntax for scripts. Add dynamics to your existing SQL scripts by creating a thin wrapper XML file: Support for multiple datasources (or multiple connections to a single database) in an ETL file. Support for many useful JDBC features, e.g. parameters in SQL including file blobs and JDBC escaping. Performance and low memory usage are one of the primary goals. Support for evaluated expressions and properties (JEXL syntax) Support for cross-database ETL scripts by using elements Transactional execution Error handling via elements Conditional scripts/queries execution (similar to Ant if/unless attributes but more powerful) Easy-to-Use as a standalone tool or Ant task, without deployment or installation. Easy-To-Run ETL files directly from Java code. Built-in adapters for popular databases for a tight integration. Support for any database with JDBC/ODBC compliant driver. Service Provider Interface (SPI) for interoperability with non-JDBC DataSources and integration with scripting languages. Out of the box support for JSR 223 (Scripting for the Java Platform) compatible languages. Built-in CSV, TEXT, XML, LDAP, Lucene, Velocity, JEXL and Janino providers. Integration with Java EE, Spring Framework, JMX and JNDI for enterprise ready scripts.

    Read more →
  • Synthetic data

    Synthetic data

    Synthetic data are artificially generated data not produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models. Data generated by a computer simulation can be seen as synthetic data. This encompasses most applications of physical modeling, such as music synthesizers or flight simulators. The output of such systems approximates the real thing, but is fully algorithmically generated. Synthetic data is used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. In many sensitive applications, datasets theoretically exist but cannot be released to the general public; synthetic data sidesteps the privacy issues that arise from using real consumer information without permission or compensation. == Usefulness == Synthetic data is generated to meet specific needs or certain conditions that may not be found in the original, real data. One of the hurdles in applying up-to-date machine learning approaches for complex scientific tasks is the scarcity of labeled data, a gap effectively bridged by the use of synthetic data, which closely replicates real experimental data. This can be useful when designing many systems, from simulations based on theoretical value, to database processors, etc. This helps detect and solve unexpected issues such as information processing limitations. Synthetic data are often generated to represent the authentic data and allows a baseline to be set. Another benefit of synthetic data is to protect the privacy and confidentiality of authentic data, while still allowing for use in testing systems. Computer security experts claim generated synthetic data "... enables us to create realistic behavior profiles for users and attackers. The data is used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment." In defense and military contexts, synthetic data is seen as a potentially valuable tool to develop and improve complex AI systems, particularly in contexts where high-quality real-world data is scarce. At the same time, synthetic data together with the testing approach can give the ability to model real-world scenarios. == History == Scientific modelling of physical systems has a long history that runs concurrent with the history of physics. For example, research into synthesis of audio and voice can be traced back to the 1930s and before, driven forward by the developments of the telephone and audio recording technologies. Digitization gave rise to software synthesizers from the 1970s onwards. In the context of privacy-preserving statistical analysis, in 1993, the idea of original fully synthetic data was created by Donald Rubin. Rubin originally designed this to synthesize the Decennial Census long form responses for the short form households. He then released samples that did not include any actual long form records - in this he preserved anonymity of the household. Later that year, the idea of original partially synthetic data was created by Little. Little used this idea to synthesize the sensitive values on the public use file. A 1993 work fitted a statistical model to 60,000 MNIST digits, then it was used to generate over 1 million examples. Those were used to train a LeNet-4 to reach state of the art performance. In 1994, Stephen Fienberg introduced 'critical refinement', in which a parametric posterior predictive distribution (instead of a Bayes bootstrap) is used to do the sampling. Later, other important contributors to the development of synthetic data generation were Trivellore Raghunathan, Jerry Reiter, Donald Rubin, John M. Abowd, and Jim Woodcock. Collectively they came up with a solution for how to treat partially synthetic data with missing data. Similarly, they developed the technique of Sequential Regression Multivariate Imputation. == Calculations == Researchers test the framework on synthetic data, which is "the only source of ground truth on which they can objectively assess the performance of their algorithms". Synthetic data can be generated through the use of random lines, having different orientations and starting positions. Datasets can get fairly complicated. A more complicated dataset can be generated by using a synthesizer build. To create a synthesizer build, first use the original data to create a model or equation that fits the data the best. This model or equation will be called a synthesizer build. This build can be used to generate more data. Constructing a synthesizer build involves constructing a statistical model. In a linear regression line example, the original data can be plotted, and a best fit linear line can be created from the data. This line is a synthesizer created from the original data. The next step will be generating more synthetic data from the synthesizer build or from this linear line equation. In this way, the new data can be used for studies and research, and it protects the confidentiality of the original data. David Jensen from the Knowledge Discovery Laboratory explains how to generate synthetic data: "Researchers frequently need to explore the effects of certain data characteristics on their data model." To help construct datasets exhibiting specific properties, such as auto-correlation or degree disparity, proximity can generate synthetic data having one of several types of graph structure: random graphs that are generated by some random process; lattice graphs having a ring structure; lattice graphs having a grid structure, etc. In all cases, the data generation process follows the same process: Generate the empty graph structure. Generate attribute values based on user-supplied prior probabilities. Since the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively. == Applications == === Fraud detection and confidentiality systems === Testing and training fraud detection and confidentiality systems are devised using synthetic data. Specific algorithms and generators are designed to create realistic data, which then assists in teaching a system how to react to certain situations or criteria. For example, intrusion detection software is tested using synthetic data. This data is a representation of the authentic data and may include intrusion instances that are not found in the authentic data. The synthetic data allows the software to recognize these situations and react accordingly. If synthetic data was not used, the software would only be trained to react to the situations provided by the authentic data and it may not recognize another type of intrusion. === Scientific research === Researchers doing clinical trials or any other research may generate synthetic data to aid in creating a baseline for future studies and testing. Real data can contain information that researchers may not want released, so synthetic data is sometimes used to protect the privacy and confidentiality of a dataset. Using synthetic data reduces confidentiality and privacy issues since it holds no personal information and cannot be traced back to any individual. Beyond privacy protection, synthetic data is also being explored for methodological innovation in drug development. For instance, synthetic data may be used to construct synthetic control arms as an alternative to conventional external control arms based on real-world data (RWD) or randomized controlled trials (RCTs). Collectively, regulatory agencies such as the FDA and EMA appear to be at various stages of recognizing and integrating AI-generated synthetic data into their methodologies. While there is growing consensus on the potential of such data to support model development and the broader lifecycle of medicinal products, to date no drug or medical device has been approved using solely or predominantly synthetic data—particularly not as a comparator arm generated entirely via data-driven algorithms. The quality and statistical handling of synthetic data are expected to become more prominent in future regulatory discussions, particularly in contexts such as predictive modeling (e.g., digital twins), where innovative approaches have already been referenced. === Machine learning === Synthetic data is increasingly being used for machine learning applications: a model is trained on a synthetically generated dataset with the intention of transfer learning to real data. Efforts have been made to enable more data science experiments via the construction of general-purpose synthetic data generators, such as the Synthetic Data Vault. In general, synthetic data has several natural advantages: once the synthetic environment is ready, it is fast and cheap to produce as much data as needed; synthetic data can have perfectly accurate labels, including labeling that may be very expensive or impo

    Read more →
  • SQL/PSM

    SQL/PSM

    SQL/PSM (SQL/Persistent Stored Modules) is an ISO standard mainly defining an extension of SQL with a procedural language for use in stored procedures. Initially published in 1996 as an extension of SQL-92 (ISO/IEC 9075-4:1996, a version sometimes called PSM-96 or even SQL-92/PSM), SQL/PSM was later incorporated into the multi-part SQL:1999 standard, and has been part 4 of that standard since then, most recently in SQL:2023. The SQL:1999 part 4 covered less than the original PSM-96 because the SQL statements for defining, managing, and invoking routines were actually incorporated into part 2 SQL/Foundation, leaving only the procedural language itself as SQL/PSM. The SQL/PSM facilities are still optional as far as the SQL standard is concerned; most of them are grouped in Features P001-P008. SQL/PSM standardizes syntax and semantics for control flow, exception handling (called "condition handling" in SQL/PSM), local variables, assignment of expressions to variables and parameters, and (procedural) use of cursors. It also defines an information schema (metadata) for stored procedures. SQL/PSM is one language in which methods for the SQL:1999 structured types can be defined. The other is Java, via SQL/JRT. SQL/PSM is derived, seemingly directly, from Oracle's PL/SQL. Oracle developed PL/SQL and released it in 1991, basing the language on the US Department of Defense's Ada programming language. However, Oracle has maintained a distance from the standard in its documentation. IBM's SQL PL (used in DB2) and Mimer SQL's PSM were the first two products officially implementing SQL/PSM. It is commonly thought that these two languages, and perhaps also MySQL/MariaDB's procedural language, are closest to the SQL/PSM standard. However, a PostgreSQL addon implements SQL/PSM (alongside its other procedural languages like the PL/SQL-derived plpgsql), although it is not part of the core product. RDF functionality in OpenLink Virtuoso was developed entirely through SQL/PSM, combined with custom datatypes (e.g., ANY for handling URI and Literal relation objects), sophisticated indexing, and flexible physical storage choices (column-wise or row-wise).

    Read more →
  • Sample complexity

    Sample complexity

    The sample complexity of a machine learning algorithm represents the number of training-samples that it needs in order to successfully learn a target function. More precisely, the sample complexity is the number of training-samples that we need to supply to the algorithm, so that the function returned by the algorithm is within an arbitrarily small error of the best possible function, with probability arbitrarily close to 1. There are two variants of sample complexity: The weak variant fixes a particular input-output distribution; The strong variant takes the worst-case sample complexity over all input-output distributions. The No free lunch theorem, discussed below, proves that, in general, the strong sample complexity is infinite, i.e. that there is no algorithm that can learn the globally-optimal target function using a finite number of training samples. However, if we are only interested in a particular class of target functions (e.g., only linear functions) then the sample complexity is finite, and it depends linearly on the VC dimension on the class of target functions. == Definition == Let X {\displaystyle X} be a space which we call the input space, and Y {\displaystyle Y} be a space which we call the output space, and let Z {\displaystyle Z} denote the product X × Y {\displaystyle X\times Y} . For example, in the setting of binary classification, X {\displaystyle X} is typically a finite-dimensional vector space and Y {\displaystyle Y} is the set { − 1 , 1 } {\displaystyle \{-1,1\}} . Fix a hypothesis space H {\displaystyle {\mathcal {H}}} of functions h : X → Y {\displaystyle h\colon X\to Y} . A learning algorithm over H {\displaystyle {\mathcal {H}}} is a computable map from Z {\displaystyle Z} to H {\displaystyle {\mathcal {H}}} . In other words, it is an algorithm that takes as input a finite sequence of training samples and outputs a function from X {\displaystyle X} to Y {\displaystyle Y} . Typical learning algorithms include empirical risk minimization, without or with Tikhonov regularization. Fix a loss function L : Y × Y → R ≥ 0 {\displaystyle {\mathcal {L}}\colon Y\times Y\to \mathbb {R} _{\geq 0}} , for example, the square loss L ( y , y ′ ) = ( y − y ′ ) 2 {\displaystyle {\mathcal {L}}(y,y')=(y-y')^{2}} , where h ( x ) = y ′ {\displaystyle h(x)=y'} . For a given distribution ρ {\displaystyle \rho } on X × Y {\displaystyle X\times Y} , the expected risk of a hypothesis (a function) h ∈ H {\displaystyle h\in {\mathcal {H}}} is E ( h ) := E ρ [ L ( h ( x ) , y ) ] = ∫ X × Y L ( h ( x ) , y ) d ρ ( x , y ) {\displaystyle {\mathcal {E}}(h):=\mathbb {E} _{\rho }[{\mathcal {L}}(h(x),y)]=\int _{X\times Y}{\mathcal {L}}(h(x),y)\,d\rho (x,y)} In our setting, we have h = A ( S n ) {\displaystyle h={\mathcal {A}}(S_{n})} , where A {\displaystyle {\mathcal {A}}} is a learning algorithm and S n = ( ( x 1 , y 1 ) , … , ( x n , y n ) ) ∼ ρ n {\displaystyle S_{n}=((x_{1},y_{1}),\ldots ,(x_{n},y_{n}))\sim \rho ^{n}} is a sequence of vectors which are all drawn independently from ρ {\displaystyle \rho } . Define the optimal risk E H ∗ = inf h ∈ H E ( h ) . {\displaystyle {\mathcal {E}}_{\mathcal {H}}^{}={\underset {h\in {\mathcal {H}}}{\inf }}{\mathcal {E}}(h).} Set h n = A ( S n ) {\displaystyle h_{n}={\mathcal {A}}(S_{n})} , for each sample size n {\displaystyle n} . h n {\displaystyle h_{n}} is a random variable and depends on the random variable S n {\displaystyle S_{n}} , which is drawn from the distribution ρ n {\displaystyle \rho ^{n}} . The algorithm A {\displaystyle {\mathcal {A}}} is called consistent if E ( h n ) {\displaystyle {\mathcal {E}}(h_{n})} probabilistically converges to E H ∗ {\displaystyle {\mathcal {E}}_{\mathcal {H}}^{}} . In other words, for all ϵ , δ > 0 {\displaystyle \epsilon ,\delta >0} , there exists a positive integer N {\displaystyle N} , such that, for all sample sizes n ≥ N {\displaystyle n\geq N} , we have Pr ρ n [ E ( h n ) − E H ∗ ≥ ε ] < δ . {\displaystyle \Pr _{\rho ^{n}}[{\mathcal {E}}(h_{n})-{\mathcal {E}}_{\mathcal {H}}^{}\geq \varepsilon ]<\delta .} The sample complexity of A {\displaystyle {\mathcal {A}}} is then the minimum N {\displaystyle N} for which this holds, as a function of ρ , ϵ {\displaystyle \rho ,\epsilon } , and δ {\displaystyle \delta } . We write the sample complexity as N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} to emphasize that this value of N {\displaystyle N} depends on ρ , ϵ {\displaystyle \rho ,\epsilon } , and δ {\displaystyle \delta } . If A {\displaystyle {\mathcal {A}}} is not consistent, then we set N ( ρ , ϵ , δ ) = ∞ {\displaystyle N(\rho ,\epsilon ,\delta )=\infty } . If there exists an algorithm for which N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} is finite, then we say that the hypothesis space H {\displaystyle {\mathcal {H}}} is learnable. In others words, the sample complexity N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} defines the rate of consistency of the algorithm: given a desired accuracy ϵ {\displaystyle \epsilon } and confidence δ {\displaystyle \delta } , one needs to sample N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} data points to guarantee that the risk of the output function is within ϵ {\displaystyle \epsilon } of the best possible, with probability at least 1 − δ {\displaystyle 1-\delta } . In probably approximately correct (PAC) learning, one is concerned with whether the sample complexity is polynomial, that is, whether N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} is bounded by a polynomial in 1 / ϵ {\displaystyle 1/\epsilon } and 1 / δ {\displaystyle 1/\delta } . If N ( ρ , ϵ , δ ) {\displaystyle N(\rho ,\epsilon ,\delta )} is polynomial for some learning algorithm, then one says that the hypothesis space H {\displaystyle {\mathcal {H}}} is PAC-learnable. This is a stronger notion than being learnable. == Unrestricted hypothesis space: infinite sample complexity == One can ask whether there exists a learning algorithm so that the sample complexity is finite in the strong sense, that is, there is a bound on the number of samples needed so that the algorithm can learn any distribution over the input-output space with a specified target error. More formally, one asks whether there exists a learning algorithm A {\displaystyle {\mathcal {A}}} , such that, for all ϵ , δ > 0 {\displaystyle \epsilon ,\delta >0} , there exists a positive integer N {\displaystyle N} such that for all n ≥ N {\displaystyle n\geq N} , we have sup ρ ( Pr ρ n [ E ( h n ) − E H ∗ ≥ ε ] ) < δ , {\displaystyle \sup _{\rho }\left(\Pr _{\rho ^{n}}[{\mathcal {E}}(h_{n})-{\mathcal {E}}_{\mathcal {H}}^{}\geq \varepsilon ]\right)<\delta ,} where h n = A ( S n ) {\displaystyle h_{n}={\mathcal {A}}(S_{n})} , with S n = ( ( x 1 , y 1 ) , … , ( x n , y n ) ) ∼ ρ n {\displaystyle S_{n}=((x_{1},y_{1}),\ldots ,(x_{n},y_{n}))\sim \rho ^{n}} as above. The No Free Lunch Theorem says that without restrictions on the hypothesis space H {\displaystyle {\mathcal {H}}} , this is not the case, i.e., there always exist "bad" distributions for which the sample complexity is arbitrarily large. Thus, in order to make statements about the rate of convergence of the quantity sup ρ ( Pr ρ n [ E ( h n ) − E H ∗ ≥ ε ] ) , {\displaystyle \sup _{\rho }\left(\Pr _{\rho ^{n}}[{\mathcal {E}}(h_{n})-{\mathcal {E}}_{\mathcal {H}}^{}\geq \varepsilon ]\right),} one must either constrain the space of probability distributions ρ {\displaystyle \rho } , e.g. via a parametric approach, or constrain the space of hypotheses H {\displaystyle {\mathcal {H}}} , as in distribution-free approaches. == Restricted hypothesis space: finite sample-complexity == The latter approach leads to concepts such as VC dimension and Rademacher complexity which control the complexity of the space H {\displaystyle {\mathcal {H}}} . A smaller hypothesis space introduces more bias into the inference process, meaning that E H ∗ {\displaystyle {\mathcal {E}}_{\mathcal {H}}^{}} may be greater than the best possible risk in a larger space. However, by restricting the complexity of the hypothesis space it becomes possible for an algorithm to produce more uniformly consistent functions. This trade-off leads to the concept of regularization. It is a theorem from VC theory that the following three statements are equivalent for a hypothesis space H {\displaystyle {\mathcal {H}}} : H {\displaystyle {\mathcal {H}}} is PAC-learnable. The VC dimension of H {\displaystyle {\mathcal {H}}} is finite. H {\displaystyle {\mathcal {H}}} is a uniform Glivenko-Cantelli class. This gives a way to prove that certain hypothesis spaces are PAC learnable, and by extension, learnable. === An example of a PAC-learnable hypothesis space === X = R d , Y = { − 1 , 1 } {\displaystyle X=\mathbb {R} ^{d},Y=\{-1,1\}} , and let H {\displaystyle {\mathcal {H}}} be the space of affine functions on X {\displaystyle X} , that is, functions of the form x ↦ ⟨ w , x ⟩ + b {\displaystyle x\mapsto \langl

    Read more →
  • Library and information scientist

    Library and information scientist

    A library and information scientist, also known as a library scholar, is a researcher or academic who specializes in the field of library and information science and often participates in scholarly writing about and related to library and information science. A library and information scientist is neither limited to any one subfield of library and information science nor any one particular type of library. These scientists come from all information-related sectors including library and book history. == University of Chicago Graduate Library School == The University of Chicago Graduate Library School was established in 1928 to grant a graduate degree in librarianship with an emphasis on research. The program expanded the concept of librarianship, focused on scientific inquiry and established it as a domain for scientific study. In The Spirit of Inquiry: The Graduate Library School at Chicago, 1921-51 Richardson reviewed the history of the School and its impact on the discipline. == Bibliometric mappings == Bibliometric methods have been used to create maps of library and information science, thus identifying the most important researchers as well as their relative connections (or distances) and identifying emerging trends related to LIS publications within the field. White and McCain (1998) made a map of information science and Åström (2002), Chen, Ibekwe-SanJuan, and Hou (2010), Janssens, Leta, Glanzel, and De Moor (2006), and Zhao and Strotmann (2008) constructed some later maps of library and information science. Jabeen, Yun, Rafiq, and Jabeen (2015) mapped the growth and trends of LIS publications. == Notable library and information scientists == See also Beta Phi Mu Award, Award of Merit - Association for Information Science and Technology, Justin Winsor Prize (library)

    Read more →
  • Computer and information science

    Computer and information science

    Computer and information science (CIS; also known as information and computer science) is a field that emphasizes both computing and informatics, upholding the strong association between the fields of information sciences and computer sciences and treating computers as a tool rather than a field. Information science is one with a long history, unlike the relatively very young field of computer science, and is primarily concerned with gathering, storing, disseminating, sharing and protecting any and all forms of information. It is a broad field, covering a myriad of different areas but is often referenced alongside computer science because of the incredibly useful nature of computers and computer programs in helping those studying and doing research in the field – particularly in helping to analyse data and in spotting patterns too broad for a human to intuitively perceive. While information science is sometimes confused with information theory, the two have vastly different subject matter. Information theory focuses on one particular mathematical concept of information while information science is focused on all aspects of the processes and techniques of information. Computer science, in contrast, is less focused on information and its different states, but more, in a very broad sense, on the use of computers – both in theory and practice – to design and implement algorithms in order to aid the processing of information during the different states described above. It has strong foundations in the field of mathematics, as the very first recognised practitioners of the field were renowned mathematicians such as Alan Turing. Information science and computing began to converge in the 1950s and 1960s, as information scientists started to realize the many ways computers would improve information storage and retrieval. == Terminology == Due to the distinction between computers and computing, some of the research groups refer to computing or datalogy. The French refer to computer science as the term informatique. The term information and communications technology (ICT), refers to how humans communicate with using machines and computers, making a distinction from information and computer science, which is how computers use and gain information. Informatics is also distinct from computer science, which encompasses the study of logic and low-level computing issues. == Education == Universities may confer degrees with a major in computer and information science, not to be confused with a more specific Bachelor of Computer Science or respective graduate computer science degrees. The QS World University Rankings is one of the most widely recognised and distinguished university comparisons. They ranked the top 10 universities for computer science and information systems in 2015. They are: Massachusetts Institute of Technology (MIT) Stanford University University of Oxford Carnegie Mellon University Harvard University University of California, Berkeley (UCB) University of Cambridge The Hong Kong University of Science and Technology Swiss Federal Institute of Technology (ETH Zurich) Princeton University A Computer Information Science degree gives students both network and computing knowledge which is needed to design, develop, and assist information systems which helps to solve business problems and to support business problems and to support business operations and decision making at a managerial level also. == Areas of information and computer science == Due to the nature of this field, many topics are also shared with computer science and information systems. The discipline of Information and Computer Science spans a vast range of areas from basic computer science theory (algorithms and computational logic) to in depth analysis of data manipulation and use within technology. === Programming theory === The process of taking a given algorithm and encoding it into a language that can be understood and executed by a computer. There are many different types of programming languages and various different types of computers, however, they all have the same goal: to turn algorithms into machine code. Popular programming languages used within the academic study of CIS include, but are not limited to: Java, Python, C#, C++, Perl, Ruby, Pascal, Swift, Visual Basic. === Information and information systems === The academic study of software and hardware systems that process large quantities and data, support large scale data management and how data can be used. This is where the field is unique from the standard study of computer science. The area of information systems focuses on the networks of hardware and software that are required to process, manipulate and distribute such data. === Computer systems and organisations === The process of analysing computer architecture and various logic circuits. This involves looking at low level computer processes at bit level computation. This is an in-depth look into the hardware processing of a computational system, involving looking at the basic structure of a computer and designing such systems. This can also involve evaluating complex circuit diagrams, and being able to construct these to solve a main problem. The main purpose behind this area of study is to achieve an understanding of how computers function on a basic level, often through tracing machine operations. === Machines, languages, and computation === This is the study into fundamental computer algorithms, which are the basis to computer programs. Without algorithms, no computer programs would exist. This also involves the process of looking into various mathematical functions behind computational algorithms, basic theory and functional (low level) programming. In an academic setting, this area would introduce the fundamental mathematical theorems and functions behind theoretical computer science which are the building blocks for other areas in the field. Complex topics such as; proofs, algebraic functions and sets will be introduced during studies of CIS. == Developments == Information and computer science is a field that is rapidly developing with job prospects for students being extremely promising with 75.7% of graduates gaining employment. Also the IT industry employs one in twenty of the workforce with it predicted to increase nearly five times faster than the average of the UK and between 2012 and 2017 more than half a million people will be needed within the industry and the fact that nine out of ten tech firms are suffering from candidate shortages which is having a negative impact on their business as it delays the creation and development of new products, and it's predicted in the US that in the next decade there will be more than one million jobs in the technology sector than computer science graduates to fill them. Because of this programming is now being taught at an earlier age with an aim to interest students from a young age into computer and information science hopefully leading more children to study this at a higher level. For example, children in England will now be exposed to computer programming at the age of 5 due to an updated national curriculum. == Employment == Due to the wide variety of jobs that now involve computer and information science related tasks, it is difficult to provide a comprehensive list of possible jobs in this area, but some of the key areas are artificial intelligence, software engineering and computer networking and communication. Work in this area also tends to require sufficient understanding of mathematics and science. Moreover, jobs that having a CIS degree can lead to, include: systems analyst, network administrator, system architect, information systems developer, web programmer, or software developer. The earning potential for CIS graduates is quite promising. A 2013 survey from the National Association of Colleges and Employers (NACE) found that the average starting salary for graduates who earned a degree in a computer related field was $59,977, up 4.3% from the prior year. This is higher than other popular degrees such as business ($54,234), education ($40,480) and math and sciences ($42,724). Furthermore, Payscale ranked 129 college degrees based on their graduates earning potential with engineering, math, science, and technology fields dominating the ranking. With eight computer related degrees appearing among the top 30. With the lowest starting salary for these jobs being $49,900. A Rasmussen College article describes various jobs CIS graduates may obtain with software applications developers at the top making a median income of $98,260. According to the National Careers Service an Information Scientist can expect to earn £24,000+ per year as a starting salary.

    Read more →
  • Automatic image annotation

    Automatic image annotation

    Automatic image annotation (also known as automatic image tagging or linguistic indexing) is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database. This method can be regarded as a type of multi-class image classification with a very large number of classes - as large as the vocabulary size. Typically, image analysis in the form of extracted feature vectors and the training annotation words are used by machine learning techniques to attempt to automatically apply annotations to new images. The first methods learned the correlations between image features and training annotations. Subsequently, techniques were developed using machine translation to attempt to translate the textual vocabulary into the 'visual vocabulary,' represented by clustered regions known as blobs. Subsequent work has included classification approaches, relevance models, and other related methods. The advantages of automatic image annotation versus content-based image retrieval (CBIR) are that queries can be more naturally specified by the user. At present, Content-Based Image Retrieval (CBIR) generally requires users to search by image concepts such as color and texture or by finding example queries. However, certain image features in example images may override the concept that the user is truly focusing on. Traditional methods of image retrieval, such as those used by libraries, have relied on manually annotated images, which is expensive and time-consuming, especially given the large and constantly growing image databases in existence.

    Read more →
  • Hard sigmoid

    Hard sigmoid

    In artificial intelligence, especially computer vision and artificial neural networks, a hard sigmoid is non-smooth function used in place of a sigmoid function. These retain the basic shape of a sigmoid, rising from 0 to 1, but using simpler functions, especially piecewise linear functions or piecewise constant functions. These are preferred where speed of computation is more important than precision. == Examples == The most extreme examples are the sign function or Heaviside step function, which go from −1 to 1 or 0 to 1 (which to use depends on normalization) at 0. Other examples include the Theano library, which provides two approximations: ultra_fast_sigmoid, which is a multi-part piecewise approximation and hard_sigmoid, which is a 3-part piecewise linear approximation (output 0, line with slope 0.2, output 1).

    Read more →
  • Quality of Data

    Quality of Data

    Quality of Data (QoD) is a designation coined by L. Veiga, that specifies and describes the required Quality of Service of a distributed storage system from the Consistency point of view of its data. It can be used to support big data management frameworks, Workflow management, and HPC systems (mainly for data replication and consistency). It takes into account data semantics, namely the Time interval of data freshness, the Sequence of tolerable number of outstanding versions of the data read before ore refresh, and the Value divergence allowed before displaying it. Initially it was based on a model from an existing research work regarding vector-field Consistency, awarded the best-paper prize in the ACM/IFIP/Usenix Middleware Conference 2007 and later enhanced for increased scalability and fault-tolerance. This consistency model has been successfully applied and proven in big data key/value store Apache HBase, initially designed as a middleware module seating between clusters from separate data centres. The HBase-QoD coupling minimises bandwidth usage and optimises resources allocation during replication achieving the desired consistency level at a more fine-grained level. QoD is defined by the three-dimensions of vector k=(θ,σ,ν), but with a broader view of the issue, applicable also to large-scale data management techniques in regards to their timely delivery. == Other descriptions == Quality of Data should not be confused with other definitions for data quality such as completeness, validity, and accuracy.

    Read more →
  • Semantic query

    Semantic query

    Semantic queries allow for queries and analytics of associative and contextual nature. Semantic queries enable the retrieval of both explicitly and implicitly derived information based on syntactic, semantic and structural information contained in data. They are designed to deliver precise results (possibly the distinctive selection of one single piece of information) or to answer more fuzzy and wide open questions through pattern matching and digital reasoning. Semantic queries work on named graphs, linked data or triples. This enables the query to process the actual relationships between information and infer the answers from the network of data. This is in contrast to semantic search, which uses semantics (meaning of language constructs) in unstructured text to produce a better search result. (See natural language processing.) From a technical point of view, semantic queries are precise relational-type operations much like a database query. They work on structured data and therefore have the possibility to utilize comprehensive features like operators (e.g. >, < and =), namespaces, pattern matching, subclassing, transitive relations, semantic rules and contextual full text search. The semantic web technology stack of the W3C is offering SPARQL to formulate semantic queries in a syntax similar to SQL. Semantic queries are used in triplestores, graph databases, semantic wikis, natural language and artificial intelligence systems. == Background == Relational databases represent all relationships between data in an implicit manner only. For example, the relationships between customers and products (stored in two content-tables and connected with an additional link-table) only come into existence in a query statement (SQL in the case of relational databases) written by a developer. Writing the query demands exact knowledge of the database schema. Linked-Data represent all relationships between data in an explicit manner. In the above example, no query code needs to be written. The correct product for each customer can be fetched automatically. Whereas this simple example is trivial, the real power of linked-data comes into play when a network of information is created (customers with their geo-spatial information like city, state and country; products with their categories within sub- and super-categories). Now the system can automatically answer more complex queries and analytics that look for the connection of a particular location with a product category. The development effort for this query is omitted. Executing a semantic query is conducted by walking the network of information and finding matches (also called Data Graph Traversal). Another important aspect of semantic queries is that the type of the relationship can be used to incorporate intelligence into the system. The relationship between a customer and a product has a fundamentally different nature than the relationship between a neighbourhood and its city. The latter enables the semantic query engine to infer that a customer living in Manhattan is also living in New York City whereas other relationships might have more complicated patterns and "contextual analytics". This process is called inference or reasoning and is the ability of the software to derive new information based on given facts. == Articles == Velez, Golda (2008). "Semantics Help Wall Street Cope With Data Overload". Wall Street & Technology. wallstreetandtech.com. Zhifeng, Xiao (2009). "Spatial information semantic query based on SPARQL". In Liu, Yaolin; Tang, Xinming (eds.). International Symposium on Spatial Analysis, Spatial-Temporal Data Modeling, and Data Mining. Vol. 7492. SPIE. pp. 74921P. Bibcode:2009SPIE.7492E..60X. doi:10.1117/12.838556. S2CID 62191842. Aquin, Mathieu (2010). "Watson, more than a Semantic Web search engine" (PDF). Semantic Web Journal. Dworetzky, Tom (2011). "How Siri Works: iPhone's 'Brain' Comes from Natural Language Processing". International Business Times. Horwitt, Elisabeth (2011). "The semantic Web gets down to business". computerworld.com. Rodriguez, Marko (2011). "Graph Pattern Matching with Gremlin". Marko A. Rodriguez. markorodriguez.com on Graph Computing. Sequeda, Juan (2011). "SPARQL Nuts & Bolts". Cambridge Semantics. Freitas, Andre (2012). "Querying Heterogeneous Datasets on the Linked Data Web" (PDF). IEEE Internet Computing. Kauppinen, Tomi (2012). "Using the SPARQL Package in R to handle Spatial Linked Data". linkedscience.org. Lorentz, Alissa (2013). "With Big Data, Context is a Big Issue". Wired.

    Read more →
  • Recording format

    Recording format

    A recording format is a format for encoding data for storage on a storage medium. The format can be container information such as sectors on a disk, or user/audience information (content) such as analog stereo audio. Multiple levels of encoding may be achieved in one format. For example, a text encoded page may contain HTML and XML encoding, combined in a plain text file format, using either EBCDIC or ASCII character encoding, on a UDF digitally formatted disk. In electronic media, the primary format is the encoding that requires hardware to interpret (decode) data; while secondary encoding is interpreted by secondary signal processing methods, usually computer software. == Recording container formats == A container format is a system for dividing physical storage space or virtual space for data. Data space can be divided evenly by a system of measurement, or divided unevenly with meta data. A grid may divide physical or virtual space with physical or virtual (dividers) borders, evenly or unevenly. Just as a physical container (such as a file cabinet) is divided by physical borders (such as drawers and file folders), data space is divided by virtual borders. Meta data such as a unit of measurement, address, or meta tags act as virtual borders in a container format. A template may be considered an abstract format for containing a solution as well as the content itself. Systems of measurement Metric system Geographic coordinate system Page grid Film formats Audio data format Video tape format Disk format File format Meta data Text formatting Template Data structure == Raw content formats == A raw content format is a system of converting data to displayable information. Raw content formats may either be recorded in secondary signal processing methods such as a software container format (e.g. digital audio, digital video) or recorded in the primary format. A primary raw content format may be directly observable (e.g. image, sound, motion, smell, sensation) or physical data which only requires hardware to display it, such as a phonographic needle and diaphragm or a projector lamp and magnifying glass.

    Read more →
  • Internet Security Awareness Training

    Internet Security Awareness Training

    Internet Security Awareness Training (ISAT) is the training given to members of an organization regarding the protection of various information assets of that organization. ISAT is a subset of general security awareness training (SAT). Even small and medium enterprises are generally recommended to provide such training, but organizations that need to comply with government regulations (e.g., the Gramm–Leach–Bliley Act, the Payment Card Industry Data Security Standard, Health Insurance Portability and Accountability Act, Sarbanes–Oxley Act) normally require formal ISAT for annually for all employees. Often such training is provided in the form of online courses. ISAT, also referred to as Security Education, Training, and Awareness (SETA), organizations train and create awareness of information security management within their environment. It is beneficial to organizations when employees are well trained and feel empowered to take important actions to protect themselves and organizational data. The SETA program target must be based on user roles within organizations and for positions that expose the organizations to increased risk levels, specialized courses must be required. == Coverage == There are general topics to cover for the training, but it is necessary for each organization to have a coverage strategy based on its needs, as this will ensure the training is practical and captures critical topics relevant to the organization. As the threat landscape changes very frequently, organizations should continuously review their training programs to ensure relevance with current trends. Topics covered in ISAT include: Appropriate methods for protecting sensitive information on personal computer systems, including password policy Various computer security concerns, including spam, malware, phishing, social engineering, etc. Consequences of failure to properly protect information, including potential job loss, economic consequences to the firm, damage to individuals whose private records are divulged, and possible civil and criminal law penalties. Being Internet Security Aware means you understand that there are people actively trying to steal data that is stored within your organization's computers. (This often focuses on user names and passwords, so that criminal elements can ultimately get access to bank accounts and other high-value IT assets.) That is why it is important to protect the assets of the organization and stop that from happening. The general scope should include topics such as password security, Email phishing, Social engineering, Mobile device security, Sensitive data security, and Business communications. In contrast, those requiring specialized knowledge are usually required to take technical and in-depth training courses. Suppose an organization determines that it is best to use one of the available training tools on the market, it must ensure it sets objectives that the training can meet, including confirming the training will provide employees with the knowledge to understand risks and the behaviors needed in managing them, actions to take to prevent or detect security incidents, using language easily understandable by the trainees, and ensuring the pricing is reasonable. Organizations are recommended to base ISAT training content on employee roles and their culture; the policy should guide that training for all employees and gave the following as examples of sources of reference materials: National Institute of Standards and Technology (NIST) Special Publication 800-50, Building an Information Technology Security Awareness and Training Program International Standards Organization (ISO) 27002:2013, Information technology—Security techniques—Code of practice for information security controls International Standards Organization (ISO) 27001:2013, Information technology — Security techniques — Information security management systems COBIT 5 Appendix F.2, Detailed Guidance: Services, Infrastructure and Applications Enabler, Security Awareness The training must focus on current threats specific to an organization and the impacts if that materializes as a result of user actions. Including practical examples and ways of dealing with scenarios help users know the appropriate measures to take. It is a good practice to periodically train customers of specific organizations on threats they face from people with malicious intentions. Coverage strategy for SAT should be driven by an organization's policy. It can help truly determine the level of depth of the training and where it should be conducted at a global level or business unit level, or a combination of both. A policy also empowers a responsible party within the organization to run the training. == Importance == Studies show that well-structured security awareness training can significantly reduce the likelihood of cyber incidents caused by human error. According to the Ponemon Institute, organizations that implement regular security training experience up to 70% fewer successful phishing attacks. Additionally, a 2023 Verizon Data Breach Investigations Report found that 74% of breaches involve the human element, highlighting the need for continuous education. Employees are key in whether organizations are breached or not; there must be a policy on creating awareness and training them on emerging threats and actions to take in safeguarding sensitive information and reporting any observed unusual activity within the corporate environment. Research has shown that SAT has helped reduce cyber-attacks within organizations, especially when it comes to phishing, as trainees learned to identify these attack modes and give them the self-assurance to take action appropriately. There is an increase in phishing attacks, and it has become increasingly important for people to understand how to these attacks work, and the actions required to prevent these and SAT has shown a significant impact on the number of successful phishing attacks against organizations. == Compliance Requirements == Various regulations and laws mandate SAT for organizations in specific industries, including the Gramm–Leach–Bliley Act (GLBA) for the financial services, the Federal Information Security Modernization Act of 2014 for federal agencies, and the European Union's General Data Protection Regulation (GDPR). === Federal Information Security Modernization Act === Employees and contractors in federal agencies are required to receive Security Awareness Training annually, and the program needs to address job-related information security risks linked that provide them with the knowledge to lessen security risks. === Health Insurance Portability and Accountability Act === The Health Insurance Portability and Accountability Act has the Security Rule, and Privacy Rule requiring the creation of a security awareness training program and ensuring employees are trained accordingly. === Payment Card Industry Data Security Standard === The Payment Card Industry Security Standards Council, the governing council for stakeholders in the payment industry, formed by American Express, Discover, JCB International, MasterCard, and Visa that developed the DSS as a requirement for the payment industry. Requirement 12.6 requires member organizations to institute a formal security awareness program. There is a published guide for organizations to adhere to when setting up the program. === US States Training Regulations === Some States mandate Security Awareness Training whiles other do not but simply recommend voluntary training. Among states that require the training for its employees include: Colorado (The Colorado Information Security Act, Colorado Revised Statutes 24-37.5-401 et seq.) Connecticut (13 FAM 301.1-1 Cyber Security Awareness Training (PS800)) Florida (Florida Statutes Chapter 282) Georgia (Executive Order GA E.O.182 mandated training within 90 days of issue) Illinois (Cook County) Indiana (IN H 1240) Louisiana (Louisiana Division of Administration, Office of Technology Services p. 52: LA H 633) Maryland (20-07 IT Security Policy) Montana (Mandatory cyber training for executive branch state employees) Nebraska Nevada (agency-by-agency state employee requirement - State Security Standard 123 – IT Security) New Hampshire New Jersey ( NJ A 1654) North Carolina Ohio (IT-15 - Security Awareness and Training) Pennsylvania Texas Utah Vermont Virginia West Virginia (WV Code Section 5A-6-4a) == Training Techniques == Below are some common training techniques, even though some can be blended depending on the operating environment: Interactive video training – This technique allows users to be trained using two-way interactive audio and video instruction. Web-based training – This method allows employees or users to take the training independently and usually has a testing component to determine if learning has taken place. If not, users can be allowed to retake the course and test to ensure there is a complete understanding

    Read more →
  • Subject indexing

    Subject indexing

    Subject indexing is the act of describing or classifying a document by index terms, keywords, or other symbols in order to indicate what different documents are about, to summarize their contents or to increase findability. In other words, the objective is to identify and describe the subject of documents. Indexes are constructed, separately, on three distinct levels: terms in a document, such as a book; objects in a collection, such as a library; and documents (such as books and articles) within a field of knowledge. Subject indexing is used in information retrieval especially to create bibliographic indexes to retrieve documents on a particular subject. Examples of academic indexing services are Zentralblatt MATH, Chemical Abstracts, and PubMed. The index terms were mostly assigned by experts but author keywords are also common. The process of indexing begins with any analysis of the subject of the document. The indexer must then identify terms that appropriately identify the subject, either by extracting words directly from the document or assigning words from a controlled vocabulary. The terms in the index are then presented in a systematic order. Indexers must decide how many terms to include and how specific the terms should be. Together this gives a depth of indexing. == Subject analysis == The first step in indexing is to decide on the subject matter of the document. In manual indexing, the indexer would consider the subject matter in terms of answer to a set of questions such as "Does the document deal with a specific product, condition or phenomenon?". As the analysis is influenced by the knowledge and experience of the indexer, it follows that two indexers may analyze the content differently and so come up with different index terms. This will impact on the success of retrieval. === Automatic vs. manual subject analysis === Automatic indexing follows set processes of analyzing frequencies of word patterns and comparing results to other documents in order to assign to subject categories. This requires no understanding of the material being indexed. This leads to more uniform indexing but at the expense of the true meaning being interpreted. A computer program will not understand the meaning of statements and may therefore fail to assign some relevant terms or assign incorrectly. Human indexers focus their attention on certain parts of the document such as the title, abstract, summary and conclusions, as analyzing the full text in depth is costly and time-consuming. An automated system takes away the time limit and allows the entire document to be analyzed, but also has the option to be directed to particular parts of the document. == Term selection == The second stage of indexing involves the translation of the subject analysis into a set of index terms. This can involve extracting from the document or assigning from a controlled vocabulary. With the ability to conduct a full text search widely available, many people have come to rely on their own expertise in conducting information searches and full text search has become very popular. Subject indexing and its experts, professional indexers, catalogers, and librarians, remains crucial to information organization and retrieval. These experts understand controlled vocabularies and are able to find information that cannot be located by full text search. The cost of expert analysis to create subject indexing is not easily compared to the cost of hardware, software and labor to manufacture a comparable set of full-text, fully searchable materials. With new web applications that allow every user to annotate documents, social tagging has gained popularity especially in the Web. One application of indexing, the book index, remains relatively unchanged despite the information revolution. === Extraction/Derived indexing === Extraction indexing involves taking words directly from the document. It uses natural language and lends itself well to automated techniques where word frequencies are calculated and those with a frequency over a pre-determined threshold are used as index terms. A stop-list containing common words (such as "the", "and") would be referred to and such stop words would be excluded as index terms. Automated extraction indexing may lead to loss of meaning of terms by indexing single words as opposed to phrases. Although it is possible to extract commonly occurring phrases, it becomes more difficult if key concepts are inconsistently worded in phrases. Automated extraction indexing also has the problem that, even with use of a stop-list to remove common words, some frequent words may not be useful for allowing discrimination between documents. For example, the term glucose is likely to occur frequently in any document related to diabetes. Therefore, use of this term would likely return most or all the documents in the database. Post-coordinated indexing where terms are combined at the time of searching would reduce this effect but the onus would be on the searcher to link appropriate terms as opposed to the information professional. In addition terms that occur infrequently may be highly significant for example a new drug may be mentioned infrequently but the novelty of the subject makes any reference significant. One method for allowing rarer terms to be included and common words to be excluded by automated techniques would be a relative frequency approach where frequency of a word in a document is compared to frequency in the database as a whole. Therefore, a term that occurs more often in a document than might be expected based on the rest of the database could then be used as an index term, and terms that occur equally frequently throughout will be excluded. Another problem with automated extraction is that it does not recognize when a concept is discussed but is not identified in the text by an indexable keyword. Since this process is based on simple string matching and involves no intellectual analysis, the resulting product is more appropriately known as a concordance than an index. === Assignment indexing === An alternative is assignment indexing where index terms are taken from a controlled vocabulary. This has the advantage of controlling for synonyms as the preferred term is indexed and synonyms or related terms direct the user to the preferred term. This means the user can find articles regardless of the specific term used by the author and saves the user from having to know and check all possible synonyms. It also removes any confusion caused by homographs by inclusion of a qualifying term. A third advantage is that it allows the linking of related terms whether they are linked by hierarchy or association, e.g. an index entry for an oral medication may list other oral medications as related terms on the same level of the hierarchy but would also link to broader terms such as treatment. Assignment indexing is used in manual indexing to improve inter-indexer consistency as different indexers will have a controlled set of terms to choose from. Controlled vocabularies do not completely remove inconsistencies as two indexers may still interpret the subject differently. == Index presentation == The final phase of indexing is to present the entries in a systematic order. This may involve linking entries. In a pre-coordinated index the indexer determines the order in which terms are linked in an entry by considering how a user may formulate their search. In a post-coordinated index, the entries are presented singly and the user can link the entries through searches, most commonly carried out by computer software. Post-coordination results in a loss of precision in comparison to pre-coordination. == Depth of indexing == Indexers must make decisions about what entries should be included and how many entries an index should incorporate. The depth of indexing describes the thoroughness of the indexing process with reference to exhaustivity and specificity. === Exhaustivity === An exhaustive index is one which lists all possible index terms. Greater exhaustivity gives a higher recall, or more likelihood of all the relevant articles being retrieved, however, this occurs at the expense of precision. This means that the user may retrieve a larger number of irrelevant documents or documents which only deal with the subject in little depth. In a manual system a greater level of exhaustivity brings with it a greater cost as more man-hours are required. The additional time taken in an automated system would be much less significant. At the other end of the scale, in a selective index only the most important aspects are covered. Recall is reduced in a selective index as if an indexer does not include enough terms, a highly relevant article may be overlooked. Therefore, indexers should strive for a balance and consider what the document may be used. They may also have to consider the implications of time and expense. === Specificity === The specificity describes how closel

    Read more →
  • Bibliometrician

    Bibliometrician

    A bibliometrician is a researcher or a specialist in bibliometrics. It is near-synonymous with an informetrican (who studies informetrics), a scientometrican (who study scientometrics) and a webometrician, who study webometrics. == Notable bibliometricians == Christine L. Borgman Samuel C. Bradford Blaise Cronin Margaret Elizabeth Egan Eugene Garfield (developer of the Science Citation Index and the Impact factor) Jorge E. Hirsch (developer of the h-index) Alfred J. Lotka Vasily Nalimov Derek J. de Solla Price Ronald Rousseau George Kingsley Zipf

    Read more →