AI Data Farms

AI Data Farms — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Contextual AI

    Contextual AI is an enterprise software company based in Mountain View, California. It develops a platform for building specialized Retrieval-Augmented Generation (RAG) agents for enterprise use. The company was founded in 2023 by Douwe Kiela and Amanpreet Singh, both former AI researchers at Facebook AI Research (FAIR) and Hugging Face. Douwe Kiela previously led the Meta research team that introduced the Retrieval-Augmented Generation (RAG) approach in 2020. Contextual AI focuses on enterprise generative AI applications using RAG 2.0 technology, with deployments primarily in the technology, banking, finance and media sectors. == History == In June 2023, Contextual AI announced it had raised $20 million in a seed funding round led by Bain Capital Ventures (BCV), with participation from Lightspeed Venture Partners, Greycroft, SV Angel, and several angel investors. In August 2024, the company raised $80 million in a Series A funding round led by Greycroft, with participation from previous investors including Bain Capital Ventures, Lightspeed, and Conviction Partners. The round also included new backers such as Bezos Expeditions, NVentures (Nvidia), HSBC Ventures, and Snowflake Ventures. == Features == Retrieval-Augmented Generation (RAG) is an artificial intelligence framework that integrates information retrieval with text generation to improve the performance of large language models (LLMs) on complex, knowledge-intensive tasks. It was introduced in 2020 by researchers at Meta AI, including Douwe Kiela, Patrick Lewis and others, in their paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. RAG enables language models to access and incorporate external information, such as proprietary databases or real-time web content, at query time, instead of relying solely on pre-trained, internal, static knowledge. 
This architecture addresses common limitations of standard LLMs, including hallucination, outdated information, and lack of attribution to source materials. RAG systems retrieve relevant context through a variety of techniques, including vector search, keyword search, and text-to-SQL, and feed this context into the language model to generate responses. The approach improves factual accuracy, supports domain-specific customization, enables citation of sources, and allows for more up-to-date information without retraining the model itself. General Availability. In January 2025, Contextual AI announced the general availability of its enterprise platform for building specialized RAG agents. Early adopters included Qualcomm, which used the platform for its Customer Engineering team. Grounded Language Model. In March 2025, the company introduced a Grounded Language Model (GLM) designed for factual accuracy in enterprise AI applications. Reranker. In March 2025, Contextual AI released an instruction-following reranker that allows users to influence the ranking of retrieved documents through natural language instructions, such as prioritizing recent files, specific formats, or content from designated sources. == Applications == Contextual AI's platform has been adopted across a range of industries, including finance, technology, media and professional services. Clients include Fortune 500 companies such as Qualcomm and HSBC.
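The retrieve-then-generate flow described under Features can be illustrated with a minimal sketch. It is not Contextual AI's implementation: the corpus, the `retrieve` and `build_prompt` helpers, and the prompt wording are all hypothetical, and simple keyword overlap stands in for whichever retrieval technique (vector search, keyword search, text-to-SQL) a real system would use.

```python
import re
from collections import Counter

# Hypothetical in-memory corpus; a real system would query a search index.
DOCUMENTS = {
    "policy.txt": "Refunds are issued within 14 days of a return request.",
    "shipping.txt": "Standard shipping takes 3 to 5 business days.",
    "warranty.txt": "Hardware is covered by a two year limited warranty.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, documents, k=2):
    """Keyword search: rank documents by occurrences of query terms."""
    query_terms = set(tokenize(query))
    scored = []
    for name, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties broken by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def build_prompt(query, documents, k=2):
    """Assemble the retrieved passages, labelled by source, into a prompt."""
    sources = retrieve(query, documents, k)
    context = "\n".join(f"[{name}] {documents[name]}" for name in sources)
    return (
        "Answer using only the sources below and cite them.\n"
        f"{context}\nQuestion: {query}"
    )


prompt = build_prompt("How long does shipping take?", DOCUMENTS)
```

The resulting `prompt` string would then be passed to a language model. Labelling each passage with its source name is what makes citation possible in the generated answer.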

  • Is an AI Copywriting Tool Worth It in 2026?

    Looking for the best AI copywriting tool? An AI copywriting tool is software that uses machine learning to help you get more done — it can save you hours every week by automating repetitive work. Most options offer a generous free tier, with paid plans unlocking higher limits, faster processing, and team features. Whether you are a beginner or a pro, the right AI copywriting tool slots into your workflow and pays for itself fast. Read on for hands-on impressions, pricing tiers, and the standout features that matter.

  • Models of DNA evolution

    A number of different Markov models of DNA sequence evolution have been proposed. These substitution models differ in terms of the parameters used to describe the rates at which one nucleotide replaces another during evolution. These models are frequently used in molecular phylogenetic analyses. In particular, they are used during the calculation of likelihood of a tree (in Bayesian and maximum likelihood approaches to tree estimation) and they are used to estimate the evolutionary distance between sequences from the observed differences between the sequences. == Introduction == These models are phenomenological descriptions of the evolution of DNA as a string of four discrete states. These Markov models do not explicitly depict the mechanism of mutation nor the action of natural selection. Rather they describe the relative rates of different changes. For example, mutational biases and purifying selection favoring conservative changes are probably both responsible for the relatively high rate of transitions compared to transversions in evolving sequences. However, the Kimura (K80) model described below only attempts to capture the effect of both forces in a parameter that reflects the relative rate of transitions to transversions. Evolutionary analyses of sequences are conducted on a wide variety of time scales. Thus, it is convenient to express these models in terms of the instantaneous rates of change between different states (the Q matrices below). If we are given a starting (ancestral) state at one position, the model's Q matrix and a branch length expressing the expected number of changes to have occurred since the ancestor, then we can derive the probability of the descendant sequence having each of the four states. The mathematical details of this transformation from rate-matrix to probability matrix are described in the mathematics of substitution models section of the substitution model page. 
By expressing models in terms of the instantaneous rates of change we can avoid estimating a large number of parameters for each branch on a phylogenetic tree (or each comparison if the analysis involves many pairwise sequence comparisons). The models described on this page describe the evolution of a single site within a set of sequences. They are often used for analyzing the evolution of an entire locus by making the simplifying assumption that different sites evolve independently and are identically distributed. This assumption may be justifiable if the sites can be assumed to be evolving neutrally. If the primary effect of natural selection on the evolution of the sequences is to constrain some sites, then models of among-site rate-heterogeneity can be used. This approach allows one to estimate only one matrix of relative rates of substitution, and another set of parameters describing the variance in the total rate of substitution across sites. == DNA evolution as a continuous-time Markov chain == === Continuous-time Markov chains === Continuous-time Markov chains have the usual transition matrices which are, in addition, parameterized by time, $t$. Specifically, if $E_1, E_2, E_3, E_4$ are the states, then the transition matrix is $P(t) = \big(P_{ij}(t)\big)$, where each individual entry $P_{ij}(t)$ refers to the probability that state $E_i$ will change to state $E_j$ in time $t$. Example: We would like to model the substitution process in DNA sequences (i.e. Jukes–Cantor, Kimura, etc.) in a continuous-time fashion.
The corresponding transition matrices will look like:
$$P(t) = \begin{pmatrix} p_{\mathrm{AA}}(t) & p_{\mathrm{AG}}(t) & p_{\mathrm{AC}}(t) & p_{\mathrm{AT}}(t) \\ p_{\mathrm{GA}}(t) & p_{\mathrm{GG}}(t) & p_{\mathrm{GC}}(t) & p_{\mathrm{GT}}(t) \\ p_{\mathrm{CA}}(t) & p_{\mathrm{CG}}(t) & p_{\mathrm{CC}}(t) & p_{\mathrm{CT}}(t) \\ p_{\mathrm{TA}}(t) & p_{\mathrm{TG}}(t) & p_{\mathrm{TC}}(t) & p_{\mathrm{TT}}(t) \end{pmatrix}$$
where the top-left and bottom-right 2 × 2 blocks correspond to transition probabilities and the top-right and bottom-left 2 × 2 blocks correspond to transversion probabilities. Assumption: If at some time $t_0$ the Markov chain is in state $E_i$, then the probability that at time $t_0 + t$ it will be in state $E_j$ depends only upon $i$, $j$ and $t$. This then allows us to write that probability as $p_{ij}(t)$. Theorem: Continuous-time transition matrices satisfy $P(t+\tau) = P(t)P(\tau)$. Note: There is here a possible confusion between two meanings of the word transition. (i) In the context of Markov chains, transition is the general term for the change between two states. (ii) In the context of nucleotide changes in DNA sequences, transition is a specific term for the exchange between either the two purines (A ↔ G) or the two pyrimidines (C ↔ T) (for additional details, see the article about transitions in genetics). By contrast, an exchange between one purine and one pyrimidine is called a transversion.
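The genetic sense of "transition" versus "transversion" can be made concrete with a short sketch. The function names are illustrative, and the state ordering A, G, C, T is taken from the matrix above. Diagonal entries (no change) are labelled "identity" here, a labelling choice of this sketch rather than of the text.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
STATES = "AGCT"  # same ordering as the rows and columns of P(t)


def classify_substitution(x, y):
    """Classify the change x -> y in the genetic sense of the terms."""
    if x not in STATES or y not in STATES:
        raise ValueError("states must be one of A, G, C, T")
    if x == y:
        return "identity"
    pair = {x, y}
    # Both purines or both pyrimidines: transition. Otherwise: transversion.
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def block_layout():
    """Return the 4 x 4 grid of labels matching the layout of P(t)."""
    return [[classify_substitution(x, y) for y in STATES] for x in STATES]


layout = block_layout()
```

With this ordering, the off-diagonal entries of the top-left and bottom-right 2 × 2 blocks of `layout` are transitions, and every entry of the other two blocks is a transversion.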
=== Deriving the dynamics of substitution === Consider a DNA sequence of fixed length m evolving in time by base replacement. Assume that the processes followed by the m sites are Markovian, independent and identically distributed, and that the process is constant over time. For a particular site, let $\mathcal{E} = \{A, G, C, T\}$ be the set of possible states for the site, and $\mathbf{p}(t) = (p_A(t), p_G(t), p_C(t), p_T(t))$ their respective probabilities at time $t$. For two distinct $x, y \in \mathcal{E}$, let $\mu_{xy}$ be the transition rate from state $x$ to state $y$. Similarly, for any $x$, let the total rate of change from $x$ be $\mu_x = \sum_{y \neq x} \mu_{xy}$. The changes in the probability distribution $p_A(t)$ for small increments of time $\Delta t$ are given by
$$p_A(t+\Delta t) = p_A(t) - p_A(t)\,\mu_A\,\Delta t + \sum_{x \neq A} p_x(t)\,\mu_{xA}\,\Delta t.$$
In other words (in frequentist language), the frequency of $A$'s at time $t + \Delta t$ is equal to the frequency at time $t$ minus the frequency of the lost $A$'s plus the frequency of the newly created $A$'s. The same holds for the probabilities $p_G(t)$, $p_C(t)$ and $p_T(t)$.
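The small-increment update above can be applied repeatedly to follow the state probabilities over time. The sketch below does this for all four states. The equal rates of 0.1, the step size, and the function name `euler_step` are assumptions chosen only for illustration; they are not values from the text.

```python
STATES = "AGCT"


def euler_step(p, mu, dt):
    """Apply one small-increment update to every state probability.

    p  : dict mapping each state to its probability at time t
    mu : dict mapping (x, y), x != y, to the rate of change from x to y
    dt : the small time increment
    """
    new_p = {}
    for y in STATES:
        # Total rate of leaving y (mu_y in the text).
        rate_out = sum(mu[(y, z)] for z in STATES if z != y)
        # Probability flowing into y from every other state.
        gained = sum(p[x] * mu[(x, y)] for x in STATES if x != y)
        new_p[y] = p[y] - p[y] * rate_out * dt + gained * dt
    return new_p


# Illustrative assumption: every substitution has the same rate.
mu = {(x, y): 0.1 for x in STATES for y in STATES if x != y}

# Start with the site known to be A and advance to t = 10 in steps of 0.01.
p = {"A": 1.0, "G": 0.0, "C": 0.0, "T": 0.0}
for _ in range(1000):
    p = euler_step(p, mu, dt=0.01)
```

Because each step only moves probability between states, the four values in `p` keep summing to one. With equal rates they drift toward 1/4 each as time grows.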
These equations can be written compactly as
$$\mathbf{p}(t+\Delta t) = \mathbf{p}(t) + \mathbf{p}(t)\,Q\,\Delta t,$$
where
$$Q = \begin{pmatrix} -\mu_A & \mu_{AG} & \mu_{AC} & \mu_{AT} \\ \mu_{GA} & -\mu_G & \mu_{GC} & \mu_{GT} \\ \mu_{CA} & \mu_{CG} & -\mu_C & \mu_{CT} \\ \mu_{TA} & \mu_{TG} & \mu_{TC} & -\mu_T \end{pmatrix}$$
is known as the rate matrix. Note that, by definition, the sum of the entries in each row of $Q$ is equal to zero. It follows that $\mathbf{p}'(t) = \mathbf{p}(t)\,Q$. For a stationary process, where $Q$ does not depend on time $t$, this differential equation can be solved. First, $P(t) = \exp(tQ)$, where $\exp(tQ)$ denotes the exponential of the matrix $tQ$. As a result, $\mathbf{p}(t) = \mathbf{p}(0)P(t) = \mathbf{p}(0)\exp(tQ)$. === Ergodicity === If the Markov chain is irreducible, i.e. if it is always possible to go from a state $x$ to a state $y$ (possibly in several steps), then it is also ergodic. As a result, it has a unique stationary distribution $\boldsymbol{\pi} = \{\pi_x,\, x \in \mathcal{E}\}$, where $\pi_x$ corresponds to the proportion of time spent in state $x$ after the Markov chain has run for an infinite amount of time. In DNA evo
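The solution $P(t) = \exp(tQ)$ stated above can be checked numerically. The sketch below builds $Q$ from a set of rates and computes the matrix exponential with a truncated Taylor series combined with scaling and squaring. It uses only the standard library; the equal rates of 0.1 and all helper names are assumptions for illustration. In practice a numerical library routine would normally be preferred.

```python
STATES = "AGCT"


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_exp(m, squarings=10, terms=12):
    """Approximate exp(m): Taylor series of m / 2**squarings, then squared back up."""
    n = len(m)
    scale = 2 ** squarings
    small = [[v / scale for v in row] for row in m]
    result = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        # term becomes small**k / k!
        term = [[v / k for v in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def rate_matrix(mu):
    """Build Q: off-diagonal entries are rates, each row sums to zero."""
    q = []
    for x in STATES:
        row = [0.0 if x == y else mu[(x, y)] for y in STATES]
        row[STATES.index(x)] = -sum(row)
        q.append(row)
    return q


def transition_matrix(q, t):
    """P(t) = exp(tQ)."""
    return mat_exp([[t * v for v in row] for row in q])


# Illustrative assumption: every substitution has the same rate.
mu = {(x, y): 0.1 for x in STATES for y in STATES if x != y}
Q = rate_matrix(mu)
P1 = transition_matrix(Q, 1.0)
P2 = transition_matrix(Q, 2.0)
P3 = transition_matrix(Q, 3.0)
```

Each row of `P1` sums to one, and `mat_mul(P1, P2)` agrees with `P3` up to rounding, which is the theorem $P(t+\tau) = P(t)P(\tau)$ stated earlier in the text.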

  • Janyce Wiebe

    Janyce Marbury Wiebe (1959–2018) was an American computer scientist specializing in natural language processing and known for her work on subjectivity, sentiment analysis, opinion mining, discourse processing, and word-sense disambiguation. == Early life and education == Wiebe was born in 1959 in Albany, New York. She majored in English at Binghamton University, graduating in 1981, and completed a Ph.D. in computer science in 1990 at the University at Buffalo. Her dissertation, Recognizing Subjective Sentences: A Computational Investigation of Narrative Text, was supervised by philosopher William J. Rapaport. == Career == After postdoctoral research at the University of Toronto, she became an assistant professor at New Mexico State University in 1992. In 2000, she moved to the University of Pittsburgh, where she became a professor of computer science and director of the Intelligent Systems Program. == Recognition == Wiebe was named a Fellow of the Association for Computational Linguistics in 2015. == Death == She died of leukemia on December 10, 2018.

  • Tango (platform)

    Tango (named Project Tango while in testing) was an augmented reality computing platform, developed and authored by the Advanced Technology and Projects (ATAP), a skunkworks division of Google. It used computer vision to enable mobile devices, such as smartphones and tablets, to detect their position relative to the world around them without using GPS or other external signals. This allowed application developers to create user experiences that include indoor navigation, 3D mapping, physical space measurement, environmental recognition, augmented reality, and windows into a virtual world. The first product to emerge from ATAP, Tango was developed by a team led by computer scientist Johnny Lee, a core contributor to Microsoft's Kinect. In an interview in June 2015, Lee said, "We're developing the hardware and software technologies to help everything and everyone understand precisely where they are, anywhere." Google produced two devices to demonstrate the Tango technology: the Peanut phone and the Yellowstone 7-inch tablet. More than 3,000 of these devices had been sold as of June 2015, chiefly to researchers and software developers interested in building applications for the platform. In the summer of 2015, Qualcomm and Intel both announced that they were developing Tango reference devices as models for device manufacturers who use their mobile chipsets. At CES, in January 2016, Google announced a partnership with Lenovo to release a consumer smartphone during the summer of 2016 to feature Tango technology marketed at consumers, noting a less than $500 price-point and a small form factor below 6.5 inches. At the same time, both companies also announced an application incubator to get applications developed to be on the device on launch. On 15 December 2017, Google announced that they would be ending support for Tango on March 1, 2018, in favor of ARCore. 
== Overview == Tango was different from other contemporary 3D-sensing computer vision products, in that it was designed to run on a standalone mobile phone or tablet and was chiefly concerned with determining the device's position and orientation within the environment. The software worked by integrating three types of functionality: (1) motion tracking, which used visual features of the environment, in combination with accelerometer and gyroscope data, to closely track the device's movements in space; (2) area learning, which stored environment data in a map that could be re-used later, shared with other Tango devices, and enhanced with metadata such as notes, instructions, or points of interest; and (3) depth perception, which detected distances, sizes, and surfaces in the environment. Together, these generated data about the device in "six degrees of freedom" (3 axes of orientation plus 3 axes of position) and detailed three-dimensional information about the environment. Project Tango was also the first project to graduate from Google X, in 2012. Applications on mobile devices used Tango's C and Java APIs to access this data in real time. In addition, an API was provided for integrating Tango with the Unity game engine; this enabled the conversion or creation of games that allow the user to interact and navigate in the game space by moving and rotating a Tango device in real space. These APIs were documented on the Google developer website. == Applications == Tango enabled apps to track a device's position and orientation within a detailed 3D environment, and to recognize known environments. This allowed the creation of applications such as in-store navigation, visual measurement and mapping utilities, presentation and design tools, and a variety of immersive games.
At Augmented World Expo 2015, Johnny Lee demonstrated a construction game that builds a virtual structure in real space, an AR showroom app that allows users to view a full-size virtual automobile and customize its features, a hybrid Nerf gun with mounted Tango screen for dodging and shooting AR monsters superimposed on reality, and a multiplayer VR app that lets multiple players converse in a virtual space where their avatar movements match their real-life movements. Tango apps are distributed through Play. Google has encouraged the development of more apps with hackathons, an app contest, and promotional discounts on the development tablet. == Devices == As a platform for software developers and a model for device manufacturers, Google created two Tango devices. === The Peanut phone === "Peanut" was the first production Tango device, released in the first quarter of 2014. It was a small Android phone with a Qualcomm MSM8974 quad-core processor and additional special hardware including a fisheye motion camera, "RGB-IR" camera for color image and infrared depth detection, and Movidius Vision processing units. A high-performance accelerometer and gyroscope were added after testing several competing models in the MARS lab at the University of Minnesota. Several hundred Peanut devices were distributed to early-access partners including university researchers in computer vision and robotics, as well as application developers and technology startups. Google stopped supporting the Peanut device in September 2015, as by then the Tango software stack had evolved beyond the versions of Android that run on the device. === The Yellowstone tablet === "Yellowstone" was a 7-inch tablet with full Tango functionality, released in June 2014, and sold as the Project Tango Tablet Development Kit. 
It featured a 2.3 GHz quad-core Nvidia Tegra K1 processor, 128 GB of flash memory, a 1920x1200-pixel touchscreen, a 4 MP color camera, a fisheye-lens (motion-tracking) camera, an IR projector with RGB-IR camera for integrated depth sensing, and 4G LTE connectivity. As of May 27, 2017, the Tango tablet is considered officially unsupported by Google. ==== Testing by NASA ==== In May 2014, two Peanut phones were delivered to the International Space Station to be part of a NASA project to develop autonomous robots that navigate in a variety of environments, including outer space. The soccer-ball-sized, 18-sided polyhedral SPHERES robots were developed at the NASA Ames Research Center, adjacent to the Google campus in Mountain View, California. Andres Martinez, SPHERES manager at NASA, said "We are researching how effective [Tango's] vision-based navigation abilities are for performing localization and navigation of a mobile free flyer on ISS." === Intel RealSense smartphone === The Intel RealSense smartphone was announced at Intel's Developer Forum in August 2015 and has been offered to the public through a Developer Kit since January 2016. It incorporated a RealSense ZR300 camera, which had the optical features required for Tango, such as the fisheye camera. === Lenovo Phab 2 Pro === The Lenovo Phab 2 Pro was the first commercial smartphone with Tango technology. The device was announced at the beginning of 2016, launched in August, and became available for purchase in the US in November. The Phab 2 Pro had a 6.4-inch screen, a Snapdragon 652 processor, and 64 GB of internal storage, with a rear-facing 16-megapixel camera and an 8-megapixel front camera. === Asus Zenfone AR === The Asus Zenfone AR, announced at CES 2017, was the second commercial smartphone with Tango technology. It ran Tango AR and Daydream VR on a Snapdragon 821, with 6 GB or 8 GB of RAM and 128 GB or 256 GB of internal memory depending on the configuration.

  • Selmer Bringsjord

    Selmer Bringsjord (born November 24, 1958) is a professor of computer science and cognitive science and a former chair of the Department of Cognitive Science at Rensselaer Polytechnic Institute. He also holds an appointment in the Lally School of Management & Technology and teaches artificial intelligence (AI), formal logic, human and machine reasoning, and philosophy of AI. == Biography == Bringsjord's education includes a B.A. in philosophy from the University of Pennsylvania and a Ph.D. in philosophy from Brown University, where he studied under Roderick Chisholm. He conducts research in AI as the director of the Rensselaer AI & Reasoning (RAIR) Laboratory. He specializes in the logico-mathematical and philosophical foundations of AI and cognitive science, and in collaboratively building AI systems on the basis of computational logic. Bringsjord believes that "the human mind will forever be superior to AI", and that "much of what many humans do for a living will be better done by indefatigable machines who require not a cent in pay". Bringsjord has stated that the "ultimate growth industry will be building smarter and smarter such machines on the one hand, and philosophizing about whether they are truly conscious and free on the other". Bringsjord has put forward an argument for P = NP using digital physics. Other research includes developing a new computational-logic framework allowing the formalization of deliberative multi-agent "mindreading" as applied to the realm of nuclear strategy, with the goal of creating a model and simulation to enable reliable prediction. He has published an opinion piece advocating for counter-terrorism security ensured by pervasive, all-seeing sensors; automated reasoners; and autonomous, lethal robots.
Bringsjord received a National Science Foundation award to research Social Robotics and the Covey Award for the advancement of philosophy of computing, awarded by the International Association for Computing And Philosophy, among several other prizes. == Books authored ==
with Yang, Y. Mental Metalogic: A New, Unifying Theory of Human and Machine Reasoning (Mahwah, NJ: Lawrence Erlbaum). (2007)
with Zenzen, M. Superminds: People Harness Hypercomputation, and More (Dordrecht, The Netherlands: Kluwer). (2003) ISBN 978-1402010958
with Ferrucci, D. Artificial Intelligence and Literary Creativity: Inside the Mind of Brutus, A Storytelling Machine (Mahwah, NJ: Lawrence Erlbaum). (2000)
Abortion: A Dialogue (Indianapolis, IN: Hackett). (1997)
What Robots Can and Can’t Be (Dordrecht, The Netherlands: Kluwer). (1992)
Soft Wars (New York, NY: Penguin USA). A novel. (1991)

  • Isolation forest

    Isolation forest is an unsupervised learning algorithm for anomaly detection that works on the principle of isolating anomalies, rather than the more common technique of profiling normal points. In statistics, an anomaly (a.k.a. outlier) is an observation or event that deviates so much from other events as to arouse suspicion that it was generated by a different mechanism. For example, the graph in Fig. 1 represents ingress traffic to a web server, expressed as the number of requests in 3-hour intervals, for a period of one month. It is evident from the picture that some points (marked with a red circle) are unusually high, to the point of raising the suspicion that the web server might have been under attack at that time. On the other hand, the flat segment indicated by the red arrow also seems unusual and might be a sign that the server was down during that time period. Anomalies in a big dataset may follow very complicated patterns, which are difficult to detect "by eye" in the great majority of cases. This is why the field of anomaly detection is well suited to the application of machine learning techniques. The most common techniques employed for anomaly detection are based on the construction of a profile of what is "normal": anomalies are reported as those instances in the dataset that do not conform to the normal profile. Isolation Forest uses a different approach: instead of trying to build a model of normal instances, it explicitly isolates anomalous points in the dataset. The main advantage of this approach is the possibility of exploiting sampling techniques to an extent that is not available to profile-based methods, creating a very fast algorithm with a low memory demand. == History == The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008.
The authors took advantage of two quantitative properties of anomalous data points in a sample: (i) they are the minority, consisting of fewer instances, and (ii) they have attribute values that are very different from those of normal instances. Since anomalies are typically few and very different from the other points in the sample, they must be easier to "isolate" than normal points. On the basis of this principle, Isolation Forest builds an ensemble of "Isolation Trees" (iTrees) for the data set and marks as anomalies the points that have short average path lengths on the iTrees. In a later paper, published in 2012, the same authors described a set of experiments to show that iForest: (i) has a low linear time complexity and a small memory requirement; (ii) is able to deal with high-dimensional data with irrelevant attributes; (iii) can be trained with or without anomalies in the training set; and (iv) can provide detection results with different levels of granularity without re-training. In 2013 Zhiguo Ding and Minrui Fei proposed a framework based on iForest to resolve the problem of detecting anomalies in streaming data. More applications of iForest to streaming data are described in papers by Swee Chuan Tan et al., G. A. Susto et al. and Yu Weng et al. One of the main problems of the application of iForest to anomaly detection was not with the model itself, but rather in the way the "anomaly score" was computed. This problem was highlighted by Sahand Hariri, Matias Carrasco Kind and Robert J. Brunner in a 2018 paper, wherein they proposed an improved iForest model named Extended Isolation Forest (EIF). In the same paper the authors describe the improvements made to the original model and how they enhance the consistency and reliability of the anomaly score produced for a given data point.
== Algorithm == The Isolation Forest algorithm is based on the tendency of anomalous instances in a dataset to be easier to separate from the rest of the sample (isolate) than normal points. In order to isolate a data point, the algorithm recursively generates partitions on the sample by randomly selecting an attribute and then randomly selecting a split value for the attribute, between the minimum and maximum values allowed for that attribute. An example of random partitioning in a 2D dataset of normally distributed points is given in Fig. 2 for a non-anomalous point and Fig. 3 for a point that is more likely to be an anomaly. It is apparent from the pictures that anomalies require fewer random partitions to be isolated than normal points. From a mathematical point of view, recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root. For example, the path length of point xi in Fig. 2 is greater than the path length of xj in Fig. 3. More formally, let X = { x1, ..., xn } be a set of d-dimensional points and X' ⊂ X a subset of X. An Isolation Tree (iTree) is defined as a data structure with the following properties: (i) for each node T in the tree, T is either an external node with no child, or an internal node with one "test" and exactly two daughter nodes (Tl, Tr); (ii) a test at node T consists of an attribute q and a split value p such that the test q < p determines the traversal of a data point to either Tl or Tr. In order to build an iTree, the algorithm recursively divides X' by randomly selecting an attribute q and a split value p, until either (i) the node has only one instance or (ii) all data at the node have the same values. When the iTree is fully grown, each point in X is isolated at one of the external nodes.
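The iTree construction described above can be sketched in a few dozen lines. This is a simplified illustration, not the authors' reference implementation: the dictionary-based node layout, the function names, the synthetic data, and the choice to pick only among attributes that still vary at a node are all assumptions of this sketch. The path length is counted as the number of edges from the root to an external node, and it is averaged over an ensemble of trees.

```python
import random


def build_itree(points, rng):
    """Recursively partition `points` (a list of equal-length tuples)."""
    # Stop when the node holds one instance or all instances are identical.
    if len(points) <= 1 or all(p == points[0] for p in points):
        return {"size": len(points)}  # external node
    # Simplification: choose only among attributes that still vary here.
    dims = range(len(points[0]))
    candidates = [q for q in dims
                  if min(p[q] for p in points) < max(p[q] for p in points)]
    q = rng.choice(candidates)
    lo = min(p[q] for p in points)
    hi = max(p[q] for p in points)
    split = rng.uniform(lo, hi)
    while split <= lo:  # guard so that both sides are non-empty
        split = rng.uniform(lo, hi)
    left = [p for p in points if p[q] < split]
    right = [p for p in points if p[q] >= split]
    return {"attribute": q, "split": split,
            "left": build_itree(left, rng), "right": build_itree(right, rng)}


def path_length(tree, point):
    """Count the edges traversed from the root to an external node."""
    edges = 0
    node = tree
    while "attribute" in node:
        go_left = point[node["attribute"]] < node["split"]
        node = node["left"] if go_left else node["right"]
        edges += 1
    return edges


def average_path_length(trees, point):
    return sum(path_length(t, point) for t in trees) / len(trees)


# Synthetic data: a 2D normal cluster plus one far-away point.
rng = random.Random(0)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
center = (0.0, 0.0)  # probe point in the dense region
trees = [build_itree(cluster + [outlier], rng) for _ in range(50)]

outlier_depth = average_path_length(trees, outlier)
center_depth = average_path_length(trees, center)
```

On data like this, `outlier_depth` comes out smaller than `center_depth`: the isolated point is separated from the rest after only a few random splits, while a probe in the dense region travels much deeper into each tree.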
Intuitively, the anomalous points are those that are easier to isolate and hence have a smaller path length in the tree, where the path length h(xi) of point $x_i \in X$ is defined as the number of edges xi traverses from the root node to get to an external node. A probabilistic explanation of iTree is provided in the original iForest paper. == Properties of Isolation Forest == Sub-sampling: since iForest does not need to isolate all normal instances, it can frequently ignore the great majority of the training sample. As a consequence, iForest works very well when the sampling size is kept small, a property that is in contrast with the great majority of existing methods, where a large sampling size is usually desirable. Swamping: when normal instances are too close to anomalies, the number of partitions required to separate anomalies increases, a phenomenon known as swamping, which makes it more difficult for iForest to discriminate between anomalies and normal points. One of the main reasons for swamping is the presence of too much data for the purpose of anomaly detection, which implies that one possible solution to the problem is sub-sampling. Since iForest responds very well to sub-sampling in terms of performance, reducing the number of points in the sample is also a good way to reduce the effect of swamping. Masking: when the number of anomalies is high, it is possible that some of them aggregate in a dense and large cluster, making it more difficult to separate the single anomalies and, in turn, to detect such points as anomalous. Similarly to swamping, this phenomenon (known as "masking") is also more likely when the number of points in the sample is large, and can be alleviated through sub-sampling. High Dimensional Data: one of the main limitations of standard, distance-based methods is their inefficiency in dealing with high-dimensional datasets.
The main reason is that in a high-dimensional space every point is equally sparse, so a distance-based measure of separation is largely ineffective. High-dimensional data also degrades the detection performance of iForest, but performance can be greatly improved by adding a feature selection test, such as kurtosis, to reduce the dimensionality of the sample space. Normal Instances Only: iForest performs well even if the training set does not contain any anomalous points, because iForest describes data distributions in such a way that high values of the path length h(xi) correspond to the presence of data points. As a consequence, the presence of anomalies is largely irrelevant to iForest's detection performance. == Anomaly Detection with Isolation Forest == Anomaly detection with Isolation Forest is a process composed of two main stages: in the first stage, a training dataset is used to build iTrees as described in the previous sections; in the second stage, each instance in the test set is passed through the iTrees built in the previous stage, and an "anomaly score" is assigned to the instance using the algorithm described below. Once all the instances in the test set have been assigned an anomaly score, it is possible to mark as "anomaly" any point whose score is greater than a predefined threshold, which depends on the domain the analysis is being applied to. === Anomaly Score === Th

    Read more →
  • Edward Stabler

    Edward Stabler

    Edward Stabler is a Professor of Linguistics at the University of California, Los Angeles. His primary areas of research are (1) Natural Language Processing (NLP), (2) Parsing and formal language theory, and (3) Philosophy of Logic and Language. He was a member of the faculty at UCLA from 1984 to 2016. His work involves the production of software for minimalist grammars (MGs) and related systems. == Early life and education == Stabler received his Ph.D. from the Department of Linguistics and Philosophy at MIT in 1981. == Recent publications == Edward Stabler (2011) Computational perspectives on minimalism. Revised version in C. Boeckx, ed, Oxford Handbook of Linguistic Minimalism, pp. 617–642. Edward Stabler (2010) A defense of this perspective against the Evans&Levinson critique appears here, with revised version in Lingua 120(12): 2680-2685. Edward Stabler (2010) After GB. Revised version in J. van Benthem & A. ter Meulen, eds, Handbook of Logic and Language, pp. 395–414. Edward Stabler (2010) Recursion in grammar and performance. Presented at the 2009 UMass recursion conference. Edward Stabler (2009) Computational models of language universals. Revised version appears in M. H. Christiansen, C. Collins, and S. Edelman, eds., Language Universals, Oxford: Oxford University Press, pages 200-223. Edward Stabler (2008) Tupled pregroup grammars. Revised version appears in P. Casadio and J. Lambek, eds., Computational Algebraic Approaches to Natural Language, Milan: Polimetrica, pages 23–52. Edward Stabler (2006) Sidewards without copying. Proceedings of the 11th Conference on Formal Grammar, edited by P. Monachesi, G. Penn, G. Satta, and S. Wintner. Stanford: CSLI Publications, 2006, pages 133-146.

    Read more →
  • Viola–Jones object detection framework

    Viola–Jones object detection framework

    The Viola–Jones object detection framework is a machine learning object detection framework proposed in 2001 by Paul Viola and Michael Jones. It was motivated primarily by the problem of face detection, although it can be adapted to the detection of other object classes. In short, it consists of a sequence of classifiers. Each classifier is a single perceptron with several binary masks (Haar features). To detect faces in an image, a sliding window is moved over the image. For each window, the classifiers are applied in sequence. If at any point a classifier outputs "no face detected", then the window is considered to contain no face. Otherwise, if all classifiers output "face detected", then the window is considered to contain a face. The algorithm is efficient for its time, able to detect faces in 384 by 288 pixel images at 15 frames per second on a conventional 700 MHz Intel Pentium III. It is also robust, achieving high precision and recall. While it has lower accuracy than more modern methods such as convolutional neural networks, its efficiency and compact size (only around 50k parameters, compared to millions of parameters for a typical CNN like DeepFace) mean it is still used in cases with limited computational power. For example, the original paper reported that this face detector could run on the Compaq iPAQ at 2 fps (this device has a low-power StrongARM without floating-point hardware). == Problem description == Face detection is a binary classification problem combined with a localization problem: given a picture, decide whether it contains faces, and construct bounding boxes for the faces. To make the task more manageable, the Viola–Jones algorithm only detects full-view (no occlusion), frontal (no head-turning), upright (no rotation), well-lit, full-sized (occupying most of the frame) faces in fixed-resolution images. 
The restrictions are not as severe as they appear, as one can normalize the picture to bring it closer to the requirements for Viola–Jones: any image can be scaled to a fixed resolution; for a general picture with a face of unknown size and orientation, one can perform blob detection to discover potential faces, then scale and rotate them into the upright, full-sized position; the brightness of the image can be corrected by white balancing; the bounding boxes can be found by sliding a window across the entire picture and marking down every window that contains a face. This would generally detect the same face multiple times, for which duplicate-removal methods, such as non-maximal suppression, can be used. The "frontal" requirement is non-negotiable, as there is no simple transformation on the image that can turn a face from a side view to a frontal view. However, one can train multiple Viola–Jones classifiers, one for each angle: one for frontal view, one for 3/4 view, one for profile view, and a few more for the angles in between. Then one can at run time execute all these classifiers in parallel to detect faces at different view angles. The "full-view" requirement is also non-negotiable, and cannot be simply dealt with by training more Viola–Jones classifiers, since there are too many possible ways to occlude a face. == Components of the framework == Consider an image $I(x,y)$ of fixed resolution $(M,N)$. Our task is to make a binary decision: whether it is a photo of a standardized face (frontal, well-lit, etc.) or not. Viola–Jones is essentially a boosted feature learning algorithm, trained by running a modified AdaBoost algorithm on Haar feature classifiers to find a sequence of classifiers $f_1, f_2, \dots, f_k$. 
Haar feature classifiers are crude, but allow very fast computation, and the modified AdaBoost constructs a strong classifier out of many weak ones. At run time, a given image $I$ is tested on $f_1(I), f_2(I), \dots, f_k(I)$ sequentially. If at any point $f_i(I) = 0$, the algorithm immediately returns "no face detected". If all classifiers return 1, then the algorithm returns "face detected". For this reason, the Viola–Jones classifier is also called a "Haar cascade classifier". === Haar feature classifiers === Consider a perceptron $f_{w,b}$ defined by two variables $w(x,y), b$. It takes in an image $I(x,y)$ of fixed resolution, and returns $$f_{w,b}(I) = \begin{cases} 1, & \text{if } \sum_{x,y} w(x,y) I(x,y) + b > 0 \\ 0, & \text{else} \end{cases}$$ A Haar feature classifier is a perceptron $f_{w,b}$ with a very special kind of $w$ that makes it extremely cheap to calculate. Namely, if we write out the matrix $w(x,y)$, we find that it takes only three possible values $\{+1, -1, 0\}$, and if we color the matrix with white on $+1$, black on $-1$, and transparent on $0$, the matrix is in one of the 5 possible patterns shown on the right. Each pattern must also be symmetric under x-reflection and y-reflection (ignoring the color change), so for example, for the horizontal white-black feature, the two rectangles must be of the same width. For the vertical white-black-white feature, the white rectangles must be of the same height, but there is no restriction on the black rectangle's height. 
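As a rough illustration of why such a classifier is cheap to evaluate, the sketch below computes a horizontal white-black feature with a summed-area table (integral image) and chains several classifiers into a cascade. It is only a sketch under stated assumptions: images are plain lists of rows indexed as `img[y][x]`, the rectangle width is assumed even, each stage is a single `(rect, b)` pair, and all function names are invented here rather than taken from any library.

```python
def summed_area_table(img):
    """Return S with S[y][x] = sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    s = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_total = 0
        for x in range(w):
            row_total += img[y][x]
            s[y + 1][x + 1] = s[y][x + 1] + row_total
    return s

def rect_sum(s, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return s[bottom][right] - s[top][right] - s[bottom][left] + s[top][left]

def horizontal_feature(s, top, left, height, width):
    """White-black feature: weight +1 on the left half, -1 on the right half."""
    half = width // 2  # assumes an even width so both halves match
    white = rect_sum(s, top, left, height, half)
    black = rect_sum(s, top, left + half, height, half)
    return white - black

def haar_classifier(s, rect, b):
    """Perceptron f_{w,b}(I): 1 if the weighted pixel sum plus b is positive."""
    return 1 if horizontal_feature(s, *rect) + b > 0 else 0

def cascade(s, stages):
    """Apply (rect, b) stages in order; reject at the first stage that outputs 0."""
    for rect, b in stages:
        if haar_classifier(s, rect, b) == 0:
            return False  # "no face detected"
    return True  # every stage returned 1

# Hypothetical 4x4 images: one bright on the left half, one uniform.
left_bright = [[9, 9, 1, 1] for _ in range(4)]
flat = [[5, 5, 5, 5] for _ in range(4)]
```

Whatever the rectangle size, `rect_sum` touches only four table entries, so the cost of one feature does not grow with the area it covers.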
==== Rationale for Haar features ==== The Haar features used in the Viola–Jones algorithm are a subset of the more general Haar basis functions, which have been used previously in the realm of image-based object detection. While crude compared to alternatives such as steerable filters, Haar features are sufficiently complex to match features of typical human faces. For example: the eye region is darker than the upper cheeks; the nose bridge region is brighter than the eyes. Composition of properties forming matchable facial features: location and size (eyes, mouth, bridge of nose); value (oriented gradients of pixel intensities). Further, the design of Haar features allows for efficient computation of $f_{w,b}(I)$ using only a constant number of additions and subtractions, regardless of the size of the rectangular features, using the summed-area table. === Learning and using a Viola–Jones classifier === Choose a resolution $(M,N)$ for the images to be classified. In the original paper, they recommended $(M,N) = (24,24)$. ==== Learning ==== Collect a training set, with some images containing faces and others not containing faces. Perform a certain modified AdaBoost training on the set of all Haar feature classifiers of dimension $(M,N)$, until a desired level of precision and recall is reached. The modified AdaBoost algorithm outputs a sequence of Haar feature classifiers $f_1, f_2, \dots, f_k$. The details of the modified AdaBoost algorithm are given below. ==== Using ==== To use a Viola–Jones classifier with $f_1, f_2, \dots, f_k$ on an image $I$, compute $f_1(I), f_2(I), \dots, f_k(I)$ sequentially. 
If at any point, $f_i(I) = 0$, the algorithm immediately returns "no face detected". If all classifiers return 1, then the algorithm returns "face detected". === Learning algorithm === The speed with which features may be evaluated does not adequately compensate for their number, however. For example, in a standard 24x24 pixel sub-window, there are a total of M = 162336 possible features, and it would be prohibitively expensive to evaluate them all when testing an image. Thus, the object detection framework employs a variant of the learning algorithm AdaBoost to both select the best features and to train classifiers that use them. This algorithm constructs a "strong" classifier as a linear combination of weighted simple “weak” classifiers. $$h(\mathbf{x}) = \operatorname{sgn}\left(\sum_{j=1}^{M} \alpha_j h_j(\mathbf{x})\right)$$ Each weak classifier is a threshold function based on the feature $f_j$. $$h_j(\mathbf{x}) = \begin{cases} -s_j & \text{if } f_j < \theta_j \\ s_j & \text{otherwise} \end{cases}$$ The threshold value $\theta_j$ and the polarity $s_j \in \pm 1$ are determined in the training, as well as the coefficients $\alpha_j$. Here a simplified version of the lea

    Read more →
  • Synchronizing word

    Synchronizing word

    In computer science, more precisely, in the theory of deterministic finite automata (DFA), a synchronizing word or reset sequence is a word in the input alphabet of the DFA that sends any state of the DFA to one and the same state. That is, if an ensemble of copies of the DFA are each started in different states, and all of the copies process the synchronizing word, they will all end up in the same state. Not every DFA has a synchronizing word; for instance, a DFA with two states, one for words of even length and one for words of odd length, can never be synchronized. == Existence == Given a DFA, the problem of determining if it has a synchronizing word can be solved in polynomial time using a theorem due to Ján Černý. A simple approach considers the power set of states of the DFA, and builds a directed graph where nodes belong to the power set, and a directed edge describes the action of the transition function. A path from the node of all states to a singleton state shows the existence of a synchronizing word. This algorithm is exponential in the number of states. A polynomial algorithm results however, due to a theorem of Černý that exploits the substructure of the problem, and shows that a synchronizing word exists if and only if every pair of states has a synchronizing word. == Length == The problem of estimating the length of synchronizing words has a long history and was posed independently by several authors, but it is commonly known as the Černý conjecture. In 1969, Ján Černý conjectured that (n − 1)² is the upper bound for the length of the shortest synchronizing word for any n-state complete DFA (a DFA with complete state transition graph). If this is true, it would be tight: in his 1964 paper, Černý exhibited a class of automata (indexed by the number n of states) for which the shortest reset words have this length. The best upper bound known is 0.1654n³, far from the lower bound. 
For n-state DFAs over a k-letter input alphabet, an algorithm by David Eppstein finds a synchronizing word of length at most 11n³/48 + O(n²), and runs in time complexity O(n³ + kn²). This algorithm does not always find the shortest possible synchronizing word for a given automaton; as Eppstein also shows, the problem of finding the shortest synchronizing word is NP-complete. However, for a special class of automata in which all state transitions preserve the cyclic order of the states, he describes a different algorithm with time O(kn²) that always finds the shortest synchronizing word, proves that these automata always have a synchronizing word of length at most (n − 1)² (the bound given in Černý's conjecture), and exhibits examples of automata with this special form whose shortest synchronizing word has length exactly (n − 1)². == Road coloring == The road coloring problem is the problem of labeling the edges of a regular directed graph with the symbols of a k-letter input alphabet (where k is the outdegree of each vertex) in order to form a synchronizable DFA. It was conjectured in 1970 by Benjamin Weiss and Roy Adler that any strongly connected and aperiodic regular digraph can be labeled in this way; their conjecture was proven in 2007 by Avraham Trahtman. == Related: transformation semigroups == A transformation semigroup is synchronizing if it contains an element of rank 1, that is, an element whose image is of cardinality 1. A DFA corresponds to a transformation semigroup with a distinguished generator set.
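The pairwise criterion from the Existence section suggests a simple polynomial-time procedure: repeatedly pick two states still in play, find a word that merges them by breadth-first search over pairs of states, and apply that word to the whole set. The sketch below is a plain greedy version of that idea, not Eppstein's algorithm, and it does not aim for the shortest word. The encoding `delta[state][letter]` with integer states and letters, and the function names, are assumptions made for the example.

```python
from collections import deque

def merging_word(delta, p, q):
    """Shortest word sending p and q to the same state, or None (BFS over pairs)."""
    start = (p, q)
    parent = {start: None}
    queue = deque([start])
    while queue:
        a, b = queue.popleft()
        if a == b:
            # Walk back to the start pair, collecting the letters used.
            word, node = [], (a, b)
            while parent[node] is not None:
                node, letter = parent[node]
                word.append(letter)
            return word[::-1]
        for letter in range(len(delta[a])):
            nxt = (delta[a][letter], delta[b][letter])
            if nxt not in parent:
                parent[nxt] = ((a, b), letter)
                queue.append(nxt)
    return None

def apply_word(delta, states, word):
    """Image of a set of states under a word."""
    for letter in word:
        states = {delta[s][letter] for s in states}
    return states

def synchronizing_word(delta):
    """Greedy pair merging: returns some synchronizing word, or None if none exists."""
    states = set(range(len(delta)))
    word = []
    while len(states) > 1:
        p, q = sorted(states)[:2]
        w = merging_word(delta, p, q)
        if w is None:
            return None  # an unmergeable pair rules out any synchronizing word
        states = apply_word(delta, states, w)
        word.extend(w)
    return word

# Hypothetical examples: a 4-state automaton with a cyclic letter (0) and a
# letter (1) that only moves state 3 to state 0, and the two-state parity DFA.
cycle_with_merge = [[1, 0], [2, 1], [3, 2], [0, 0]]
parity = [[1], [0]]
```

Each round shrinks the set of active states by at least one, so at most n − 1 searches over at most n² pairs are needed; for `parity` the search finds no merging word and the function returns `None`, matching the even/odd example in the introduction.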

    Read more →
  • Tomáš Mikolov

    Tomáš Mikolov

    Tomáš Mikolov is a Czech computer scientist working in the field of machine learning. In March 2020, Mikolov became a senior research scientist at the Czech Institute of Informatics, Robotics and Cybernetics. == Career == Mikolov obtained his PhD in Computer Science from Brno University of Technology for his work on recurrent neural network-based language models. He is the lead author of the 2013 paper that introduced the Word2vec technique in natural language processing and is an author on the FastText architecture. Mikolov came up with the idea to generate text from neural language models in 2007 and his RNNLM toolkit was the first to demonstrate the capability to train language models on large corpora, resulting in large improvements over the state of the art. Prior to joining Facebook in 2014, Mikolov worked as a visiting researcher at Johns Hopkins University, Université de Montréal, Microsoft and Google. He left Facebook at some time in 2019/2020 to join the Czech Institute of Informatics, Robotics and Cybernetics. Mikolov has argued that humanity might be at a greater existential risk if an artificial general intelligence is not developed.

    Read more →
  • The Best Free AI Analytics Tool for Beginners

    The Best Free AI Analytics Tool for Beginners

    Trying to pick the best AI analytics tool? An AI analytics tool is software that uses machine learning to help you get more done — it scales effortlessly from a single task to thousands. The best picks balance beginner-friendly simplicity with the depth power users need, and they ship updates often. Whether you are a beginner or a pro, the right AI analytics tool slots into your workflow and pays for itself fast. This guide breaks down the top picks, their pros and cons, and who each one is best for.

    Read more →
  • Apache CarbonData

    Apache CarbonData

    Apache CarbonData is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop, namely RCFile and ORC. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. == History == CarbonData was developed at Huawei in 2013. The project was donated to the Apache Community in 2015 and submitted to the Apache Incubator in June 2016. The project won top honors in the BlackDuck 2016 Open Source Rookies of the Year's Big Data category. Apache CarbonData has been a top-level Apache Software Foundation (ASF)-sponsored project since May 1, 2017.

    Read more →
  • Salvatore J. Stolfo

    Salvatore J. Stolfo

    Salvatore J. Stolfo is an academic and professor of computer science at Columbia University, specializing in computer security. == Early life == Born in Brooklyn, New York, Stolfo received a Bachelor of Science degree in Computer Science and Mathematics from Brooklyn College in 1974. He received his Ph.D. from NYU Courant Institute in 1979 and has been on the faculty of Columbia ever since, where he has taught courses in Artificial Intelligence, Intrusion and Anomaly Detection Systems, Introduction to Programming, Fundamental Algorithms, Data Structures, and Knowledge-Based Expert Systems. == Academic research == While at Columbia, Stolfo has received close to $50M in funding for research that has broadly focused on Security, Intrusion Detection, Anomaly Detection, and Machine Learning, and includes early work in parallel computing and artificial intelligence. He has published or co-authored over 250 papers and has over 46,000 citations with an H-index of 102. In 1996 he proposed a project with DARPA that applies machine learning to behavioral patterns to detect fraud or intrusion in networks. DADO, developed in part by Stolfo, introduced the parallel computing primitive “Broadcast, Resolve, Report”, a hardwire implemented mechanism that today is called MapReduce. Among his earliest work, Stolfo, along with colleague Greg Vesonder of Bell Labs, developed a large-scale expert data analysis system called ACE (Automated Cable Expertise) for the nation's phone system. AT&T Bell Labs distributed ACE to a number of telephone wire centers to improve the management and scheduling of repairs in the local loop. 
Stolfo coined the term FOG computing (not to be confused with fog computing), where technology is used “to launch disinformation attacks against malicious insiders, preventing them from distinguishing the real sensitive customer data from fake worthless data.” In 2005 Stolfo received funding from the Army Research Office to conduct a workshop bringing together a group of researchers to help identify a research program focused on insider threats. He was elevated to IEEE Fellow in 2018 "for his contributions to machine learning based cybersecurity." He was elected as an ACM Fellow in 2019 "for contributions to machine-learning-based cybersecurity and parallel hardware for database inference systems". == Career == Red Balloon Security (or RBS) is a cyber security company founded in 2011 by Dr. Sal Stolfo and Dr. Ang Cui. A spinout from the IDS lab, RBS developed a symbiote technology called FRAK as a host defense for embedded systems under the sponsorship of DARPA's Cyber Fast Track program. Building on their IDS lab research for the DARPA Active Authentication and Anomaly Detection at Multiple Scales programs, Dr. Sal Stolfo and Dr. Angelos Keromytis founded Allure Security Technologies, which uses active behavioral authentication and decoy technology that Stolfo pioneered and patented in 1996. Founded in 2009, Allure Security Technology was created based on work done under DARPA sponsorship in Columbia's IDS lab, in response to DARPA prompts to research how to detect hackers once they are inside an organization's perimeter and how to continuously authenticate a user without a password. Stolfo's company Electronic Digital Documents produced a “DataBlade” technology, which Informix marketed during their strategy of acquisition and development in the mid 80's. Stolfo's patented merge/purge technology called EDD DataCleanser DataBlade was licensed by Informix. 
Since its acquisition by IBM in 2005, IBM Informix is one of the world's most widely used database servers, with users ranging from the world's largest corporations to startups. System Detection was one of the companies founded by Prof. Stolfo to commercialize the Anomaly Detection technology developed in the IDS lab. The company ultimately reorganized and was rebranded as Trusted Computer Solutions. That company was recently acquired by Raytheon. Recently a jury awarded Columbia University $185 million for patent infringement for one of Prof. Stolfo's inventions, the Application Communities technology. https://news.columbia.edu/news/columbia-university-awarded-185-million-patent-infringement-nortonlifelock-inc. The final order from the judge applied nearly treble damages: https://www.reuters.com/legal/litigation/gen-digital-owes-columbia-481-mln-us-patent-fight-judge-says-2023-10-02/

    Read more →
  • Janyce Wiebe

    Janyce Wiebe

    Janyce Marbury Wiebe (1959–2018) was an American computer scientist specializing in natural language processing and known for her work on subjectivity, sentiment analysis, opinion mining, discourse processing, and word-sense disambiguation. == Early life and education == Wiebe was born in 1959 in Albany, New York. She majored in English at Binghamton University, graduating in 1981, and completed a Ph.D. in computer science in 1990 at the University at Buffalo. Her dissertation, Recognizing Subjective Sentences: A Computational Investigation of Narrative Text, was supervised by philosopher William J. Rapaport. == Career == After postdoctoral research at the University of Toronto, she became an assistant professor at New Mexico State University in 1992. In 2000, she moved to the University of Pittsburgh, where she became a professor of computer science and director of the Intelligent Systems Program. == Recognition == Wiebe was named a Fellow of the Association for Computational Linguistics in 2015. == Death == She died of leukemia on December 10, 2018.

    Read more →