Apache Pig

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language. == History == Apache Pig was originally developed at Yahoo Research around 2006 for researchers to have an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007, it was moved into the Apache Software Foundation. === Naming === Regarding the naming of the Pig programming language, the name was chosen arbitrarily and stuck because it was memorable, easy to spell, and for novelty. The story goes that the researchers working on the project initially referred to it simply as 'the language'. Eventually they needed to call it something. Off the top of his head, one researcher suggested Pig, and the name stuck. It is quirky yet memorable and easy to spell. While some have hinted that the name sounds coy or silly, it has provided us with an entertaining nomenclature, such as Pig Latin for the language, Grunt for the shell, and PiggyBank for the CPAN-like shared repository. == Example == Below is an example of a "Word Count" program in Pig Latin: The above program will generate parallel executable tasks which can be distributed across multiple machines in a Hadoop cluster to count the number of words in a dataset such as all the webpages on the internet. == Pig vs SQL == In comparison to SQL, Pig has a nested relational model, uses lazy evaluation, uses extract, transform, load (ETL), is able to store data at any point during a pipeline, declares execution plans, supports pipeline splits, thus allowing workflows to proceed along DAGs instead of strictly sequential pipelines. On the other hand, it has been argued DBMSs are substantially faster than the MapReduce system once the data is loaded, but that loading the data takes considerably longer in the database systems. It has also been argued RDBMSs offer out of the box support for column-storage, working with compressed data, indexes for efficient random data access, and transaction-level fault tolerance. Pig Latin is procedural and fits very naturally in the pipeline paradigm while SQL is instead declarative. In SQL users can specify that data from two tables must be joined, but not what join implementation to use (You can specify the implementation of JOIN in SQL, thus "... for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm."). Pig Latin allows users to specify an implementation or aspects of an implementation to be used in executing a script in several ways. In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task. SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built in mechanism for splitting a data processing stream and applying different operators to each sub-stream. Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline. Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.

Luminoso

Luminoso is a Cambridge, MA-based text analytics and artificial intelligence company. It spun out of the MIT Media Lab and its crowd-sourced Open Mind Common Sense (OMCS) project. The company has raised $20.6 million in financing, and its clients include Sony, Autodesk, Scotts Miracle-Gro, and GlaxoSmithKline. == History == Luminoso was co-founded in 2010 by Dennis Clark, Jason Alonso, Robyn Speer, and Catherine Havasi, a research scientist at MIT in artificial intelligence and computational linguistics. The company builds on the knowledge base of MIT’s Open Mind Common Sense (OMCS) project, co-founded in 1999 by Havasi, who continues to serve as its director. The OCMS knowledge base has since been combined with knowledge from other crowdsourced resources to become ConceptNet. ConceptNet consists of approximately 28 million statements in 304 languages, with full support for 10 languages and moderate support for 77 languages. ConceptNet is a resource for making an AI that understands the meanings of the words people use. During the World Cup in June 2014, the company provided a widely reported real-time sentiment analysis of the U.S. vs. Germany match, analyzing 900,000 posts on Twitter, Facebook and Google+. == Applications == The company uses artificial intelligence, natural language processing, and machine learning to derive insights from unstructured data such as contact center interactions, chatbot and live chat transcripts, product reviews, open-ended survey responses, and email. Luminoso's software identifies and quantifies patterns and relationships in text-based data, including domain-specific or creative language. Rather than human-powered keyword searches of data, the software automates taxonomy creation around concepts, allowing related words and phrases to be dynamically generated and tracked. Commercial applications include analyzing, prioritizing, and routing contact center interactions; identifying consumer complaints before they begin to trend; and tracking sentiment during product launches. The software natively analyzes text in fourteen languages, as well as emoji. == Products == Luminoso's technology can be accessed via two products: Luminoso Daylight and Luminoso Compass. Luminoso Daylight enables a deep-dive analysis into batch or real-time data, whereas Luminoso Compass automates the categorization of real-time data. Both products offer a user interface as well as an API. Luminoso's products can be implemented through either a cloud-based or an on-premise solution. == Research == Luminoso continues to actively conduct research in natural language processing and word embeddings and regularly participates in evaluations such as SemEval. At SemEval 2017, Luminoso participated in Task 2, measuring the semantic similarity of word pairs within and across five languages. Its solution outperformed all competing systems in every language pair tested, with the exception of Persian. == Recognition == Luminoso has been listed as a "Cool Vendor in AI for Marketing" by Gartner, and has also been named a "Boston Artificial Intelligence Startup to Watch" by BostInno. In May 2017, Luminoso was recognized as having the Best Application for AI in the Enterprise by AI Business, and was also shortlisted as the Best AI Breakthrough and Best Innovation in NLP. == Competitors == Major competitors include Clarabridge and Lexalytics. == Investors == The company raised $1.5 million from angel investors led by Basis Technology in 2012. Its first institutional funding round of $6.5 was completed in July 2014, led by Acadia Woods with participation from Japan’s Digital Garage. The company followed that with a $10M series B funding round in December 2018, led by DVI Equity Partners, with participation from Liberty Global Ventures, DF Enterprises, Raptor Holdco, Acadia Woods Partners, and Accord Ventures, among others.

Timeline of algorithms

The following timeline of algorithms outlines the development of algorithms (mainly "mathematical recipes") since their inception. == Antiquity == Before – writing about "recipes" (on cooking, rituals, agriculture and other themes) c. 1700–2000 BC – Egyptians develop earliest known algorithms for multiplying two numbers c. 1600 BC – Babylonians develop earliest known algorithms for factorization and finding square roots c. 300 BC – Euclid's algorithm c. 200 BC – the Sieve of Eratosthenes 263 AD – Gaussian elimination described by Liu Hui == Medieval Period == 628 – Chakravala method described by Brahmagupta c. 820 – Al-Khawarizmi described algorithms for solving linear equations and quadratic equations in his Algebra; the word algorithm comes from his name 825 – Al-Khawarizmi described the algorism, algorithms for using the Hindu–Arabic numeral system, in his treatise On the Calculation with Hindu Numerals, which was translated into Latin as Algoritmi de numero Indorum, where "Algoritmi", the translator's rendition of the author's name gave rise to the word algorithm (Latin algorithmus) with a meaning "calculation method" c. 850 – cryptanalysis and frequency analysis algorithms developed by Al-Kindi (Alkindus) in A Manuscript on Deciphering Cryptographic Messages, which contains algorithms on breaking encryptions and ciphers c. 1025 – Ibn al-Haytham (Alhazen), was the first mathematician to derive the formula for the sum of the fourth powers, and in turn, he develops an algorithm for determining the general formula for the sum of any integral powers c. 1400 – Ahmad al-Qalqashandi gives a list of ciphers in his Subh al-a'sha which include both substitution and transposition, and for the first time, a cipher with multiple substitutions for each plaintext letter; he also gives an exposition on and worked example of cryptanalysis, including the use of tables of letter frequencies and sets of letters which can not occur together in one word == Before 1940 == 1540 – Lodovico Ferrari discovered a method to find the roots of a quartic polynomial 1545 – Gerolamo Cardano published Cardano's method for finding the roots of a cubic polynomial 1614 – John Napier develops method for performing calculations using logarithms 1671 – Newton–Raphson method developed by Isaac Newton 1690 – Newton–Raphson method independently developed by Joseph Raphson 1706 – John Machin develops a quickly converging inverse-tangent series for π and computes π to 100 decimal places 1768 – Leonhard Euler publishes his method for numerical integration of ordinary differential equations in problem 85 of Institutiones calculi integralis 1789 – Jurij Vega improves Machin's formula and computes π to 140 decimal places, 1805 – FFT-like algorithm known by Carl Friedrich Gauss 1842 – Ada Lovelace writes the first algorithm for a computing engine 1903 – A fast Fourier transform algorithm presented by Carle David Tolmé Runge 1918 - Soundex 1926 – Borůvka's algorithm 1926 – Primary decomposition algorithm presented by Grete Hermann 1927 – Hartree–Fock method developed for simulating a quantum many-body system in a stationary state. 1934 – Delaunay triangulation developed by Boris Delaunay 1936 – Turing machine, an abstract machine developed by Alan Turing, with others developed the modern notion of algorithm. == 1940s == 1942 – A fast Fourier transform algorithm developed by G.C. Danielson and Cornelius Lanczos 1945 – Merge sort developed by John von Neumann 1947 – Simplex algorithm developed by George Dantzig == 1950s == 1950 – Hamming codes developed by Richard Hamming 1952 – Huffman coding developed by David A. Huffman 1953 – Simulated annealing introduced by Nicholas Metropolis 1954 – Radix sort computer algorithm developed by Harold H. Seward 1964 – Box–Muller transform for fast generation of normally distributed numbers published by George Edward Pelham Box and Mervin Edgar Muller. Independently pre-discovered by Raymond E. A. C. Paley and Norbert Wiener in 1934. 1956 – Kruskal's algorithm developed by Joseph Kruskal 1956 – Ford–Fulkerson algorithm developed and published by R. Ford Jr. and D. R. Fulkerson 1957 – Prim's algorithm developed by Robert Prim 1957 – Bellman–Ford algorithm developed by Richard E. Bellman and L. R. Ford, Jr. 1959 – Dijkstra's algorithm developed by Edsger Dijkstra 1959 – Shell sort developed by Donald L. Shell 1959 – De Casteljau's algorithm developed by Paul de Casteljau 1959 – QR factorization algorithm developed independently by John G.F. Francis and Vera Kublanovskaya 1959 – Rabin–Scott powerset construction for converting NFA into DFA published by Michael O. Rabin and Dana Scott == 1960s == 1960 – Karatsuba multiplication 1961 – CRC (Cyclic redundancy check) invented by W. Wesley Peterson 1962 – AVL trees 1962 – Quicksort developed by C. A. R. Hoare 1962 – Bresenham's line algorithm developed by Jack E. Bresenham 1962 – Gale–Shapley 'stable-marriage' algorithm developed by David Gale and Lloyd Shapley 1964 – Heapsort developed by J. W. J. Williams 1964 – multigrid methods first proposed by R. P. Fedorenko 1965 – Cooley–Tukey algorithm rediscovered by James Cooley and John Tukey 1965 – Levenshtein distance developed by Vladimir Levenshtein 1965 – Cocke–Younger–Kasami (CYK) algorithm independently developed by Tadao Kasami 1965 – Buchberger's algorithm for computing Gröbner bases developed by Bruno Buchberger 1965 – LR parsers invented by Donald Knuth 1966 – Dantzig algorithm for shortest path in a graph with negative edges 1967 – Viterbi algorithm proposed by Andrew Viterbi 1967 – Cocke–Younger–Kasami (CYK) algorithm independently developed by Daniel H. Younger 1968 – A graph search algorithm described by Peter Hart, Nils Nilsson, and Bertram Raphael 1968 – Risch algorithm for indefinite integration developed by Robert Henry Risch 1969 – Strassen algorithm for matrix multiplication developed by Volker Strassen == 1970s == 1970 – Dinic's algorithm for computing maximum flow in a flow network by Yefim (Chaim) A. Dinitz 1970 – Knuth–Bendix completion algorithm developed by Donald Knuth and Peter B. Bendix 1970 – BFGS method of the quasi-Newton class 1970 – Needleman–Wunsch algorithm published by Saul B. Needleman and Christian D. Wunsch 1972 – Edmonds–Karp algorithm published by Jack Edmonds and Richard Karp, essentially identical to Dinic's algorithm from 1970 1972 – Graham scan developed by Ronald Graham 1972 – Red–black trees and B-trees discovered 1973 – RSA encryption algorithm discovered by Clifford Cocks 1973 – Jarvis march algorithm developed by R. A. Jarvis 1973 – Hopcroft–Karp algorithm developed by John Hopcroft and Richard Karp 1974 – Pollard's p − 1 algorithm developed by John Pollard 1974 – Quadtree developed by Raphael Finkel and J.L. Bentley 1975 – Genetic algorithms popularized by John Holland 1975 – Pollard's rho algorithm developed by John Pollard 1975 – Aho–Corasick string matching algorithm developed by Alfred V. Aho and Margaret J. Corasick 1975 – Cylindrical algebraic decomposition developed by George E. Collins 1976 – Salamin–Brent algorithm independently discovered by Eugene Salamin and Richard Brent 1976 – Knuth–Morris–Pratt algorithm developed by Donald Knuth and Vaughan Pratt and independently by J. H. Morris 1977 – Boyer–Moore string-search algorithm for searching the occurrence of a string into another string. 1977 – RSA encryption algorithm rediscovered by Ron Rivest, Adi Shamir, and Len Adleman 1977 – LZ77 algorithm developed by Abraham Lempel and Jacob Ziv 1977 – multigrid methods developed independently by Achi Brandt and Wolfgang Hackbusch 1978 – LZ78 algorithm developed from LZ77 by Abraham Lempel and Jacob Ziv 1978 – Bruun's algorithm proposed for powers of two by Georg Bruun 1979 – Khachiyan's ellipsoid method developed by Leonid Khachiyan 1979 – ID3 decision tree algorithm developed by Ross Quinlan == 1980s == 1980 – Brent's Algorithm for cycle detection Richard P. Brendt 1981 – Quadratic sieve developed by Carl Pomerance 1981 – Smith–Waterman algorithm developed by Temple F. Smith and Michael S. Waterman 1983 – Simulated annealing developed by S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi 1983 – Classification and regression tree (CART) algorithm developed by Leo Breiman, et al. 1984 – LZW algorithm developed from LZ78 by Terry Welch 1984 – Karmarkar's interior-point algorithm developed by Narendra Karmarkar 1984 – ACORN PRNG discovered by Roy Wikramaratna and used privately 1985 – Simulated annealing independently developed by V. Cerny 1985 – Car–Parrinello molecular dynamics developed by Roberto Car and Michele Parrinello 1985 – Splay trees discovered by Sleator and Tarjan 1986 – Blum Blum Shub proposed by L. Blum, M. Blum, and M. Shub 1986 – Push relabel maximum flow algorithm by Andrew Goldberg and Robert Tarjan 1986 – Barnes–Hut tree method developed by Josh Barnes and Piet Hut for fast approximate simulation of n-body problems 1987 – Fast multipole method developed by Leslie Greengard and Vladimir

Metadata controller

Metadata controller (or MDC) is a storage area network (SAN) technology for managing file locking, space allocation and data access authorization. This is needed when several clients are given block level access to the same disk volume, data storage sharing. MDCs are only used on high-end servers. These are never found on user computers. In the absence of MDC over a SAN there is no possible way of ensuring privacy of the stored data. This controller can also play its role as a sharing device in case the administrators allow other servers to access certain blocks in a particular SAN. The access granted to the servers is of different levels. Some times it may happen that the server is not able to see a block or make changes in it in case of a locked file. This is caused by grant of low level access. If different clients on SAN happen to know each other, access may be granted to shift a certain block from one server to another. This allows the recipient server to use the block and make changes in it. MDCs work as enzymes. They require certain types of SANs and networks to work properly. If a controller is connected to the right network it will boost its output. In case of wrong connection i.e. with the incorrect network, it will decrease its performance.

Webometrics

The science of webometrics (also referred to as cybermetrics) aims to quantify the World Wide Web to get knowledge about the number and types of hyperlinks, the structure of the World Wide Web, and using patterns. According to Björneborn and Ingwersen, the definition of webometrics is "the study of the quantitative aspects of the construction and use of information resources, structures and technologies on the Web drawing on bibliometric and informetric approaches." The term webometrics was coined by Almind and Ingwersen (1997). A second definition of webometrics has also been introduced, "the study of web-based content with primarily quantitative methods for social science research goals using techniques that are not specific to one field of study", which emphasizes the development of applied methods for use in the wider social sciences. The purpose of this alternative definition was to help publicize appropriate methods outside the information-science discipline rather than to replace the original definition within information science. Similar scientific fields are: bibliometrics, informetrics, scientometrics, virtual ethnography, and web mining. One relatively straightforward measure is the "web impact factor" (WIF) introduced by Ingwersen (1998). The WIF measure may be defined as the number of web pages in a web site receiving links from other web sites, divided by the number of web pages published in the site that are accessible to the crawler. However, the use of WIF has been disregarded due to the mathematical artifacts derived from power law distributions of these variables. Other similar indicators using size of the institution instead of number of webpages have been proved more useful.

Radek Maneuver

The Radek Maneuver is a scale-up-then-scale-down tactic used in the administration of web services, specifically those deployed under a cloud computing paradigm (by a provider e.g. Amazon Elastic Compute Cloud or Microsoft Azure). == History == Developed by Olivier "Radek" Dabrowski in the mid-2010s, the Radek Maneuver was originally conceived of in using and maintaining applications running on a PaaS system. == Execution == The Radek Maneuver consists of a series of steps, usually executed via the PaaS or web portal interface. The tactic should be used when a service is misbehaving or otherwise experiencing errors, and the suspected cause is the underlying cloud layer, rather than the application layer. This includes networking issues and other "bad box" problems. The steps are as follows: Identify the application or service which is misbehaving. Increase the compute resource (number of CPU cores, amount of ram) for the instance on which the application is running. This is also known as scaling up. Wait for the application to re-deploy and stabilize. Scale back down to the original instance size. == Principle of action == This scale-up-scale-down method is understood to shift the application to a different physical machine underlying the PaaS service or application virtual machine. While this layer of the cloud computing stack is generally out of the access of an application developer (instead in the hands of the cloud provider), the maneuver allows troubleshooting and dodging errors in that layer.

Information scientist

The term information scientist developed in the latter part of the twentieth century by Wm. Hovey Smith to describe an individual, usually with a relevant subject degree (such as one in Information and Computer Science - CIS) or high level of subject knowledge, providing focused information to scientific and technical research staff in industry. It is a role quite distinct from and complementary to that of a librarian. Developments in end-user searching, together with some convergence between the roles of librarian and information scientist, have led to a diminution in its use in this context, and the term information officer or information professional (information specialist) are also now used. The term was, and is, also used for an individual carrying out research in information science. Brian C. Vickery mentions that the Institute of Information Scientists (IIS) was established in London during 1958 and lists the criteria put forward by this institute "Criteria for Information Science" (appendix 1) as well as his own "Areas of study in information science" (appendix 2). The IIS merged with the Library Association in 2002 to form the Chartered Institute of Library and Information Professionals (CILIP). == Notable Information Scientists == See also Award of Merit - Association for Information Science and Technology Marcia Bates David Blair (information technologist) Samuel C. Bradford Michael Buckland John M. Carroll Blaise Cronin Emilia Currás Brenda Dervin Eugene Garfield Paul B. Kantor Frederick Wilfrid Lancaster Calvin Mooers Tefko Saracevic Linda C. Smith Robert Saxton Taylor Brian Campbell Vickery Thomas D. Wilson == Additional reading == Ellis, David and Merete Haugan. (1997) "Modelling the information seeking patterns of engineers and research scientists in an industrial environment" (Journal of Documentation, Volume 53(4): pp. 384–403) Poole, Alex H. (2024). "'There's a big difference between going through life with the wind at your back, and going through life leaning into the wind': Feminism in Post-World War II Information Science". Proceedings of the Association for Information Science and Technology. 61: 300–313. doi:10.1002/pra2.1029. Vickery, Brian Campbell (1988) "Essays presented to B. C. Vickery" (Journal of Documentation, Volume 44, pp. 199–283). Vickery, B. & Vickery, A. (1987) Information Science in theory and practice (London: Bowker-Saur, pp. 361–369)