PureXML

PureXML

pureXML is the native XML storage feature in the IBM Db2 data server. pureXML provides query languages, storage technologies, indexing technologies, and other features to support XML data. The word pure in pureXML was chosen to indicate that Db2 natively stores and natively processes XML data in its inherent hierarchical structure, as opposed to treating XML data as plain text or converting it into a relational format. == Technical information == Db2 includes two distinct storage mechanisms: one for efficiently managing traditional SQL data types, and another for managing XML data. The underlying storage mechanism is transparent to users and applications; they simply use SQL (including SQL with XML extensions or SQL/XML) or XQuery to work with the data. XML data is stored in columns of Db2 tables that have the XML data type. XML data is stored in a parsed format that reflects the hierarchical nature of the original XML data. As such, pureXML uses trees and nodes as its model for storing and processing XML data. If you instruct Db2 to validate XML data against an XML schema prior to storage, Db2 annotates all nodes in the XML hierarchy with information about the schema types; otherwise, it will annotate the nodes with default type information. Upon storage, Db2 preserves the internal structure of XML data, converting its tag names and other information into integer values. Doing so helps conserve disk space and also improves the performance of queries that use navigational expressions. However, users aren't aware of this internal representation. Finally, Db2 automatically splits XML nodes across multiple database pages, as needed. XML schemas specify which XML elements are valid, in what order these elements should appear in XML data, which XML data types are associated with each element, and so on. pureXML allows you to validate the cells in a column of XML data against no schema, one schema, or multiple schemas. pureXML also provides tools to support evolving XML schemas. IBM has enhanced its programming language interfaces to support access to its XML data. These enhancements span Java (JDBC), C (embedded SQL and call-level interface), COBOL (embedded SQL), PHP, and Microsoft's .NET Framework (through the DB2.NET provider). == History == pureXML was first included in the DB2 9 for Linux, Unix, and Microsoft Windows release, which was codenamed Viper, in June 2006. It was available on DB2 9 for z/OS in March 2007. In October 2007, IBM released DB2 9.5 with improved XML data transaction performance and improved storage savings. In June 2009, IBM released DB2 9.7 with XML supported for database-partitioned, range-partitioned, and multi-dimensionally clustered tables as well as compression of XML data and indices. == Competition == Db2 is a hybrid data server—it offers data management for traditional relational data, as well as providing native XML data management. Other vendors that offer data management for both relational data and native XML storage include Oracle with its 11g product and Microsoft with its SQL Server product. pureXML also competes with native XML databases like BaseX, eXist, MarkLogic or Sedna. == Books == IBM International Technical Support Organization (ITSO) has published the following books, which are available in print or as free e-books: DB2 9: pureXML Overview and Fast Start DB2 9 pureXML Guide The following books are also available for purchase: DB2 pureXML Cookbook: Master the Power of IBM Hybrid Data Server == Education and training == The following pureXML classroom and online courses are available from IBM Education: Query and Manage XML Data with DB2 9. IBM course CG130. Classroom. Duration: 4 days. Query XML Data with DB2 9. IBM course CG100. Classroom. Duration: 2 days (first 2 days of CG130). Managing XML Data in DB2 9. IBM course CG160. Classroom. Duration: 2 days (last 2 days of CG130). DB2 pureXML. IBM Course CT140. Self-paced study plus Live Virtual Classroom.

PerfKitBenchmarker

PerfKit Benchmarker is an open source benchmarking tool used to measure and compare cloud offerings. PerfKit Benchmarker is licensed under the Apache 2 license terms. PerfKit Benchmarker is a community effort involving over 500 participants including researchers, academic institutions and companies together with the originator, Google. == General == PerfKit Benchmarker (PKB) is a community effort to deliver a repeatable, consistent, and open way of measuring Cloud Performance. It supports a growing list of cloud providers including: Alibaba Cloud, Amazon Web Services, CloudStack, DigitalOcean, Google Cloud Platform, Kubernetes, Microsoft Azure, OpenStack, Rackspace, IBM Bluemix (Softlayer). In addition to Cloud Providers to supports container orchestration including Kubernetes [1] and Mesos [2] and local "static" workstations and clusters of computers [3]. The goal is to create an open source living benchmark [framework] that represents how Cloud developers are building applications, evaluating Cloud alternatives, learning how to architect applications for each cloud. Living because it will change and morph quickly as developers change. PerfKit Benchmarker measures the end to end time to provision resources in the cloud, in addition to reporting on the most standard metrics of peak performance, e.g.: latency, throughput, time-to-complete, IOPS. PerfKit Benchmarker reduces the complexity in running benchmarks on supported cloud providers by unified and simple commands. It's designed to operate via vendor provided command line tools. PerfKit Benchmarker contains a canonical set of public benchmarks. All benchmarks are running with default/initial state and configuration (Not tuned to in favor of any providers). This provides a way to benchmark across cloud platforms, while getting a transparent view of application throughput, latency, variance, and overhead. == History == PerfKit Benchmarker (PKB) was started by Anthony F. Voellm, Alain Hamel, and Eric Hankland at Google in 2014. Once an initial "alpha" was in place Anthony F. Voellm and Ivan Santa Maria Filho built a community including ARM, Broadcom, Canonical, CenturyLink, Cisco, CloudHarmony, CloudSpectator, EcoCloud@EPFL, Intel, Mellanox, Microsoft, Qualcomm Technologies, Inc., Rackspace, Red Hat, Tradeworx Inc., and Thesys Technologies LLC. This community worked together behind the scenes in a private GitHub project to create an open way to measure cloud performance. This community released the first public "beta" was released on February 11, 2015, and announced in a blog post at which point the GitHub project was open to everyone. After almost a year and with large adaption (600+ participants on GitHub) the V1.0.0 was released along with a detailed architectural design on December 10, 2015. == Benchmarks == A list of available benchmarks from PerfKitBenchmarker: (The latest set of benchmarks can be found at GitHub readme file.) == Industry participants == Since Google open sourced the PerfKitBenchmarker, it became a community effort from over 30 leading researchers, academic schools and industry companies. Those organizations include: ARM, Broadcom, Canonical, CenturyLink, Cisco, CloudHarmony, Cloud Spectator, EcoCloud@EPFL, Intel, Mellanox, Microsoft, Qualcomm Technologies, Rackspace, Red Hat, and Thesys Technologies. In addition, Stanford and MIT are leading quarterly discussions on default benchmarks and settings proposed by the community. EcoCloud@EPFL is integrating CloudSuite into PerfKit Benchmarker. == Example runs == On Google Cloud Platform On AWS On Azure On Rackspace On a local machine

Joseph Nechvatal

Joseph Nechvatal (born January 15, 1951) is an American post-conceptual digital artist and art theoretician who creates computer-assisted paintings and computer animations, often using custom computer viruses. == Life and work == Joseph Nechvatal was born in Chicago. He studied fine art and philosophy at Southern Illinois University Carbondale, Cornell University, and Columbia University. He earned a Doctor of Philosophy in Philosophy of Art and Technology at the Planetary Collegium at University of Wales, Newport and has taught art theory and art history at the School of Visual Arts. He has had many solo exhibitions and is one of five artists that art historian Patrick Frank examines in his 2024 book Art of the 1980s: As If the Digital Mattered. His work in the late 1970s and early 1980s chiefly consisted of postminimal gray palimpsest-like drawings that were often photo-mechanically enlarged. Beginning in 1979 he became associated with the artist group Colab, organized the Public Arts International/Free Speech series, and helped established the non-profit group ABC No Rio. In 1983 he co-founded the avant-garde electronic art music audio project Tellus Audio Cassette Magazine. In 1984, Nechvatal began work on an opera called XS: The Opera Opus (1984-6) with the no wave musical composer Rhys Chatham. He began using computers and robotics to make post-conceptual paintings in 1986 and later, in his signature work, began to employ self-created computer viruses. From 1991 to 1993, he was artist-in-residence at the Louis Pasteur Atelier in Arbois, France and at the Saline Royale/Ledoux Foundation's computer lab. There he worked on The Computer Virus Project, his first artistic experiment with computer viruses and computer virus animation. He exhibited computer-robotic paintings at Documenta 8 in 1987. In 2002 he extended his experimentation into viral artificial life through a collaboration with the programmer Stephane Sikora of music2eye in a work called the Computer Virus Project II. Nechvatal has also created a noise music work called viral symphOny, a collaborative sound symphony created by using his computer virus software at the Institute for Electronic Arts at Alfred University. In 2021 Pentiments released Nechvatal's retrospective audio cassette called Selected Sound Works (1981-2021) and in 2022 his The Viral Tempest, a double vinyl LP of new audio work. In 2025, he joined the roster of artists/musicians at Table of the Elements with two CD/book releases: Selected Sound Works (1981-2021) and The Marriage of Orlando and Artaud, Even. From 1999 to 2013, Nechvatal taught art theories of immersive virtual reality and the viractual at the School of Visual Arts in New York City (SVA). A book of his collected essays entitled Towards an Immersive Intelligence: Essays on the Work of Art in the Age of Computer Technology and Virtual Reality (1993–2006) was published by Edgewise Press in 2009. Also in 2009, his virtual reality art theory and art history book Immersive Ideals / Critical Distances was published. In 2011, his book Immersion Into Noise was published by Open Humanities Press in conjunction with the University of Michigan Library's Scholarly Publishing Office. Nechvatal has also published three books with Punctum Books: Minóy (noise music—ed.—2014), Destroyer of Naivetés (poetry—2015), and Styling Sagaciousness (poetry—2022). In 2023 his art theory cybersex farce novella venus©~Ñ~vibrator, even was published by Orbis Tertius Press The Joseph Nechvatal archive is housed at The Fales Library Downtown Collection at the NYU Special Collections Library in New York City. === Viractualism === Viractualism is an art theory concept developed by Nechvatal in 1999 from Ph.D. research Nechvatal conducted at the Planetary Collegium at University of Wales, Newport. There he developed his concept of the viractual, which strives to create an interface between the actual and the virtual.

Apache Giraph

Apache Giraph is an Apache project to perform graph processing on big data. Giraph utilizes Apache Hadoop's MapReduce implementation to process graphs. Facebook used Giraph with some performance improvements to analyze one trillion edges using 200 machines in 4 minutes. Giraph is based on a paper published by Google about its own graph processing system called Pregel. It can be compared to other Big Graph processing libraries such as Cassovary. As of September 2023, it is no longer actively developed.

Tanagra (machine learning)

Tanagra is a free suite of machine learning software for research and academic purposes developed by Ricco Rakotomalala at the Lumière University Lyon 2, France. Tanagra supports several standard data mining tasks such as: Visualization, Descriptive statistics, Instance selection, feature selection, feature construction, regression, factor analysis, clustering, classification and association rule learning. Tanagra is an academic project. It is widely used in French-speaking universities. Tanagra is frequently used in real studies and in software comparison papers. == History == The development of Tanagra was started in June 2003. The first version was distributed in December 2003. Tanagra is the successor of Sipina, another free data mining tool which is intended only for supervised learning tasks (classification), especially the interactive and visual construction of decision trees. Sipina is still available online and is maintained. Tanagra is an "open source project" as every researcher can access the source code and add their own algorithms, as long as they agree and conform to the software distribution license. The main purpose of the Tanagra project is to give researchers and students a user-friendly data mining software, conforming to the present norms of the software development in this domain (especially in the design of its GUI and the way to use it), and allowing the analyzation of either real or synthetic data. From 2006, Ricco Rakotomalala made an important documentation effort. A large number of tutorials are published on a dedicated website. They describe the statistical and machine learning methods and their implementation with Tanagra on real case studies. The use of other free data mining tools on the same problems is also widely described. The comparison of the tools enables readers to understand the possible differences in the presentation of results. == Description == Tanagra works similarly to current data mining tools. The user can design visually a data mining process in a diagram. Each node is a statistical or machine learning technique, the connection between two nodes represents the data transfer. But unlike the majority of tools which are based on the workflow paradigm, Tanagra is very simplified. The treatments are represented in a tree diagram. The results are displayed in an HTML format. This makes it is easy to export the outputs in order to visualize the results in a browser. It is also possible to copy the result tables to a spreadsheet. Tanagra makes a good compromise between statistical approaches (e.g. parametric and nonparametric statistical tests), multivariate analysis methods (e.g. factor analysis, correspondence analysis, cluster analysis, regression) and machine learning techniques (e.g. neural network, support vector machine, decision trees, random forest).

Data remanence

Data remanence is the residual representation of digital data that remains even after attempts have been made to remove or erase the data. This residue may result from data being left intact by a nominal file deletion operation, by reformatting of storage media that does not remove data previously written to the media, or through physical properties of the storage media that allow previously written data to be recovered. Data remanence may make inadvertent disclosure of sensitive information possible should the storage media be released into an uncontrolled environment (e.g., thrown in refuse containers or lost). Various techniques have been developed to counter data remanence. These techniques are classified as clearing, purging/sanitizing, or destruction. Specific methods include overwriting, degaussing, encryption, and media destruction. Effective application of countermeasures can be complicated by several factors, including media that are inaccessible, media that cannot effectively be erased, advanced storage systems that maintain histories of data throughout the data's life cycle, and persistence of data in memory that is typically considered volatile. Several standards exist for the secure removal of data and the elimination of data remanence. == Causes == Many operating systems, file managers, and other software provide a facility where a file is not immediately deleted when the user requests that action. Instead, the file is moved to a holding area (i.e. the "trash"), making it easy for the user to undo a mistake. Similarly, many software products automatically create backup copies of files that are being edited, to allow the user to restore the original version, or to recover from a possible crash (autosave feature). Even when an explicit deleted file retention facility is not provided or when the user does not use it, operating systems do not actually remove the contents of a file when it is deleted unless they are aware that explicit erasure commands are required, like on a solid-state drive. (In such cases, the operating system will issue the Serial ATA TRIM command or the SCSI UNMAP command to let the drive know to no longer maintain the deleted data.) Instead, they simply remove the file's entry from the file system directory because this requires less work and is therefore faster, and the contents of the file—the actual data—remain on the storage medium. The data will remain there until the operating system reuses the space for new data. In some systems, enough filesystem metadata are also left behind to enable easy undeletion by commonly available utility software. Even when undelete has become impossible, the data, until it has been overwritten, can be read by software that reads disk sectors directly. Computer forensics often employs such software. Likewise, reformatting, repartitioning, or reimaging a system is unlikely to write to every area of the disk, though all will cause the disk to appear empty or, in the case of reimaging, empty except for the files present in the image, to most software. Finally, even when the storage media is overwritten, physical properties of the media may permit recovery of the previous contents. In most cases however, this recovery is not possible by just reading from the storage device in the usual way, but requires using laboratory techniques such as disassembling the device and directly accessing/reading from its components. § Complications below gives further explanations for causes of data remanence. == Countermeasures == There are three levels commonly recognized for eliminating remnant data: === Clearing === Clearing is the removal of sensitive data from storage devices in such a way that there is assurance that the data may not be reconstructed using normal system functions or software file/data recovery utilities. The data may still be recoverable, but not without special laboratory techniques. Clearing is typically an administrative protection against accidental disclosure within an organization. For example, before a hard drive is re-used within an organization, its contents may be cleared to prevent their accidental disclosure to the next user. === Purging === Purging or sanitizing is the physical rewrite of sensitive data from a system or storage device done with the specific intent of rendering the data unrecoverable at a later time. Purging, proportional to the sensitivity of the data, is generally done before releasing media beyond control, such as before discarding old media, or moving media to a computer with different security requirements. === Destruction === The storage media is made unusable for conventional equipment. Effectiveness of destroying the media varies by medium and method. Depending on recording density of the media, and/or the destruction technique, this may leave data recoverable by laboratory methods. Conversely, destruction using appropriate techniques is the most secure method of preventing retrieval. == Specific methods == === Overwriting === A common method used to counter data remanence is to overwrite the storage media with new data. This is often called wiping or shredding a disk or file, by analogy to common methods of destroying print media, although the mechanism bears no similarity to these. Because such a method can often be implemented in software alone, and may be able to selectively target only part of the media, it is a popular, low-cost option for some applications. Overwriting is generally an acceptable method of clearing, as long as the media is writable and not damaged. The simplest overwrite technique writes the same data everywhere—often just a pattern of all zeros. At a minimum, this will prevent the data from being retrieved simply by reading from the media again using standard system functions. The UEFI in modern machines may offer an ATA class disk erase function as well. The ATA-6 standard governs secure erases specifications. Bitlocker is whole disk encryption and illegible without the key. Writing a fresh GPT allows a new file system to be established. Blocks will set empty but LBA read is illegible. New data will be unaffected and work fine. In an attempt to counter more advanced data recovery techniques, specific overwrite patterns and multiple passes have often been prescribed. These may be generic patterns intended to eradicate any trace signatures; an example is the seven-pass pattern 0xF6, 0x00, 0xFF, , 0x00, 0xFF, , sometimes erroneously attributed to US standard DOD 5220.22-M. One challenge with overwriting is that some areas of the disk may be inaccessible, due to media degradation or other errors. Software overwrite may also be problematic in high-security environments, which require stronger controls on data commingling than can be provided by the software in use. The use of advanced storage technologies may also make file-based overwrite ineffective (see the related discussion below under § Complications). There are specialized machines and software that are capable of doing overwriting. The software can sometimes be a standalone operating system specifically designed for data destruction. There are also machines specifically designed to wipe hard drives to the department of defense specifications DOD 5220.22-M. Writing zero to each block on hard disks and SSDs has the advantage of affording the firmware to deploy spare blocks when bad blocks are identified. Bitlocker has the advantage that data is illegible without the key. Seatools and other tools can erase disks with zero which is typical to revive old consumer class disks but they can wipe server disks albeit slowly. Modern 28TB and larger disks have an enormous number of LBA48 blocks. 40TB and 60TB disks will take proportionately longer times to wipe. ==== Feasibility of recovering overwritten data ==== Peter Gutmann investigated data recovery from nominally overwritten media in the mid-1990s. He suggested magnetic force microscopy may be able to recover such data, and developed specific patterns, for specific drive technologies, designed to counter such. These patterns have come to be known as the Gutmann method. Gutmann's belief in the possibility of data recovery is based on many questionable assumptions and factual errors that indicate a low level of understanding of how hard drives work. Daniel Feenberg, an economist at the private National Bureau of Economic Research, claims that the chances of overwritten data being recovered from a modern hard drive amount to "urban legend". He also points to the "18+1⁄2-minute gap" Rose Mary Woods created on a tape of Richard Nixon discussing the Watergate break-in. Erased information in the gap has not been recovered, and Feenberg claims doing so would be an easy task compared to recovery of a modern high density digital signal. As of November 2007, the United States Department of Defense considers overwriting acceptable for clearing magnetic media within the same security area/

Andrej Mrvar

Andrej Mrvar is a Slovenian computer scientist and a professor at the University of Ljubljana's Faculty of Social Sciences. He is known for his work in network analysis, graph drawing, decision making, virtual reality, timing and data processing of sports competitions. == Education and career == He is well known for his work on Pajek, a free software for analysis and visualization of large networks. Mrvar began work on Pajek in 1996 with Vladimir Batagelj. His book Exploratory Social Network Analysis with Pajek, coauthored with Wouter de Nooy and Vladimir Batagelj, is his most cited work. It was published by Cambridge University Press in three editions (first 2005, second 2011, and third 2018). The book was translated into Japanese (2009) and Chinese (first edition 2012, second 2014). With Anuška Ferligoj, he was a founding co-editor-in-chief of the Metodološki zvezki - Advances in Methodology and Statistics journal. == Awards and honors == Vidmar Award (Faculty of Electrical and Computer Engineering, University of Ljubljana): 1988, 1990 First prizes for contributions (with Vladimir Batagelj) to Graph Drawing Contests in years: 1995, 1996, 1997, 1998, 1999, 2000 and 2005 / Graph Drawing Hall of Fame. Award of University of Ljubljana for contributions in education and research (Svečana listina Univerze v Ljubljani za pomembne dosežke na področju vzgojnoizobraževalnega in znanstvenoraziskovalega dela): 2001 The INSNA's William D. Richards Software award for work on Pajek (with Vladimir Batagelj): 2013 Award of Faculty of Social Sciences, University of Ljubljana for scientific excellence (Priznanje za znanstveno odličnost): 2013 == Selected publications == Wouter de Nooy, Andrej Mrvar, Vladimir Batagelj, Mark Granovetter (Series Editor), Exploratory Social Network Analysis with Pajek (Structural Analysis in the Social Sciences), Cambridge University Press (First Edition: 2005, Second Edition: 2011, Third Edition: 2018 ). Japanese Translation (2010). Chinese Translation (First Edition: 2012, Second Edition: 2014) Andrej Mrvar and Vladimir Batagelj, Analysis and visualization of large networks with program package Pajek. Complex Adaptive Systems Modeling, 4:6. SpringerOpen, 2016 Vladimir Batagelj and Andrej Mrvar, Some Analyses of Erdős Collaboration Graph, Social Networks, 22, 173–186, 2000 Vladimir Batagelj and Andrej Mrvar, A Subquadratic Triad Census Algorithm for Large Sparse Networks with Small Maximum Degree. Social Networks, 23, 237–243, 2001 Patrick Doreian and Andrej Mrvar, A Partitioning Approach to Structural Balance, Social Networks, 18, 149–168, 1996 Patrick Doreian and Andrej Mrvar, Partitioning Signed Social Networks, Social Networks, 31, 1–11, 2009 Andrej Mrvar and Patrick Doreian, Partitioning Signed Two-Mode Networks, Journal of Mathematical Sociology, 33, 196–221, 2009 Patrick Doreian and Andrej Mrvar, The international reach of the Koch brothers network. In: Antonyuk, A. and Basov, N. (Eds.): Networks in the Global World V. NetGloW 2020. Lecture Notes in Networks and Systems, 181, 225–235. Springer, 2021 Patrick Doreian and Andrej Mrvar, Delineating Changes in the Fundamental Structure of Signed Networks, Frontiers in Physics, 294, 1–11, 2021 Patrick Doreian and Andrej Mrvar, Hubs and Authorities in the Koch Brothers Network. Social Networks, Social Networks, 64, 148–157, 2021 Patrick Doreian and Andrej Mrvar, Public issues, policy proposals, social movements, and the interests of the Koch Brothers network of allies, Quality and Quantity, 56, 305–322, 2022 Douglas R. White, Vladimir Batagelj, Andrej Mrvar, Analyzing Large Kinship and Marriage Networks with Pgraph and Pajek. Social Science Computer Review, 17, 245–274, 1999 Ion Georgiou, Ronald Concer, Andrej Mrvar, A Systemic Approach to Sociometric Group Research: Advancing The Work of Leslie Day Zeleny, 1939–1947, Social Networks, 63, 174–200, 2020