AI Chatbot Youtube Ad

AI Chatbot Youtube Ad — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Security awareness

    Security awareness

    Security awareness is the knowledge and attitude members of an organization possess regarding the protection of the physical, and especially informational, assets of that organization. However, it is very tricky to implement because organizations are not able to impose such awareness directly on employees as there are no ways to explicitly monitor people's behavior. That being said, the literature does suggest several ways that such security awareness could be improved. Many organizations require formal security awareness training for all workers when they join the organization and periodically thereafter, usually annually. Another main force that is found to have a strong correlation with employees' security awareness is managerial security participation. It also bridges security awareness with other organizational aspects. == Relationship between Security Awareness and Human Factors == Employees' behavior, cognitive biases, and decision-making processes influence the effectiveness of security measures. Research indicates that psychological factors, such as optimism bias, overconfidence, and habitual behaviors, can undermine security awareness initiatives. To address these challenges, organizations are increasingly using behavioral analytics and security nudges—subtle prompts like password reminders and phishing warnings—to encourage secure behavior. Human error remains the leading cause of cybersecurity incidents. A 2023 IBM Security report found that 95% of breaches are due to human mistakes, including falling for phishing emails, using weak passwords, and mishandling sensitive data. Organizations emphasize security awareness training as a key strategy to mitigate this risk. It is particularly important for leadership to foster a culture of cybersecurity and to provide targeted training to increase security awareness among all employees across the organization. == Coverage == Topics covered in security awareness training include: The nature of sensitive material and physical assets they may come in contact with, such as trade secrets, privacy concerns and government classified information Employee and contractor responsibilities in handling sensitive information, including review of employee nondisclosure agreements Requirements for proper handling of sensitive material in physical form, including marking, transmission, storage and destruction Proper methods for protecting sensitive information on computer systems, including password policy and use of two-factor authentication Other computer security concerns, including malware, phishing, social engineering, etc. Workplace security, including building access, wearing of security badges, reporting of Incidents, forbidden articles, etc. Consequences of failure to properly protect information, including potential loss of employment, economic consequences to the firm, damage to individuals whose private records are divulged, and possible civil and criminal penalties Security awareness means understanding that there is the potential for some people to deliberately or accidentally steal, damage, or misuse the data that is stored within a company's computer systems and throughout its organization. Therefore, it would be prudent to support the assets of the institution (information, physical, and personal) by trying to stop that from happening. According to the European Network and Information Security Agency, "Awareness of the risks and available safeguards is the first line of defence for the security of information systems and networks." "The focus of Security Awareness consultancy should be to achieve a long term shift in the attitude of employees towards security, whilst promoting a cultural and behavioural change within an organisation. Security policies should be viewed as key enablers for the organisation, not as a series of rules restricting the efficient working of your business." == Role of Gamification and Interactive Training == Modern security awareness programs increasingly utilize gamification, phishing simulations, and interactive learning modules. Studies have shown that engaging employees through serious games, reward systems, and real-world attack simulations improves retention and application of security practices. One example is phishing simulation training, where employees receive simulated phishing emails to test their ability to recognize threats. Research indicates that repeated exposure to such exercises leads to long-term improvements in security awareness. == Legislation and Compliance Requirements == Many industries mandate security awareness training to comply with regulations such as: General Data Protection Regulation (GDPR) – requires organizations to ensure data protection awareness among employees. Health Insurance Portability and Accountability Act (HIPAA) – mandates security awareness programs for healthcare providers. Payment Card Industry Data Security Standard (PCI-DSS) – enforces security training for businesses handling payment card information. == Measuring security awareness == In a 2016 study, researchers developed a method of measuring security awareness. Specifically they measured "understanding about circumventing security protocols, disrupting the intended functions of systems or collecting valuable information, and not getting caught" (p. 38). The researchers created a method that could distinguish between experts and novices by having people organize different security scenarios into groups. Experts will organize these scenarios based on centralized security themes where novices will organize the scenarios based on superficial themes. Security awareness is also assessed through real-time security metrics, such as tracking phishing click rates, password reuse tendencies, and policy adherence rates. Organizations are adopting continuous monitoring strategies to provide immediate feedback to employees about risky behavior and suggest corrective actions. == Evolving cyber threats and security awareness strategies == As cyber threats continue to evolve, security awareness programs must adapt to new attack vectors, such as AI-driven cyberattacks, deepfakes, and insider threats. ENISA's Threat Landscape report highlights the increasing prominence of these emerging threats, stressing the need for security measures that address both traditional attacks like ransomware and malware, as well as more sophisticated techniques such as Living Off Trusted Sites (LOTS) and advanced evasion methods used by cybercriminals.

    Read more →
  • Data governance

    Data governance

    Data governance is a term used on both a macro and a micro level. The former is a political concept and forms part of international relations and Internet governance; the latter is a data management concept and forms part of corporate/organizational data governance. Data governance involves delegating authority over data and exercising that authority through decision-making processes. It plays a role in enhancing the value of data assets. == Macro level == Data governance at the macro level involves regulating cross-border data flows among countries, which is more precisely termed international data governance. This field was first formed in the early 2000s, and consists of "norms, principles and rules governing various types of data." There have been several international groups established by research organizations that aim to grant access to their data. These groups that enable an exchange of data are, as a result, exposed to domestic and international legal interpretations that ultimately decide how data is used. However, as of 2023, there are no international laws or agreements specifically focused on data protection. == Data governance (Data Management) == Data governance is the set of principles, policies, and processes that guide the effective and responsible use of data within an organization. It creates a framework for decision making, accountability, and oversight across the data lifecycle, from creation and storage to sharing and disposal. Data governance is closely linked with data management, which provides the practical methods to carry out governance objectives. These methods include data quality assurance, metadata management, master data management, security controls, and compliance monitoring. Together, governance and management aim to maximize the value of data as a strategic asset, reduce risks from misuse or inaccuracy, and ensure compliance with regulatory, ethical, and business requirements. The importance of this discipline has grown with the rise of big data, cloud computing, and artificial intelligence, where consistent standards and stewardship are essential for privacy protection, interoperability, and informed decision making. == Data governance drivers == While data governance initiatives can be driven by a desire to improve data quality, they are often driven by C-level leaders responding to external regulations. In a recent report conducted by the CIO WaterCooler community, 54% stated the key driver was efficiencies in processes; 39% - regulatory requirements; and only 7% customer service. Examples of these regulations include Sarbanes–Oxley Act, Basel I, Basel II, HIPAA, GDPR, cGMP, and a number of data privacy regulations. To achieve compliance with these regulations, business processes and controls require formal management processes to govern the data subject to these regulations. Successful programs identify drivers that are meaningful to both supervisory and executive leadership. Common themes among the external regulations center on the need to manage risk. The risks can be financial misstatement, inadvertent release of sensitive data, or poor data quality for key decisions. Methods to manage these risks vary from industry to industry. Examples of commonly referenced best practices and guidelines include COBIT, ISO/IEC 38500, and others. The proliferation of regulations and standards creates challenges for data governance professionals, particularly when multiple regulations overlap the data being managed. Organizations often launch data governance initiatives to address these challenges. == Data governance initiatives (Dimensions) == Data governance initiatives improve the quality of data by assigning a team responsible for data's accuracy, completeness, consistency, timeliness, validity, and uniqueness. This team usually consists of executive leadership, project management, line-of-business managers, and data stewards. The team usually employs a methodology for tracking and improving enterprise data, such as Six Sigma, and tools for data mapping, profiling, cleansing, and monitoring data. Data governance initiatives may be aimed at achieving a number of objectives including offering better visibility to internal and external customers (such as supply chain management), compliance with regulatory law, improving operations after rapid company growth or corporate mergers, or to aid the efficiency of enterprise knowledge workers by reducing confusion and error and increasing their scope of knowledge. Many data governance initiatives are also inspired by past attempts to fix information quality at the departmental level, which can lead to incongruent and redundant data quality processes. Most large companies have many applications and databases that can not easily share information. Therefore, knowledge workers within large organizations may not have access to the data they need to best do their jobs. When they do have access to the data, the data quality may be poor. By setting up a data governance practice or corporate data authority (individual or area responsible for determining how to proceed, in the best interest of the business, when a data issue arises), these problems can be mitigated. == Implementation == Implementation of a data governance initiative may vary in scope as well as origin. Sometimes, an executive mandate will arise to initiate an enterprise-wide effort. Sometimes the mandate will be to create a pilot project or projects, limited in scope and objectives, aimed at either resolving existing issues or demonstrating value. Sometimes, an initiative originates from lower down in the organization's hierarchy and will be deployed in a limited scope to demonstrate value to potential sponsors higher up in the organization. The initial scope of an implementation can vary greatly as well, from review of a one-off IT system to a cross-organization initiative. == Data governance tools == Leaders of successful data governance programs declared at the Data Governance Conference in Orlando, FL, in December 2006, that data governance is about 80 to 95 percent communication. That stated, it is a given that many of the objectives of a data governance program must be accomplished with appropriate tools. Many vendors are now positioning their products as data governance tools. Due to the different focus areas of various data governance initiatives, a given tool may or may not be appropriate. Additionally, many tools that are not marketed as governance tools address governance needs and demands.

    Read more →
  • Cryptographic nonce

    Cryptographic nonce

    In cryptography, a nonce is an arbitrary number that can be used just once in a cryptographic communication. It is often a random or pseudo-random number issued in an authentication protocol to ensure that each communication session is unique, and therefore that old communications cannot be reused in replay attacks. Nonces can also be useful as initialization vectors and in cryptographic hash functions. == Definition == A nonce is an arbitrary number used only once in a cryptographic communication, in the spirit of a nonce word. They are often random or pseudo-random numbers. Many nonces also include a timestamp to ensure exact timeliness, though this requires clock synchronisation between organisations. The addition of a client nonce ("cnonce") helps to improve the security in some ways as implemented in digest access authentication. To ensure that a nonce is used only once, it should be time-variant (including a suitably fine-grained timestamp in its value), or generated with enough random bits to ensure an insignificantly low chance of repeating a previously generated value. Some authors define pseudo-randomness (or unpredictability) as a requirement for a nonce. Nonce is a word dating back to Middle English for something only used once or temporarily (often with the construction "for the nonce"). It descends from the construction "then anes" ("the one [purpose]"). A false etymology claiming it to stand for "number used once" or similar is incorrect. == Usage == === Authentication === Authentication protocols may use nonces to ensure that old communications cannot be reused in replay attacks. For instance, nonces are used in HTTP digest access authentication to calculate an MD5 digest of the password. The nonces are different each time the 401 authentication challenge response code is presented, thus making replay attacks virtually impossible. The scenario of ordering products over the Internet can provide an example of the usefulness of nonces in replay attacks. An attacker could take the encrypted information and—without needing to decrypt—could continue to send a particular order to the supplier, thereby ordering products over and over again under the same name and purchase information. The nonce is used to give 'originality' to a given message so that if the company receives any other orders from the same person with the same nonce, it will discard those as invalid orders. A nonce may be used to ensure security for a stream cipher. Where the same key is used for more than one message and then a different nonce is used to ensure that the keystream is different for different messages encrypted with that key; often the message number is used. Secret nonce values are used by the Lamport signature scheme as a signer-side secret which can be selectively revealed for comparison to public hashes for signature creation and verification. === Hashing === Nonces are used in proof-of-work systems to vary the input to a cryptographic hash function so as to obtain a hash for a certain input that fulfils certain arbitrary conditions. In doing so, it becomes far more difficult to create a "desirable" hash than to verify it, shifting the burden of work onto one side of a transaction or system. For example, proof of work, using hash functions, was considered as a means to combat email spam by forcing email senders to find a hash value for the email (which included a timestamp to prevent pre-computation of useful hashes for later use) that had an arbitrary number of leading zeroes, by hashing the same input with a large number of values until a "desirable" hash was obtained. Similarly, the Bitcoin blockchain hashing algorithm can be tuned to an arbitrary difficulty by changing the required minimum/maximum value of the hash so that the number of bitcoins awarded for new blocks does not increase linearly with increased network computation power as new users join. This is likewise achieved by forcing Bitcoin miners to add nonce values to the value being hashed to change the hash algorithm output. As cryptographic hash algorithms cannot easily be predicted based on their inputs, this makes the act of blockchain hashing and the possibility of being awarded bitcoins something of a lottery, where the first "miner" to find a nonce that delivers a desirable hash is awarded bitcoins.

    Read more →
  • Data set (IBM mainframe)

    Data set (IBM mainframe)

    In the context of IBM mainframe computers in the IBM System/360 line and its successors, a data set (IBM preferred) or dataset is a computer file having a record organization. Use of this term began with, e.g., DOS/360 and OS/360, and is still used by their successors, including the current VSE and z/OS. Documentation for these systems historically preferred this term rather than file. A data set is typically stored on a direct access storage device (DASD) or magnetic tape, however unit record devices, such as punch card readers, card punches, line printers and page printers can provide input/output (I/O) for a data set (file). Data sets are not unstructured streams of bytes, but rather are organized in various logical record and block structures determined by the DSORG (data set organization), RECFM (record format), and other parameters. These parameters are specified at the time of the data set allocation (creation), for example with Job Control Language DD statements. Within a running program they are stored in the Data Control Block (DCB) or Access Control Block (ACB), which are data structures used to access data sets using access methods. Records in a data set may be fixed, variable, or “undefined” length. == Data set organization == For OS/360, the DCB's DSORG parameter specifies how the data set is organized. It may be CQ Queued Telecommunications Access Method (QTAM) in Message Control Program (MCP) CX Communications line group DA Basic Direct Access Method (BDAM) GS Graphics device for Graphics Access Method(GAM) IS Indexed Sequential Access Method (ISAM) MQ QTAM message queue in application PO Partitioned Organization PS Physical Sequential among others. Data sets on tape may only be DSORG=PS. The choice of organization depends on how the data is to be accessed, and in particular, how it is to be updated. Programmers utilize various access methods (such as QSAM or VSAM) in programs for reading and writing data sets. Access method depends on the given data set organization. == Record format (RECFM) == Regardless of organization, the physical structure of each record is essentially the same, and is uniform throughout the data set. This is specified in the DCB RECFM parameter. RECFM=F means that the records are of fixed length, specified via the LRECL parameter. RECFM=V specifies a variable-length record. V records when stored on media are prefixed by a Record Descriptor Word (RDW) containing the integer length of the record in bytes and flag bits. With RECFM=FB and RECFM=VB, multiple logical records are grouped together into a single physical block on tape or DASD. FB and VB are fixed-blocked, and variable-blocked, respectively. RECFM=U (undefined) is also variable length, but the length of the record is determined by the length of the block rather than by a control field. The BLKSIZE parameter specifies the maximum length of the block. RECFM=FBS could be also specified, meaning fixed-blocked standard, meaning all the blocks except the last one were required to be in full BLKSIZE length. RECFM=VBS, or variable-blocked spanned, means a logical record could be spanned across two or more blocks, with flags in the RDW indicating whether a record segment is continued into the next block and/or was continued from the previous one. This mechanism eliminates the need for using any "delimiter" byte value to separate records. Thus data can be of any type, including binary integers, floating-point, or characters, without introducing a false end-of-record condition. The data set is an abstraction of a collection of records, in contrast to files as unstructured streams of bytes. == Partitioned data set == A partitioned data set (PDS) is a data set containing multiple members, each of which holds a separate sub-data set, similar to a directory in other types of file systems. This type of data set is often used to hold load modules (old format bound executable programs), source program libraries (especially Assembler macro definitions), ISPF screen definitions, and Job Control Language. A PDS may be compared to a Zip file or COM Structured Storage. A Partitioned Data Set can only be allocated on a single volume and have a maximum size of 65,535 tracks. Besides members, a PDS contains also a directory. Each member can be accessed indirectly via the directory structure. Once a member is located, the data stored in that member are handled in the same manner as a PS (sequential) data set. Whenever a member is deleted, the space it occupied is unusable for storing other data. Likewise, if a member is re-written, it is stored in a new spot at the back of the PDS and leaves wasted “dead” space in the middle. The only way to recover “dead” space is to perform file compression. Compression, which is done using the IEBCOPY utility, moves all members to the front of the data space and leaves free usable space at the back. (Note that in modern parlance, this kind of operation might be called defragmentation or garbage collection; data compression nowadays refers to a different, more complicated concept.) PDS files can only reside on DASD, not on magnetic tape, in order to use the directory structure to access individual members. Partitioned data sets are most often used for storing multiple job control language files, utility control statements, and executable modules. An improvement of this scheme is a Partitioned Data Set Extended (PDSE or PDS/E, sometimes just libraries) introduced with DFSMSdfp for MVS/XA and MVS/ESA systems. A PDS/E library can store program objects or other types of members, but not both. BPAM cannot process a PDS/E containing program objects. PDS/E structure is similar to PDS and is used to store the same types of data. However, PDS/E files have a better directory structure which does not require pre-allocation of directory blocks when the PDS/E is defined (and therefore does not run out of directory blocks if not enough were specified). Also, PDS/E automatically stores members in such a way that compression operation is not needed to reclaim "dead" space. PDS/E files can only reside on DASD in order to use the directory structure to access individual members. == Generation Data Group == A Generation Data Group (GDG) is a group of non-VSAM data sets that are successive generations of historically-related data stored on an IBM mainframe (running OS/360 and its successors or DOS/360 and its successors). A GDG is usually cataloged. An individual member of the GDG collection is called a "Generation Data Set." The latter may be identified by an absolute number, ACCTG.OURGDG(1234), or a relative number: (-1) for the previous generation, (0) for the current one, and (+1) the next generation. A GDG specifies how many generations of a data set are to be kept and at what age a generation will be deleted. Whenever a new generation is created, the system checks whether one or more obsolete generations are to be deleted. The purpose of GDGs is to automate archival, using the command language JCL, the data set name given is generic. When DSN appears, the GDG data set appears along with the history number, where (0) is the most recent version (-1), (-2), ... are previous generations (+1) a new generation (see DD) Another use of GDGs is to be able to address all generations simultaneously within a JCL script without having to know the number of currently available generations. To do this, you have to omit the parentheses and the generation number in the JCL when specifying the dataset. === GDG JCL & features === Generation Data Groups are defined using either the BLDG statement of the IEHPROGM utility or the DEFINE GENERATIONGROUP statement of the newer IDCAMS utility, which allows setting various parameters. LIMIT(10) would limit the number of generations limit to 10. SCRATCH FOR (91) would retain each member, up to the limited#generations, at least 91 days. IDCAMS can also delete (and optionally uncatalog) a GDG. ==== Example ==== Creation of a standard GDG for five safety scopes, each at least 35 days old: Delete a standard GDG:

    Read more →
  • Error level analysis

    Error level analysis

    Error level analysis (ELA) is the analysis of compression artifacts in digital data with lossy compression such as JPEG. == Principles == When used, lossy compression is normally applied uniformly to a set of data, such as an image, resulting in a uniform level of compression artifacts. Alternatively, the data may consist of parts with different levels of compression artifacts. This difference may arise from the different parts having been repeatedly subjected to the same lossy compression a different number of times, or the different parts having been subjected to different kinds of lossy compression. A difference in the level of compression artifacts in different parts of the data may therefore indicate that the data has been edited. In the case of JPEG, even a composite with parts subjected to matching compressions will have a difference in the compression artifacts. In order to make the typically faint compression artifacts more readily visible, the data to be analyzed is subjected to an additional round of lossy compression, this time at a known, uniform level, and the result is subtracted from the original data under investigation. The resulting difference image is then inspected manually for any variation in the level of compression artifacts. In 2007, N. Krawetz denoted this method "error level analysis". Additionally, digital data formats such as JPEG sometimes include metadata describing the specific lossy compression used. If in such data the observed compression artifacts differ from those expected from the given metadata description, then the metadata may not describe the actual compressed data, and thus indicate that the data have been edited. == Limitations == By its nature, data without lossy compression, such as a PNG image, cannot be subjected to error level analysis. Consequently, since editing could have been performed on data without lossy compression with lossy compression applied uniformly to the edited, composite data, the presence of a uniform level of compression artifacts does not rule out editing of the data. Additionally, any non-uniform compression artifacts in a composite may be removed by subjecting the composite to repeated, uniform lossy compression. Also, if the image color space is reduced to 256 colors or less, for example, by conversion to GIF, then error level analysis will generate useless results. More significant, the actual interpretation of the level of compression artifacts in a given segment of the data is subjective, and the determination of whether editing has occurred is therefore not robust. == Controversy == In May 2013, Dr Neal Krawetz used error level analysis on the 2012 World Press Photo of the Year and concluded on his Hacker Factor blog that it was "a composite" with modifications that "fail to adhere to the acceptable journalism standards used by Reuters, Associated Press, Getty Images, National Press Photographer's Association, and other media outlets". The World Press Photo organizers responded by letting two independent experts analyze the image files of the winning photographer and subsequently confirmed the integrity of the files. One of the experts, Hany Farid, said about error level analysis that "It incorrectly labels altered images as original and incorrectly labels original images as altered with the same likelihood". Krawetz responded by clarifying that "It is up to the user to interpret the results. Any errors in identification rest solely on the viewer". In May 2015, the citizen journalism team Bellingcat wrote that error level analysis revealed that the Russian Ministry of Defense had edited satellite images related to the Malaysia Airlines Flight 17 disaster. In a reaction to this, image forensics expert Jens Kriese said about error level analysis: "The method is subjective and not based entirely on science", and that it is "a method used by hobbyists". On his Hacker Factor Blog, the inventor of error level analysis Neal Krawetz criticized both Bellingcat's use of error level analysis as "misinterpreting the results" but also on several points Jens Kriese's "ignorance" regarding error level analysis.

    Read more →
  • Copyright

    Copyright

    A copyright is a type of intellectual property that gives its owner the exclusive legal right to copy, distribute, adapt, display, and perform a creative work, usually for a limited time. The creative work may be in a literary, artistic, educational, or musical form. Copyright is intended to protect the original expression of an idea in the form of a creative work, but not the idea itself. A copyright is subject to limitations based on public interest considerations, such as the fair use doctrine in the United States and fair dealing doctrine in the United Kingdom. Some jurisdictions require "fixing" copyrighted works in a tangible form. It is often shared among multiple authors, each of whom holds a set of rights to use or license the work, and who are commonly referred to as rights holders. These rights normally include reproduction, control over derivative works, distribution, public performance, and moral rights such as attribution. Copyrights can be granted by public law and are in that case considered "territorial rights". This means that copyrights granted by the law of a certain state do not extend beyond the territory of that specific jurisdiction. Copyrights of this type vary by country; many countries, and sometimes a large group of countries, have made agreements with other countries on procedures applicable when works "cross" national borders or national rights are inconsistent. Typically, the public law duration of a copyright expires 50 to 100 years after the creator dies, depending on the jurisdiction. Some countries require certain copyright formalities to establishing copyright, others recognize copyright in any completed work, without a formal registration. When the copyright of a work expires, it enters the public domain. == History == === Background === The concept of copyright developed after the printing press came into use in Europe in the 15th and 16th centuries. It was associated with a common law and rooted in the civil law system. The printing press made it much cheaper to produce works, but as there was initially no copyright law, anyone could buy or rent a press and print any text. Popular new works were immediately re-set and re-published by competitors, so printers needed a constant stream of new material. Fees paid to authors for new works were high and significantly supplemented the incomes of many academics. Printing brought profound social changes. The rise in literacy across Europe led to a dramatic increase in the demand for reading matter. Prices of reprints were low, so publications could be bought by poorer people, creating a mass audience. In German-language markets before the advent of copyright, technical materials, like academic papers and handbooks, were inexpensive and widely available; it has been suggested this contributed to Germany's industrial and economic success. === Conception === The concept of copyright first developed in England. In reaction to the printing of "scandalous books and pamphlets", the English Parliament passed the Licensing of the Press Act 1662, which required all intended publications to be registered with the government-approved Stationers' Company, giving the Stationers the right to regulate what material could be printed. The Statute of Anne, enacted in 1710 in England and Scotland, provided the first legislation to protect copyrights (but not authors' rights). The Copyright Act 1814 extended more rights for authors but did not protect British publications from being reprinted in the US. The Berne International Copyright Convention of 1886 finally provided protection for authors among the countries who signed the agreement, although the US did not join the Berne Convention until 1989. In the US, the Constitution grants Congress the right to establish copyright and patent laws. Shortly after the Constitution was passed, Congress enacted the Copyright Act of 1790, modeling it after the Statute of Anne. While the national law protected authors' published works, authority was granted to the states to protect authors' unpublished works. The most recent major overhaul of copyright in the US, the Copyright Act of 1976, extended federal copyright to works as soon as they are created and "fixed", without requiring publication or registration. State law continues to apply to unpublished works that are not otherwise copyrighted by federal law. This act also changed the calculation of copyright term from a fixed term (then a maximum of fifty-six years) to "life of the author plus 50 years". These changes brought the US closer to conformity with the Berne Convention, and in 1989 the United States further revised its copyright law and joined the Berne Convention officially. Copyright laws allow products of creative human activities, such as literary and artistic production, to be preferentially exploited and thus incentivized. Different cultural attitudes, social organizations, economic models and legal frameworks are seen to account for why copyright emerged in Europe and not, for example, in Asia. In the Middle Ages in Europe, there was generally a lack of any concept of literary property due to the general relations of production, the specific organization of literary production and the role of culture in society. The latter refers to the tendency of oral societies, such as that of Europe in the medieval period, to view knowledge as the product and expression of the collective, rather than to see it as individual property. However, with copyright laws, intellectual production comes to be seen as a product of an individual, with attendant rights. The most significant point is that patent and copyright laws support the expansion of the range of creative human activities that can be commodified. This parallels the ways in which capitalism led to the commodification of many aspects of social life that earlier had no monetary or economic value perse. Copyright has developed into a concept that has a significant effect on nearly every modern industry, including not just literary work, but also forms of creative work such as sound recordings, films, photographs, software, and architecture. === National copyrights === Often seen as the first real copyright law, the 1709 British Statute of Anne gave authors and the publishers to whom they did chose to license their works, the right to publish the author's creations for a fixed period, after which the copyright expired. It was "An Act for the Encouragement of Learning, by Vesting the Copies of Printed Books in the Authors or the Purchasers of such Copies, during the Times therein mentioned." The act also alluded to individual rights of the artist. It began: "Whereas Printers, Booksellers, and other Persons, have of late frequently taken the Liberty of Printing ... Books, and other Writings, without the Consent of the Authors ... to their very great Detriment, and too often to the Ruin of them and their Families:". A right to benefit financially from the work is articulated, and court rulings and legislation have recognized a right to control the work, such as ensuring that the integrity of it is preserved. An irrevocable right to be recognized as the work's creator appears in some countries' copyright laws. The Copyright Clause of the United States, Constitution (1787) authorized copyright legislation: "To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries." That is, by guaranteeing them a period of time in which they alone could profit from their works, they would be enabled and encouraged to invest the time required to create them, and this would be good for society as a whole. A right to profit from the work has been the philosophical underpinning for much legislation extending the duration of copyright, to the life of the creator and beyond, to their heirs. Yet scholars like Lawrence Lessig have argued that copyright terms have been extended beyond the scope imagined by the Framers. Lessig refers to the Copyright Clause as the "Progress Clause" to emphasize the social dimension of intellectual property rights. The original length of copyright in the United States was 14 years, and it had to be explicitly applied for. If the author wished, they could apply for a second 14‑year monopoly grant, but after that the work entered the public domain, so it could be used and built upon by others. === Continental law === In many jurisdictions of the European continent, comparable legal concepts to copyright did exist from the 16th century on but did change under Napoleonic rule into another legal concept: authors' rights or creator's right laws, from French: droits d'auteur and German Urheberrecht. In many modern-day publications the terms copyright and authors' rights are being mixed, or used as translations, but in a juridical sense the legal concepts do essentially differ. Authors' rights are, generally speaking,

    Read more →
  • Protecting Our Kids from Social Media Addiction Act

    Protecting Our Kids from Social Media Addiction Act

    Protecting Our Kids from Social Media Addiction Act also known as California SB 976 is a law that was enacted in September 2024 that is meant to address problematic social media usage among minors. The law prohibitions minors to have "addictive feeds" unless they have verifiable parental consent, minor's notifications are also restricted between 12 am to 6 am and during school hours between 8 am and 3 pm it also well requires minors to have default privacies settings and have social media companies to publicly disclose certain metrics about their users. The law was set to take effect in two steps the first being the restrictions on social media feeds, notifications, disclosures from social media companies and default settings which would have taken effect on January 1, 2025, and the age verification provision which would have taken effect on January 1, 2027. However, has faced legal challenges since its enactment delaying its enactment. == Legal Challenges == In November 2024 NetChoice a trade association representing many of the biggest social media companies such as YouTube, Facebook and Instagram sued the attorney general of California Rob Bonta hoping to get an injunction before the first set of the law's provisions would take effect in January of the next year. However, judge Edward Davila would only grant Netchoice's request as to the restrictions on notifications and public disclosures and would deny their request as to the rest of the law. The law was later fully enjoined temporarily by the District Court and Appellant Court pending appeal, and the case is now in the Ninth Circuit Court of Appeals and is pending a decision. === Social media platforms challenges to law === In November 2025 Meta, Google and TikTok filed lawsuits against the law arguing it violates the first amendment.

    Read more →
  • Clustered file system

    Clustered file system

    A clustered file system (CFS) is a file system which is shared by being simultaneously mounted on multiple servers. There are several approaches to clustering, most of which do not employ a clustered file system (only direct attached storage for each node). Clustered file systems can provide features like location-independent addressing and redundancy which improve reliability or reduce the complexity of the other parts of the cluster. Parallel file systems are a type of clustered file system that spread data across multiple storage nodes, usually for redundancy or performance. == Shared-disk file system == A shared-disk file system uses a storage area network (SAN) to allow multiple computers to gain direct disk access at the block level. Access control and translation from file-level operations that applications use to block-level operations used by the SAN must take place on the client node. The most common type of clustered file system, the shared-disk file system – by adding mechanisms for concurrency control – provides a consistent and serializable view of the file system, avoiding corruption and unintended data loss even when multiple clients try to access the same files at the same time. Shared-disk file-systems commonly employ some sort of fencing mechanism to prevent data corruption in case of node failures, because an unfenced device can cause data corruption if it loses communication with its sister nodes and tries to access the same information other nodes are accessing. The underlying storage area network may use any of a number of block-level protocols, including SCSI, iSCSI, HyperSCSI, ATA over Ethernet (AoE), Fibre Channel, network block device, and InfiniBand. There are different architectural approaches to a shared-disk filesystem. Some distribute file information across all the servers in a cluster (fully distributed). === Examples === == Distributed file systems == Distributed file systems do not share block level access to the same storage but use a network protocol. These are commonly known as network file systems, even though they are not the only file systems that use the network to send data. Distributed file systems can restrict access to the file system depending on access lists or capabilities on both the servers and the clients, depending on how the protocol is designed. The difference between a distributed file system and a distributed data store is that a distributed file system allows files to be accessed using the same interfaces and semantics as local files – for example, mounting/unmounting, listing directories, read/write at byte boundaries, system's native permission model. Distributed data stores, by contrast, require using a different API or library and have different semantics (most often those of a database). === Design goals === Distributed file systems may aim for "transparency" in a number of aspects. That is, they aim to be "invisible" to client programs, which "see" a system which is similar to a local file system. Behind the scenes, the distributed file system handles locating files, transporting data, and potentially providing other features listed below. Access transparency: clients are unaware that files are distributed and can access them in the same way as local files are accessed. Location transparency: a consistent namespace exists encompassing local as well as remote files. The name of a file does not give its location. Concurrency transparency: all clients have the same view of the state of the file system. This means that if one process is modifying a file, any other processes on the same system or remote systems that are accessing the files will see the modifications in a coherent manner. Failure transparency: the client and client programs should operate correctly after a server failure. Heterogeneity: file service should be provided across different hardware and operating system platforms. Scalability: the file system should work well in small environments (1 machine, a dozen machines) and also scale gracefully to bigger ones (hundreds through tens of thousands of systems). Replication transparency: Clients should not have to be aware of the file replication performed across multiple servers to support scalability. Migration transparency: files should be able to move between different servers without the client's knowledge. === History === The Incompatible Timesharing System used virtual devices for transparent inter-machine file system access in the 1960s. More file servers were developed in the 1970s. In 1976, Digital Equipment Corporation created the File Access Listener (FAL), an implementation of the Data Access Protocol as part of DECnet Phase II which became the first widely used network file system. In 1984, Sun Microsystems created the file system called "Network File System" (NFS) which became the first widely used Internet Protocol based network file system. Other notable network file systems are Andrew File System (AFS), Apple Filing Protocol (AFP), NetWare Core Protocol (NCP), and Server Message Block (SMB) which is also known as Common Internet File System (CIFS). In 1986, IBM announced client and server support for Distributed Data Management Architecture (DDM) for the System/36, System/38, and IBM mainframe computers running CICS. This was followed by the support for IBM Personal Computer, AS/400, IBM mainframe computers under the MVS and VSE operating systems, and FlexOS. DDM also became the foundation for Distributed Relational Database Architecture, also known as DRDA. There are many peer-to-peer network protocols for open-source distributed file systems for cloud or closed-source clustered file systems, e. g.: 9P, AFS, Coda, CIFS/SMB, DCE/DFS, WekaFS, Lustre, PanFS, Google File System, Mnet, Chord Project. === Examples === == Network-attached storage == Network-attached storage (NAS) provides both storage and a file system, like a shared disk file system on top of a storage area network (SAN). NAS typically uses file-based protocols (as opposed to block-based protocols a SAN would use) such as NFS (popular on UNIX systems), SMB/CIFS (Server Message Block/Common Internet File System) (used with MS Windows systems), AFP (used with Apple Macintosh computers), or NCP (used with OES and Novell NetWare). == Design considerations == === Avoiding single point of failure === The failure of disk hardware or a given storage node in a cluster can create a single point of failure that can result in data loss or unavailability. Fault tolerance and high availability can be provided through data replication of one sort or another, so that data remains intact and available despite the failure of any single piece of equipment. For examples, see the lists of distributed fault-tolerant file systems and distributed parallel fault-tolerant file systems. === Performance === A common performance measurement of a clustered file system is the amount of time needed to satisfy service requests. In conventional systems, this time consists of a disk-access time and a small amount of CPU-processing time. But in a clustered file system, a remote access has additional overhead due to the distributed structure. This includes the time to deliver the request to a server, the time to deliver the response to the client, and for each direction, a CPU overhead of running the communication protocol software. === Concurrency === Concurrency control becomes an issue when more than one person or client is accessing the same file or block and want to update it. Hence updates to the file from one client should not interfere with access and updates from other clients. This problem is more complex with file systems due to concurrent overlapping writes, where different writers write to overlapping regions of the file concurrently. This problem is usually handled by concurrency control or locking which may either be built into the file system or provided by an add-on protocol. == History == IBM mainframes in the 1970s could share physical disks and file systems if each machine had its own channel connection to the drives' control units. In the 1980s, Digital Equipment Corporation's TOPS-20 and OpenVMS clusters (VAX/ALPHA/IA64) included shared disk file systems.

    Read more →
  • Semantic folding

    Semantic folding

    Semantic folding theory describes a procedure for encoding the semantics of natural language text in a semantically grounded binary representation. This approach provides a framework for modelling how language data is processed by the neocortex. == Theory == Semantic folding theory draws inspiration from Douglas R. Hofstadter's Analogy as the Core of Cognition which suggests that the brain makes sense of the world by identifying and applying analogies. The theory hypothesises that semantic data must therefore be introduced to the neocortex in such a form as to allow the application of a similarity measure and offers, as a solution, the sparse binary vector employing a two-dimensional topographic semantic space as a distributional reference frame. The theory builds on the computational theory of the human cortex known as hierarchical temporal memory (HTM), and positions itself as a complementary theory for the representation of language semantics. A particular strength claimed by this approach is that the resulting binary representation enables complex semantic operations to be performed simply and efficiently at the most basic computational level. == Two-dimensional semantic space == Analogous to the structure of the neocortex, Semantic Folding theory posits the implementation of a semantic space as a two-dimensional grid. This grid is populated by context-vectors in such a way as to place similar context-vectors closer to each other, for instance, by using competitive learning principles. This vector space model is presented in the theory as an equivalence to the well known word space model described in the information retrieval literature. Given a semantic space (implemented as described above) a word-vector can be obtained for any given word Y by employing the following algorithm: For each position X in the semantic map (where X represents cartesian coordinates) if the word Y is contained in the context-vector at position X then add 1 to the corresponding position in the word-vector for Y else add 0 to the corresponding position in the word-vector for Y The result of this process will be a word-vector containing all the contexts in which the word Y appears and will therefore be representative of the semantics of that word in the semantic space. It can be seen that the resulting word-vector is also in a sparse distributed representation (SDR) format [Schütze, 1993] & [Sahlgreen, 2006]. Some properties of word-SDRs that are of particular interest with respect to computational semantics are: high noise resistance: As a result of similar contexts being placed closer together in the underlying map, word-SDRs are highly tolerant of false or shifted "bits". boolean logic: It is possible to manipulate word-SDRs in a meaningful way using boolean (OR, AND, exclusive-OR) and/or arithmetical (SUBtract) functions . sub-sampling: Word-SDRs can be sub-sampled to a high degree without any appreciable loss of semantic information. topological two-dimensional representation: The SDR representation maintains the topological distribution of the underlying map therefore words with similar meanings will have similar word-vectors. This suggests that a variety of measures can be applied to the calculation of semantic similarity, from a simple overlap of vector elements, to a range of distance measures such as: Euclidean distance, Hamming distance, Jaccard distance, cosine similarity, Levenshtein distance, Sørensen-Dice index, etc. == Semantic spaces == Semantic spaces in the natural language domain aim to create representations of natural language that are capable of capturing meaning. The original motivation for semantic spaces stems from two core challenges of natural language: Vocabulary mismatch (the fact that the same meaning can be expressed in many ways) and ambiguity of natural language (the fact that the same term can have several meanings). The application of semantic spaces in natural language processing (NLP) aims at overcoming limitations of rule-based or model-based approaches operating on the keyword level. The main drawback with these approaches is their brittleness, and the large manual effort required to create either rule-based NLP systems or training corpora for model learning. Rule-based and machine learning-based models are fixed on the keyword level and break down if the vocabulary differs from that defined in the rules or from the training material used for the statistical models. Research in semantic spaces dates back more than 20 years. In 1996, two papers were published that raised a lot of attention around the general idea of creating semantic spaces: latent semantic analysis from Microsoft and Hyperspace Analogue to Language from the University of California. However, their adoption was limited by the large computational effort required to construct and use those semantic spaces. A breakthrough with regard to the accuracy of modelling associative relations between words (e.g. "spider-web", "lighter-cigarette", as opposed to synonymous relations such as "whale-dolphin", "astronaut-driver") was achieved by explicit semantic analysis (ESA) in 2007. ESA was a novel (non-machine learning) based approach that represented words in the form of vectors with 100,000 dimensions (where each dimension represents an Article in Wikipedia). However practical applications of the approach are limited due to the large number of required dimensions in the vectors. More recently, advances in neural networking techniques in combination with other new approaches (tensors) led to a host of new recent developments: Word2vec from Google and GloVe from Stanford University. Semantic folding represents a novel, biologically inspired approach to semantic spaces where each word is represented as a sparse binary vector with 16,000 dimensions (a semantic fingerprint) in a 2D semantic map (the semantic universe). Sparse binary representation are advantageous in terms of computational efficiency, and allow for the storage of very large numbers of possible patterns. == Visualization == The topological distribution over a two-dimensional grid (outlined above) lends itself to a bitmap type visualization of the semantics of any word or text, where each active semantic feature can be displayed as e.g. a pixel. As can be seen in the images shown here, this representation allows for a direct visual comparison of the semantics of two (or more) linguistic items. Image 1 clearly demonstrates that the two disparate terms "dog" and "car" have, as expected, very obviously different semantics. Image 2 shows that only one of the meaning contexts of "jaguar", that of "Jaguar" the car, overlaps with the meaning of Porsche (indicating partial similarity). Other meaning contexts of "jaguar" e.g. "jaguar" the animal clearly have different non-overlapping contexts. The visualization of semantic similarity using Semantic Folding bears a strong resemblance to the fMRI images produced in a research study conducted by A.G. Huth et al., where it is claimed that words are grouped in the brain by meaning. voxels, little volume segments of the brain, were found to follow a pattern were semantic information is represented along the boundary of the visual cortex with visual and linguistic categories represented on posterior and anterior side respectively.

    Read more →
  • Social media optimization

    Social media optimization

    Social media optimization (SMO) is the use of online platforms to generate income or publicity to increase the awareness of a brand, event, product or service. Types of social media involved include RSS feeds, blogging sites, social bookmarking sites, social news websites, video sharing websites such as YouTube and social networking sites such as Facebook, Instagram, TikTok and X (Twitter). SMO is similar to search engine optimization (SEO) in that the goal is to drive web traffic, and draw attention to a company or creator. SMO's focal point is on gaining organic links to social media content. In contrast, SEO's core is about reaching the top of the search engine hierarchy. In general, social media optimization refers to optimizing a website and its content to encourage more users to use and share links to the website across social media and networking sites. SMO is used to strategically create online content ranging from well-written text to eye-catching digital photos or video clips that encourages and entices people to engage with a website. Users share this content, via its weblink, with social media contacts and friends. Common examples of social media engagement are "liking and commenting on posts, retweeting, embedding, sharing, and promoting content". Social media optimization is also an effective way of implementing online reputation management (ORM), meaning that if someone posts bad reviews of a business, an SMO strategy can ensure that the negative feedback is not the first link to come up in a list of search engine results. In the 2010s, with social media sites overtaking TV as a source for news for young people, news organizations have become increasingly reliant on social media platforms for generating web traffic. Publishers such as The Economist employ large social media teams to optimize their online posts and maximize traffic, while other major publishers now use advanced artificial intelligence (AI) technology to generate higher volumes of web traffic. == Relationship with search engine optimization == Social media optimization is an increasingly important factor in search engine optimization, which is the process of designing a website in a way so that it has as high a ranking as possible on search engines. Search engines are increasingly utilizing the recommendations of users of social networks such as Reddit, Facebook, Tumblr, Twitter, YouTube, LinkedIn, Pinterest and Instagram to rank pages in the search engine result pages. The implication is that when a webpage is shared or "liked" by a user on a social network, it counts as a "vote" for that webpage's quality. Thus, search engines can use such votes accordingly to properly ranked websites in search engine results pages. Furthermore, since it is more difficult to tip the scales or influence the search engines in this way, search engines are putting more stock into social search. This, coupled with increasingly personalized search based on interests and location, has significantly increased the importance of a social media presence in search engine optimization. Due to personalized search results, location-based social media presences on websites such as Yelp, Google Places, Foursquare, and Yahoo! Local have become increasingly important. While social media optimization is related to search engine marketing, it differs in several ways. Primarily, SMO focuses on driving web traffic from sources other than search engines, though improved search engine ranking is also a benefit of successful social media optimization. Further, SMO is helpful to target particular geographic regions in order to target and reach potential customers. This helps in lead generation (finding new customers) and contributes to high conversion rates (i.e., converting previously uninterested individuals into people who are interested in a brand or organization). == Relationship with viral marketing == Social media optimization is in many ways connected to the technique of viral marketing or "viral seeding" where word of mouth is created through the use of networking in social bookmarking, video and photo sharing websites. An effective SMO campaign can harness the power of viral marketing; for example, 80% of activity on Pinterest is generated through "repinning." Furthermore, by following social trends and utilizing alternative social networks, websites can retain existing followers while also attracting new ones. This allows businesses to build an online following and presence, all linking back to the company's website for increased traffic. For example, with an effective social bookmarking campaign, not only can website traffic be increased, but a site's rankings can also be increased. In a similar way, the engagement with blogs creates a similar result by sharing content through the use of RSS in the blogosphere. Social media optimization is considered an integral part of an online reputation management (ORM) or search engine reputation management (SERM) strategy for organizations or individuals who care about their online presence. SMO is one of six key influencers that affect Social Commerce Construct (SCC). Online activities such as consumers' evaluations and advices on products and services constitute part of what creates a Social Commerce Construct (SCC). Social media optimization is not limited to marketing and brand building. Increasingly, smart businesses are integrating social media participation as part of their knowledge management strategy (i.e., product/service development, recruiting, employee engagement and turnover, brand building, customer satisfaction and relations, business development and more). Additionally, social media optimization can be implemented to foster a community of the associated site, allowing for a healthy business-to-consumer (B2C) relationship. == Origins and implementation == According to technologist Danny Sullivan, the term "social media optimization" was first used and described by marketer Rohit Bhargava on his marketing blog in August 2006. In the same post, Bhargava established the five important rules of social media optimization. Bhargava believed that by following his rules, anyone could influence the levels of traffic and engagement on their site, increase popularity, and ensure that it ranks highly in search engine results. An additional 11 SMO rules have since been added to the list by other marketing contributors. The 16 rules of SMO, according to one source, are as follows: Increase your linkability Make tagging and bookmarking easy Reward inbound links Help your content to "travel" via sharing Encourage the mashup, where users are allowed to remix content Be a user resource, even if it doesn't help you (e.g., provide resources and information for users) Reward helpful and valuable users Participate (join the online conversation) Know how to target your audience Create new, quality content ("web scraping" of existing online content is ignored by good search engines) Be "real" in the tone and style of the posts Don't forget your roots; be humble Don't be afraid to experiment, innovate, try new things and "stay fresh" Develop an SMO strategy Choose your SMO tactics wisely Make SMO a key part of your marketing process and develop company best practices Bhargava's initial five rules were more specifically designed to SMO, while the list is now much broader and addresses everything that can be done across different social media platforms. According to author and CEO of TopRank Online Marketing, Lee Odden, a Social Media Strategy is also necessary to ensure optimization. This is a similar concept to Bhargava's list of rules for SMO. The Social Media Strategy may consider: Objectives e.g. creating brand awareness and using social media for external communications. Listening e.g. monitoring conversations relating to customers and business objectives. Audience e.g. finding out who the customers are, what they do, who they are influenced by, and what they frequently talk about. It is important to work out what customers want in exchange for their online engagement and attention. Participation and content e.g. establishing a presence and community online and engaging with users by sharing useful and interesting information. Measurement e.g. keeping a record of likes and comments on posts, and the number of sales to monitor growth and determine which tactics are most useful in optimizing social media. According to Lon Safko and David K. Brake in The Social Media Bible, it is also important to act like a publisher by maintaining an effective organizational strategy, to have an original concept and unique "edge" that differentiates one's approach from competitors, and to experiment with new ideas if things do not work the first time. If a business is blog-based, an effective method of SMO is using widgets that allow users to share content to their personal social media platforms. This will ultimately reach a wider target audience and drive mor

    Read more →
  • Data deduplication

    Data deduplication

    In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization, which may in turn lower capital expenditure by reducing the overall amount of storage media required to meet storage capacity needs. It can also be applied to network data transfers to reduce the number of bytes that must be sent. The deduplication process requires comparison of data 'chunks' (also known as 'byte patterns') which are unique, contiguous blocks of data. These chunks are identified and stored during a process of analysis, and compared to other chunks within existing data. Whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced. A related technique is single-instance (data) storage, which replaces multiple copies of content at the whole-file level with a single shared copy. While possible to combine this with other forms of data compression and deduplication, it is distinct from newer approaches to data deduplication (which can operate at the segment or sub-block level). Deduplication is different from data compression algorithms, such as LZ77 and LZ78. Whereas compression algorithms identify redundant data inside individual files and encodes this redundant data more efficiently, the intent of deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, and replace them with a shared copy. == Functioning principle == For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy for deduplication ratio of roughly 100 to 1. Deduplication is often paired with data compression for additional storage saving: Deduplication is first used to eliminate large chunks of repetitive data, and compression is then used to efficiently encode each of the stored chunks. In computer code, deduplication is done by, for example, storing information in variables so that they don't have to be written out individually but can be changed all at once at a central referenced location. Examples are CSS classes and named references in MediaWiki. == Benefits == Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on a single disk. In the case of data backups, which routinely are performed to protect against data loss, most data in a given backup remain unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard linking) files that haven't changed or storing differences between files. Neither approach captures all redundancies, however. Hard-linking does not help with large files that have only changed in small ways, such as an email database; differences only find redundancies in adjacent versions of a single file (consider a section that was deleted and later added in again, or a logo image included in many documents). In-line network data deduplication is used to reduce the number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information. Virtual servers and virtual desktops benefit from deduplication because it allows nominally separate system files for each virtual machine to be coalesced into a single storage space. At the same time, if a given virtual machine customizes a file, deduplication will not change the files on the other virtual machines—something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved. == Classification == === Post-process versus in-line deduplication === Deduplication may occur "in-line", as data is flowing, or "post-process" after it has been written. With post-process deduplication, new data is first stored on the storage device and then a process at a later time will analyze the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, thereby ensuring that store performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on "active" files, or to process files based on type and location. One potential drawback is that duplicate data may be unnecessarily stored for a short time, which can be problematic if the system is nearing full capacity. Alternatively, deduplication hash calculations can be done in-line: synchronized as data enters the target device. If the storage system identifies a block which it has already stored, only a reference to the existing block is stored, rather than the whole new block. The advantage of in-line deduplication over post-process deduplication is that it requires less storage and network traffic, since duplicate data is never stored or transferred. On the negative side, hash calculations may be computationally expensive, thereby reducing the storage throughput. However, certain vendors with in-line deduplication have demonstrated equipment which performs in-line deduplication at high rates. Post-process and in-line deduplication methods are often heavily debated. === Data formats === The SNIA Dictionary identifies two methods: Content-agnostic data deduplication – a data deduplication method that does not require awareness of specific application data formats. Content-aware data deduplication – a data deduplication method that leverages knowledge of specific application data formats. === Source versus target deduplication === Another way to classify data deduplication methods is according to where they occur. Deduplication occurring close to where data is created, is referred to as "source deduplication". When it occurs near where the data is stored, it is called "target deduplication". Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file system. The file system will periodically scan new files creating hashes and compare them to hashes of existing files. When files with same hashes are found then the file copy is removed and the new file points to the old file. Unlike hard links however, duplicated files are considered to be separate entities and if one of the duplicated files is later modified, then using a system called copy-on-write a copy of that changed file or block is created. The deduplication process is transparent to the users and backup applications. Backing up a deduplicated file system will often cause duplication to occur resulting in the backups being bigger than the source data. Source deduplication can be declared explicitly for copying operations, as no calculation is needed to know that the copied data is in need of deduplication. This leads to a new form of link on file systems, called a reference-counted link, or reflink, in some systems (e.g. Linux), or a cloned file on macOS, where one or more inodes (file information entries) are made to share some or all of their data. It is named analogously to hard links, which work at the inode level, and symbolic links, which work at the filename level.The individual entries have a copy-on-write behavior that is non-aliasing, i.e. changing one copy afterwards will not affect other copies. Microsoft's ReFS also supports this operation. Target deduplication is the process of removing duplicates when the data was not generated at that location. Example of this would be a server connected to a SAN/NAS, The SAN/NAS would be a target for the server (target deduplication). The server is not aware of any deduplication, the server is also the point of data generation. A second example would be backup. Generally this will be a backup store such as a data repository or a virtual tape library. === Deduplication methods === One of the most common forms of data deduplication implementations works by comparing chunks of data to detect duplicates. For that to happen, each chunk of data is assigned an identification, calculated by the software, typically using cryptographic hash functions. In many implementations, the assumption is made that if the identification is identical, the data is identical, even though this cannot be true in all cases due to the pigeonhole principle; other implementations do not as

    Read more →
  • Format-preserving encryption

    Format-preserving encryption

    In cryptography, format-preserving encryption (FPE), refers to encrypting in such a way that the output (the ciphertext) is in the same format as the input (the plaintext). The meaning of "format" varies. Typically only finite sets of characters are used; numeric, alphabetic or alphanumeric. For example: Encrypting a 16-digit credit card number so that the ciphertext is another 16-digit number. Encrypting an English word so that the ciphertext is another English word. Encrypting an n-bit number so that the ciphertext is another n-bit number (this is the definition of an n-bit block cipher). For such finite domains, and for the purposes of the discussion below, the cipher is equivalent to a permutation of N integers {0, ... , N−1} where N is the size of the domain. == Motivation == === Restricted field lengths or formats === One motivation for using FPE comes from the problems associated with integrating encryption into existing applications, with well-defined data models. A typical example would be a credit card number, such as 1234567812345670 (16 bytes long, digits only). Adding encryption to such applications might be challenging if data models are to be changed, as it usually involves changing field length limits or data types. For example, output from a typical block cipher would turn credit card number into a hexadecimal (e.g.0x96a45cbcf9c2a9425cde9e274948cb67, 34 bytes, hexadecimal digits) or Base64 value (e.g. lqRcvPnCqUJc3p4nSUjLZw==, 24 bytes, alphanumeric and special characters), which will break any existing applications expecting the credit card number to be a 16-digit number. Apart from simple formatting problems, using AES-128-CBC, this credit card number might get encrypted to the hexadecimal value 0xde015724b081ea7003de4593d792fd8b695b39e095c98f3a220ff43522a2df02. In addition to the problems caused by creating invalid characters and increasing the size of the data, data encrypted using the CBC mode of an encryption algorithm also changes its value when it is decrypted and encrypted again. This happens because the random seed value that is used to initialize the encryption algorithm and is included as part of the encrypted value is different for each encryption operation. Because of this, it is impossible to use data that has been encrypted with the CBC mode as a unique key to identify a row in a database. FPE attempts to simplify the transition process by preserving the formatting and length of the original data, allowing a drop-in replacement of plaintext values with their ciphertexts in legacy applications. == Comparison to truly random permutations == Although a truly random permutation is the ideal FPE cipher, for large domains it is infeasible to pre-generate and remember a truly random permutation. So the problem of FPE is to generate a pseudorandom permutation from a secret key, in such a way that the computation time for a single value is small (ideally constant, but most importantly smaller than O(N)). == Comparison to block ciphers == An n-bit block cipher technically is a FPE on the set {0, ..., 2n-1}. If an FPE is needed on one of these standard sized sets (for example, n = 64 for DES and n = 128 for AES) a block cipher of the right size can be used. However, in typical usage, a block cipher is used in a mode of operation that allows it to encrypt arbitrarily long messages, and with an initialization vector as discussed above. In this mode, a block cipher is not an FPE. == Definition of security == In cryptographic literature (see most of the references below), the measure of a "good" FPE is whether an attacker can distinguish the FPE from a truly random permutation. Various types of attackers are postulated, depending on whether they have access to oracles or known ciphertext/plaintext pairs. == Algorithms == In most of the approaches listed here, a well-understood block cipher (such as AES) is used as a primitive to take the place of an ideal random function. This has the advantage that incorporation of a secret key into the algorithm is easy. Where AES is mentioned in the following discussion, any other good block cipher would work as well. === The FPE constructions of Black and Rogaway === Implementing FPE with security provably related to that of the underlying block cipher was first undertaken in a paper by cryptographers John Black and Phillip Rogaway, which described three ways to do this. They proved that each of these techniques is as secure as the block cipher that is used to construct it. This means that if the AES algorithm is used to create an FPE algorithm, then the resulting FPE algorithm is as secure as AES because an adversary capable of defeating the FPE algorithm can also defeat the AES algorithm. Therefore, if AES is secure, then the FPE algorithms constructed from it are also secure. In all of the following, E denotes the AES encryption operation that is used to construct an FPE algorithm and F denotes the FPE encryption operation. ==== FPE from a prefix cipher ==== One simple way to create an FPE algorithm on {0, ..., N-1} is to assign a pseudorandom weight to each integer, then sort by weight. The weights are defined by applying an existing block cipher to each integer. Black and Rogaway call this technique a "prefix cipher" and showed it was provably as good as the block cipher used. Thus, to create an FPE on the domain {0,1,2,3}, given a key K apply AES(K) to each integer, giving, for example, weight(0) = 0x56c644080098fc5570f2b329323dbf62 weight(1) = 0x08ee98c0d05e3dad3eb3d6236f23e7b7 weight(2) = 0x47d2e1bf72264fa01fb274465e56ba20 weight(3) = 0x077de40941c93774857961a8a772650d Sorting [0,1,2,3] by weight gives [3,1,2,0], so the cipher is F(0) = 3 F(1) = 1 F(2) = 2 F(3) = 0 This method is only useful for small values of N. For larger values, the size of the lookup table and the required number of encryptions to initialize the table gets too big to be practical. ==== FPE from cycle walking ==== If there is a set M of allowed values within the domain of a pseudorandom permutation P (for example P can be a block cipher like AES), an FPE algorithm can be created from the block cipher by repeatedly applying the block cipher until the result is one of the allowed values (within M). CycleWalkingFPE(x) { if P(x) is an element of M then return P(x) else return CycleWalkingFPE(P(x)) } The recursion is guaranteed to terminate. (Because P is one-to-one and the domain is finite, repeated application of P forms a cycle, so starting with a point in M the cycle will eventually terminate in M.) This has the advantage that the elements of M do not have to be mapped to a consecutive sequence {0,...,N-1} of integers. It has the disadvantage, when M is much smaller than P's domain, that too many iterations might be required for each operation. If P is a block cipher of a fixed size, such as AES, this is a severe restriction on the sizes of M for which this method is efficient. For example, an application may want to encrypt 100-bit values with AES in a way that creates another 100-bit value. With this technique, AES-128-ECB encryption can be applied until it reaches a value which has all of its 28 highest bits set to 0, which will take an average of 228 iterations to happen. ==== FPE from a Feistel network ==== It is also possible to make a FPE algorithm using a Feistel network. A Feistel network needs a source of pseudo-random values for the sub-keys for each round, and the output of the AES algorithm can be used as these pseudo-random values. When this is done, the resulting Feistel construction is good if enough rounds are used. One way to implement an FPE algorithm using AES and a Feistel network is to use as many bits of AES output as are needed to equal the length of the left or right halves of the Feistel network. If a 24-bit value is needed as a sub-key, for example, it is possible to use the lowest 24 bits of the output of AES for this value. This may not result in the output of the Feistel network preserving the format of the input, but it is possible to iterate the Feistel network in the same way that the cycle-walking technique does to ensure that format can be preserved. Because it is possible to adjust the size of the inputs to a Feistel network, it is possible to make it very likely that this iteration ends very quickly on average. In the case of credit card numbers, for example, there are 1015 possible 16-digit credit card numbers (accounting for the redundant check digit), and because the 1015 ≈ 249.8, using a 50-bit wide Feistel network along with cycle walking will create an FPE algorithm that encrypts fairly quickly on average. === The Thorp shuffle === A Thorp shuffle is like an idealized card-shuffle, or equivalently a maximally-unbalanced Feistel cipher where one side is a single bit. It is easier to prove security for unbalanced Feistel ciphers than for balanced ones. === VIL mode === For domain sizes that are a power of two, and an existing block cipher with a smaller bl

    Read more →
  • Grammar induction

    Grammar induction

    Grammar induction (or grammatical inference) is the process in machine learning of learning a formal grammar (usually as a collection of re-write rules or productions or alternatively as a finite-state machine or automaton of some kind) from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs. == Grammar classes == Grammatical inference has often been very focused on the problem of learning finite-state machines of various types (see the article Induction of regular languages for details on these approaches), since there have been efficient algorithms for this problem since the 1980s. Since the beginning of the century, these approaches have been extended to the problem of inference of context-free grammars and richer formalisms, such as multiple context-free grammars and parallel multiple context-free grammars. Other classes of grammars for which grammatical inference has been studied are combinatory categorial grammars, stochastic context-free grammars, contextual grammars and pattern languages. == Learning models == The simplest form of learning is where the learning algorithm merely receives a set of examples drawn from the language in question: the aim is to learn the language from examples of it (and, rarely, from counter-examples, that is, example that do not belong to the language). However, other learning models have been studied. One frequently studied alternative is the case where the learner can ask membership queries as in the exact query learning model or minimally adequate teacher model introduced by Angluin. == Methodologies == There is a wide variety of methods for grammatical inference. Two of the classic sources are Fu (1977) and Fu (1982). Duda, Hart & Stork (2001) also devote a brief section to the problem, and cite a number of references. The basic trial-and-error method they present is discussed below. For approaches to infer subclasses of regular languages in particular, see Induction of regular languages. A more recent textbook is de la Higuera (2010), which covers the theory of grammatical inference of regular languages and finite state automata. D'Ulizia, Ferri and Grifoni provide a survey that explores grammatical inference methods for natural languages. === Induction of probabilistic grammars === There are several methods for induction of probabilistic context-free grammars. === Grammatical inference by trial-and-error === The method proposed in Section 8.7 of Duda, Hart & Stork (2001) suggests successively guessing grammar rules (productions) and testing them against positive and negative observations. The rule set is expanded so as to be able to generate each positive example, but if a given rule set also generates a negative example, it must be discarded. This particular approach can be characterized as "hypothesis testing" and bears some similarity to Mitchel's version space algorithm. The Duda, Hart & Stork (2001) text provide a simple example which nicely illustrates the process, but the feasibility of such an unguided trial-and-error approach for more substantial problems is dubious. === Grammatical inference by genetic algorithms === Grammatical induction using evolutionary algorithms is the process of evolving a representation of the grammar of a target language through some evolutionary process. Formal grammars can easily be represented as tree structures of production rules that can be subjected to evolutionary operators. Algorithms of this sort stem from the genetic programming paradigm pioneered by John Koza. Other early work on simple formal languages used the binary string representation of genetic algorithms, but the inherently hierarchical structure of grammars couched in the EBNF language made trees a more flexible approach. Koza represented Lisp programs as trees. He was able to find analogues to the genetic operators within the standard set of tree operators. For example, swapping sub-trees is equivalent to the corresponding process of genetic crossover, where sub-strings of a genetic code are transplanted into an individual of the next generation. Fitness is measured by scoring the output from the functions of the Lisp code. Similar analogues between the tree structured lisp representation and the representation of grammars as trees, made the application of genetic programming techniques possible for grammar induction. In the case of grammar induction, the transplantation of sub-trees corresponds to the swapping of production rules that enable the parsing of phrases from some language. The fitness operator for the grammar is based upon some measure of how well it performed in parsing some group of sentences from the target language. In a tree representation of a grammar, a terminal symbol of a production rule corresponds to a leaf node of the tree. Its parent nodes corresponds to a non-terminal symbol (e.g. a noun phrase or a verb phrase) in the rule set. Ultimately, the root node might correspond to a sentence non-terminal. === Grammatical inference by greedy algorithms === Like all greedy algorithms, greedy grammar inference algorithms make, in iterative manner, decisions that seem to be the best at that stage. The decisions made usually deal with things like the creation of new rules, the removal of existing rules, the choice of a rule to be applied or the merging of some existing rules. Because there are several ways to define 'the stage' and 'the best', there are also several greedy grammar inference algorithms. These context-free grammar generating algorithms make the decision after every read symbol: Lempel-Ziv-Welch algorithm creates a context-free grammar in a deterministic way such that it is necessary to store only the start rule of the generated grammar. Sequitur and its modifications. These context-free grammar generating algorithms first read the whole given symbol-sequence and then start to make decisions: Byte pair encoding and its optimizations. === Distributional learning === A more recent approach is based on distributional learning. Algorithms using these approaches have been applied to learning context-free grammars and mildly context-sensitive languages and have been proven to be correct and efficient for large subclasses of these grammars. === Learning of pattern languages === Angluin defines a pattern to be "a string of constant symbols from Σ and variable symbols from a disjoint set". The language of such a pattern is the set of all its nonempty ground instances i.e. all strings resulting from consistent replacement of its variable symbols by nonempty strings of constant symbols. A pattern is called descriptive for a finite input set of strings if its language is minimal (with respect to set inclusion) among all pattern languages subsuming the input set. Angluin gives a polynomial algorithm to compute, for a given input string set, all descriptive patterns in one variable x. To this end, she builds an automaton representing all possibly relevant patterns; using sophisticated arguments about word lengths, which rely on x being the only variable, the state count can be drastically reduced. Erlebach et al. give a more efficient version of Angluin's pattern learning algorithm, as well as a parallelized version. Arimura et al. show that a language class obtained from limited unions of patterns can be learned in polynomial time. === Pattern theory === Pattern theory, formulated by Ulf Grenander, is a mathematical formalism to describe knowledge of the world as patterns. It differs from other approaches to artificial intelligence in that it does not begin by prescribing algorithms and machinery to recognize and classify patterns; rather, it prescribes a vocabulary to articulate and recast the pattern concepts in precise language. In addition to the new algebraic vocabulary, its statistical approach was novel in its aim to: Identify the hidden variables of a data set using real world data rather than artificial stimuli, which was commonplace at the time. Formulate prior distributions for hidden variables and models for the observed variables that form the vertices of a Gibbs-like graph. Study the randomness and variability of these graphs. Create the basic classes of stochastic models applied by listing the deformations of the patterns. Synthesize (sample) from the models, not just analyze signals with it. Broad in its mathematical coverage, pattern theory spans algebra and statistics, as well as local topological and global entropic properties. == Applications == The principle of grammar induction has been applied to other aspects of natural language processing, and has been applied (among many other problems) to semantic parsing, natural language understanding, example-based translation, language acquisition, grammar-based compre

    Read more →
  • CARE Principles for Indigenous Data Governance

    CARE Principles for Indigenous Data Governance

    The CARE Principles for Indigenous Data Governance are a set of principles intended to guide open data projects in engaging Indigenous Peoples rights and interests. CARE was created in 2019 by the International Indigenous Data Sovereignty Interest Group, a group that is a part of the Research Data Alliance. It outlines collective rights related to open data in the context of the United Nations Declaration on the Rights of Indigenous Peoples and Indigenous data sovereignty. CARE is an acronym which stands for Collective Benefit, Authority to Control, Responsibility, Ethics. The CARE Principles are 'people and purpose-oriented, reflecting the crucial role of data in advancing Indigenous innovation and self-determination', and intended as a complement to the data-oriented perspective of other standards such as FAIR data (findable, accessible, interoperable, reusable). The CARE principles have been embedded into the Beta version of Standardised Data on Initiatives (STARDIT). CARE principles were the basis of a submission to the UN's Global Digital Compact.

    Read more →
  • MIME Object Security Services

    MIME Object Security Services

    MIME Object Security Services (MOSS) is a protocol that uses the multipart/signed and multipart/encrypted framework to apply digital signature and encryption services to MIME objects. == Details == The services are offered through the use of end-to-end cryptography between an originator and a recipient at the application layer. Asymmetric (public key) cryptography is used in support of the digital signature service and encryption key management. Symmetric (secret key) cryptography is used in support of the encryption service. The procedures are intended to be compatible with a wide range of public key management approaches, including both ad hoc and certificate-based schemes. Mechanisms are provided to support many public key management approaches. == Spreading == MOSS was never widely deployed and is now abandoned, largely due to the popularity of PGP.

    Read more →