AI Detector Check

AI Detector Check — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Admissible heuristic

    Admissible heuristic

    In computer science, specifically in algorithms related to pathfinding, a heuristic function is said to be admissible if it never overestimates the cost of reaching the goal, i.e. the cost it estimates to reach the goal is not higher than the lowest possible cost from the current point in the path. In other words, it should act as a lower bound. It is related to the concept of consistent heuristics. While all consistent heuristics are admissible, not all admissible heuristics are consistent. == Search algorithms == An admissible heuristic is used to estimate the cost of reaching the goal state in an informed search algorithm. In order for a heuristic to be admissible to the search problem, the estimated cost must always be lower than or equal to the actual cost of reaching the goal state. The search algorithm uses the admissible heuristic to find an estimated optimal path to the goal state from the current node. For example, in A search the evaluation function (where n {\displaystyle n} is the current node) is: f ( n ) = g ( n ) + h ( n ) {\displaystyle f(n)=g(n)+h(n)} where f ( n ) {\displaystyle f(n)} = the evaluation function. g ( n ) {\displaystyle g(n)} = the cost from the start node to the current node h ( n ) {\displaystyle h(n)} = estimated cost from current node to goal. h ( n ) {\displaystyle h(n)} is calculated using the heuristic function. With a non-admissible heuristic, the A algorithm could overlook the optimal solution to a search problem due to an overestimation in f ( n ) {\displaystyle f(n)} . == Formulation == n {\displaystyle n} is a node h {\displaystyle h} is a heuristic h ( n ) {\displaystyle h(n)} is cost indicated by h {\displaystyle h} to reach a goal from n {\displaystyle n} h ∗ ( n ) {\displaystyle h^{}(n)} is the optimal cost to reach a goal from n {\displaystyle n} h ( n ) {\displaystyle h(n)} is admissible if, ∀ n {\displaystyle \forall n} h ( n ) ≤ h ∗ ( n ) {\displaystyle h(n)\leq h^{}(n)} == Construction == An admissible heuristic can be derived from a relaxed version of the problem, or by information from pattern databases that store exact solutions to subproblems of the problem, or by using inductive learning methods. == Examples == Two different examples of admissible heuristics apply to the fifteen puzzle problem: Hamming distance Manhattan distance The Hamming distance is the total number of misplaced tiles. It is clear that this heuristic is admissible since the total number of moves to order the tiles correctly is at least the number of misplaced tiles (each tile not in place must be moved at least once). The cost (number of moves) to the goal (an ordered puzzle) is at least the Hamming distance of the puzzle. The Manhattan distance of a puzzle is defined as: h ( n ) = ∑ all tiles d i s t a n c e ( tile, correct position ) {\displaystyle h(n)=\sum _{\text{all tiles}}{\mathit {distance}}({\text{tile, correct position}})} Consider the puzzle below in which the player wishes to move each tile such that the numbers are ordered. The Manhattan distance is an admissible heuristic in this case because every tile will have to be moved at least the number of spots in between itself and its correct position. The subscripts show the Manhattan distance for each tile. The total Manhattan distance for the shown puzzle is: h ( n ) = 3 + 1 + 0 + 1 + 2 + 3 + 3 + 4 + 3 + 2 + 4 + 4 + 4 + 1 + 1 = 36 {\displaystyle h(n)=3+1+0+1+2+3+3+4+3+2+4+4+4+1+1=36} == Optimality proof == If an admissible heuristic is used in an algorithm that, per iteration, progresses only the path of lowest evaluation (current cost + heuristic) of several candidate paths, terminates the moment its exploration reaches the goal and, crucially, closes all optimal paths before terminating (something that's possible with A search algorithm if special care isn't taken), then this algorithm can only terminate on an optimal path. To see why, consider the following proof by contradiction: Assume such an algorithm managed to terminate on a path T with a true cost Ttrue greater than the optimal path S with true cost Strue. This means that before terminating, the evaluated cost of T was less than or equal to the evaluated cost of S (or else S would have been picked). Denote these evaluated costs Teval and Seval respectively. The above can be summarized as follows, Strue < Ttrue Teval ≤ Seval If our heuristic is admissible it follows that at this penultimate step Teval = Ttrue because any increase on the true cost by the heuristic on T would be inadmissible and the heuristic cannot be negative. On the other hand, an admissible heuristic would require that Seval ≤ Strue which combined with the above inequalities gives us Teval < Ttrue and more specifically Teval ≠ Ttrue. As Teval and Ttrue cannot be both equal and unequal our assumption must have been false and so it must be impossible to terminate on a more costly than optimal path. As an example, let us say we have costs as follows:(the cost above/below a node is the heuristic, the cost at an edge is the actual cost) 0 10 0 100 0 START ---- O ----- GOAL | | 0| |100 | | O ------- O ------ O 100 1 100 1 100 So clearly we would start off visiting the top middle node, since the expected total cost, i.e. f ( n ) {\displaystyle f(n)} , is 10 + 0 = 10 {\displaystyle 10+0=10} . Then the goal would be a candidate, with f ( n ) {\displaystyle f(n)} equal to 10 + 100 + 0 = 110 {\displaystyle 10+100+0=110} . Then we would clearly pick the bottom nodes one after the other, followed by the updated goal, since they all have f ( n ) {\displaystyle f(n)} lower than the f ( n ) {\displaystyle f(n)} of the current goal, i.e. their f ( n ) {\displaystyle f(n)} is 100 , 101 , 102 , 102 {\displaystyle 100,101,102,102} . So even though the goal was a candidate, we could not pick it because there were still better paths out there. This way, an admissible heuristic can ensure optimality. However, note that although an admissible heuristic can guarantee final optimality, it is not necessarily efficient.

    Read more →
  • Personal web page

    Personal web page

    Personal web pages are World Wide Web pages created by an individual to contain content of a personal nature rather than content pertaining to a company, organization or institution. Personal web pages are primarily used for informative or entertainment purposes but can also be used for personal career marketing (by containing a list of the individual's skills, experience and a CV), social networking with other people with shared interests, or as a space for personal expression. These terms do not usually refer to just a single "page" or HTML file, but to a website—a collection of webpages and related files under a common URL or Web address. In strictly technical terms, a site's actual home page (index page) often only contains sparse content with some catchy introductory material and serves mostly as a pointer or table of contents to the more content-rich pages inside, such as résumés, family, hobbies, family genealogy, a web log/diary ("blog"), opinions, online journals and diaries or other writing, examples of written work, digital audio sound clips, digital video clips, digital photos, or information about a user's other interests. Many personal pages only include information of interest to friends and family of the author. However, some webpages set up by hobbyists or enthusiasts of certain subject areas can be valuable topical web directories. == History == In the 1990s, most Internet service providers (ISPs) provided a free small personal, user-created webpage along with free Usenet News service. These were all considered part of full Internet service. Also several free web hosting services such as GeoCities provided free web space for personal web pages. These free web hosting services would typically include web-based site management and a few pre-configured scripts to easily integrate an input form or guestbook script into the user's site. Early personal web pages were often called "home pages" and were intended to be set as a default page in a web browser's preferences, usually by their owner. These pages would often contain links, to-do lists, and other information their author found useful. In the days when search engines were in their infancy, these pages (and the links they contained) could be an important resource in navigating the web. Since the early 2000s, the rise of blogging and the development of user friendly web page designing software made it easier for amateur users who did not have computer programming or website designer training to create personal web pages. Some website design websites provided free ready-made blogging scripts, where all the user had to do was input their content into a template. At the same time, a personal web presence became easier with the increased popularity of social networking services, some with blogging platforms such as LiveJournal and Blogger. These websites provided an attractive and easy-to-use content management system for regular users. Most of the early personal websites were Web 1.0 style, in which a static display of text and images or photos was displayed to individuals who came to the page. About the only interaction that was possible on these early websites was signing the virtual "guestbook". With the collapse of the dot-com bubble in the late 1990s, the ISP industry consolidated, and the focus of web hosting services shifted away from the surviving ISP companies to independent Internet hosting services and to ones with other affiliations. For example, many university departments provided personal pages for professors and television broadcasters provided them for their on-air personalities. These free webpages served as a perquisite ("perk") for staff, while at the same time boosting the Web visibility of the parent organization. Web hosting companies either charge a monthly fee, or provide service that is "free" (advertising based) for personal web pages. These are priced or limited according to the total size of all files in bytes on the host's hard drive, or by bandwidth, (traffic), or by some combination of both. For those customers who continue to use their ISP for these services, national ISPs commonly continue to provide both disk space and help including ready-made drop-in scripts. With the rise of Web 2.0-style websites, both professional websites and user-created, amateur websites tended to contain interactive features, such as "clickable" links to online newspaper articles or favourite websites, the option to comment on content displayed on the website, the option to "tag" images, videos or links on the site, the option of "clicking" on an image to enlarge it or find out more information, the option of user participation for website guests to evaluate or review the pages, or even the option to create new user-generated content for others to see. A key difference between Web 1.0 personal webpages and Web 2.0 personal pages was while the former tended to be created by hackers, computer programmers and computer hobbyists, the latter were created by a much wider variety of users, including individuals whose main interests lay in hobbies or topics outside of computers (e.g., indie music fans, political activists, and social entrepreneurs). == Motivations == In a study done by Zinkhan, participants had four main reasons to create personal web pages. First, people use personal web pages as a portrayal of self, in a sense marketing themselves, since creators have the freedom to portray their own identities. Second, personal web pages are a way to interact with people who have similar interests as the creator, possible employers, or colleagues. Third, personal web pages can gain social acceptance with groups that the creator is interested in depending on the information that the creator reveals about themselves. Fourth, personal web pages can give creators a sense of connection to the world since these web pages are public and a way to introduce oneself to other people around the globe. People may maintain personal web pages to serve as a showcase for their skills in professional life, creative skills or self promotion of their business, charity or band. The use of personal web pages to display an individual's professional life has become more common in the 21st century. Mary Madden, an expert researcher on privacy and technology, did a study that found a tenth of American jobs require Personal web pages that advertise an individual online. Personal web pages have become a source of initial impression of possible employees used by employers. It can also be used to express opinions on issues ranging from news and politics to movies. Others may use their personal web page as a communication method. For example, an aspiring artist might give out business cards with their personal web page, and invite people to visit their page and see their artwork, "like" their page or sign their guestbook. A personal web page gives the owner generally more control on presence in search results and how they wish to be viewed online. It also allows more freedom in types and quantity of content than a social network profile offers, and can link various social media profiles with each other. It can be used to correct the record on something, or clear up potential confusion between you and someone with the same name. In the 2010s, some amateur writers, bands and filmmakers release digital versions of their stories, songs and short films online, with the aim of gaining an audience and becoming more well-known. While the huge number of aspiring artists posting their work online makes it unlikely for individuals and groups to become popular via the Internet, there are a small number of YouTube stars who were unknown until their online performances garnered them a huge audience. == Sites of academics == Academic professionals (especially at the college and university level), including professors and researchers, are often given online space for creating and storing personal web documents, including personal web pages, CVs and a list of their books, academic papers and conference presentations, on the websites of their employers. This goes back to the early decade of the World Wide Web and its original purpose of providing a quick and easy way for academics to share research papers and data. Researchers may have a personal website to share more information about themselves, about their academic activities and for sharing (unpublished) results of their research. This has been noted as part of the success of open-access repositories such as arXiv.

    Read more →
  • Data deduplication

    Data deduplication

    In computing, data deduplication is a technique for eliminating duplicate copies of repeating data. Successful implementation of the technique can improve storage utilization, which may in turn lower capital expenditure by reducing the overall amount of storage media required to meet storage capacity needs. It can also be applied to network data transfers to reduce the number of bytes that must be sent. The deduplication process requires comparison of data 'chunks' (also known as 'byte patterns') which are unique, contiguous blocks of data. These chunks are identified and stored during a process of analysis, and compared to other chunks within existing data. Whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced. A related technique is single-instance (data) storage, which replaces multiple copies of content at the whole-file level with a single shared copy. While possible to combine this with other forms of data compression and deduplication, it is distinct from newer approaches to data deduplication (which can operate at the segment or sub-block level). Deduplication is different from data compression algorithms, such as LZ77 and LZ78. Whereas compression algorithms identify redundant data inside individual files and encodes this redundant data more efficiently, the intent of deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, and replace them with a shared copy. == Functioning principle == For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy for deduplication ratio of roughly 100 to 1. Deduplication is often paired with data compression for additional storage saving: Deduplication is first used to eliminate large chunks of repetitive data, and compression is then used to efficiently encode each of the stored chunks. In computer code, deduplication is done by, for example, storing information in variables so that they don't have to be written out individually but can be changed all at once at a central referenced location. Examples are CSS classes and named references in MediaWiki. == Benefits == Storage-based data deduplication reduces the amount of storage needed for a given set of files. It is most effective in applications where many copies of very similar or even identical data are stored on a single disk. In the case of data backups, which routinely are performed to protect against data loss, most data in a given backup remain unchanged from the previous backup. Common backup systems try to exploit this by omitting (or hard linking) files that haven't changed or storing differences between files. Neither approach captures all redundancies, however. Hard-linking does not help with large files that have only changed in small ways, such as an email database; differences only find redundancies in adjacent versions of a single file (consider a section that was deleted and later added in again, or a logo image included in many documents). In-line network data deduplication is used to reduce the number of bytes that must be transferred between endpoints, which can reduce the amount of bandwidth required. See WAN optimization for more information. Virtual servers and virtual desktops benefit from deduplication because it allows nominally separate system files for each virtual machine to be coalesced into a single storage space. At the same time, if a given virtual machine customizes a file, deduplication will not change the files on the other virtual machines—something that alternatives like hard links or shared disks do not offer. Backing up or making duplicate copies of virtual environments is similarly improved. == Classification == === Post-process versus in-line deduplication === Deduplication may occur "in-line", as data is flowing, or "post-process" after it has been written. With post-process deduplication, new data is first stored on the storage device and then a process at a later time will analyze the data looking for duplication. The benefit is that there is no need to wait for the hash calculations and lookup to be completed before storing the data, thereby ensuring that store performance is not degraded. Implementations offering policy-based operation can give users the ability to defer optimization on "active" files, or to process files based on type and location. One potential drawback is that duplicate data may be unnecessarily stored for a short time, which can be problematic if the system is nearing full capacity. Alternatively, deduplication hash calculations can be done in-line: synchronized as data enters the target device. If the storage system identifies a block which it has already stored, only a reference to the existing block is stored, rather than the whole new block. The advantage of in-line deduplication over post-process deduplication is that it requires less storage and network traffic, since duplicate data is never stored or transferred. On the negative side, hash calculations may be computationally expensive, thereby reducing the storage throughput. However, certain vendors with in-line deduplication have demonstrated equipment which performs in-line deduplication at high rates. Post-process and in-line deduplication methods are often heavily debated. === Data formats === The SNIA Dictionary identifies two methods: Content-agnostic data deduplication – a data deduplication method that does not require awareness of specific application data formats. Content-aware data deduplication – a data deduplication method that leverages knowledge of specific application data formats. === Source versus target deduplication === Another way to classify data deduplication methods is according to where they occur. Deduplication occurring close to where data is created, is referred to as "source deduplication". When it occurs near where the data is stored, it is called "target deduplication". Source deduplication ensures that data on the data source is deduplicated. This generally takes place directly within a file system. The file system will periodically scan new files creating hashes and compare them to hashes of existing files. When files with same hashes are found then the file copy is removed and the new file points to the old file. Unlike hard links however, duplicated files are considered to be separate entities and if one of the duplicated files is later modified, then using a system called copy-on-write a copy of that changed file or block is created. The deduplication process is transparent to the users and backup applications. Backing up a deduplicated file system will often cause duplication to occur resulting in the backups being bigger than the source data. Source deduplication can be declared explicitly for copying operations, as no calculation is needed to know that the copied data is in need of deduplication. This leads to a new form of link on file systems, called a reference-counted link, or reflink, in some systems (e.g. Linux), or a cloned file on macOS, where one or more inodes (file information entries) are made to share some or all of their data. It is named analogously to hard links, which work at the inode level, and symbolic links, which work at the filename level.The individual entries have a copy-on-write behavior that is non-aliasing, i.e. changing one copy afterwards will not affect other copies. Microsoft's ReFS also supports this operation. Target deduplication is the process of removing duplicates when the data was not generated at that location. Example of this would be a server connected to a SAN/NAS, The SAN/NAS would be a target for the server (target deduplication). The server is not aware of any deduplication, the server is also the point of data generation. A second example would be backup. Generally this will be a backup store such as a data repository or a virtual tape library. === Deduplication methods === One of the most common forms of data deduplication implementations works by comparing chunks of data to detect duplicates. For that to happen, each chunk of data is assigned an identification, calculated by the software, typically using cryptographic hash functions. In many implementations, the assumption is made that if the identification is identical, the data is identical, even though this cannot be true in all cases due to the pigeonhole principle; other implementations do not as

    Read more →
  • Social media newsroom

    Social media newsroom

    A social media newsroom is a company resource, set up to increase the functionality and usability of the traditional online newsroom. Social media newsrooms (SMNs) are intended to encourage dialogue and information sharing. Unlike online newsrooms, content is accessible to more than just journalists, but to all those with whom the company engages such as bloggers, their prospects, customers, business partners and investors. It gives these stakeholders access to news, public relations announcements, images, audio, video and other multimedia files. In addition to posting press releases and corporate news, companies can integrate other social content from sites such as YouTube, Flickr and Slideshow as well as streams from corporate Twitter accounts. Traditional tools for journalists such as corporate fast facts, leadership information, a multimedia library, financial information, awards and other recent media coverage are also included in an SMN. Examples of companies effectively using social media newsrooms include Opel Group, Pressat, First Direct, MyNewsdesk, Scania and Newport Beach.

    Read more →
  • VOCEDplus

    VOCEDplus

    VOCEDplus is a free international research database about tertiary education, maintained and developed by staff at the c (NCVER) in Adelaide, South Australia. The focus of the database content is the relation of post-compulsory education and training to workforce needs, skills development, and social inclusion. == Structure == The content of the VOCEDplus database encompasses vocational education and training (VET), higher education, lifelong learning, informal learning, VET in schools, adult and community education, apprenticeships/traineeships, international education, providers of education and training, and workforce development. It is international in scope and contains over 84,000 English language records, many with links to full text documents. VOCEDplus contains extensive Australian materials and includes a wide range of international information, covering outcomes of tertiary education in the shape of published research, practice, policy, and statistics. Entries are included for the following types of publications: reports; annual reports; papers; discussion papers; occasional papers; working papers; books; book chapters; conference papers; conference proceedings; journals; journal articles; policy documents; published statistics; theses; podcasts; and teaching and training materials. Each database entry contains standard bibliographic information and an abstract. Many entries include full text access via the publisher's website or a digitised copy. == History == === 1989-1997 === In the early years VOCEDplus was known as VOCED. The original database was produced by a network of clearinghouses across Australia with the aim of sharing activities in the technical and further education (TAFE) sector. VOCED was produced in hardcopy and an electronic version was distributed on diskette. === 1997-2001 === 1997 - the first web version of VOCED was made available from the National Centre for Vocational Education Research (NCVER) organisational website 1998 - a major project to upgrade the database and expand its international coverage commenced 2001 - creation of VOCED's own website 2001 - VOCED endorsed as the UNESCO international database for technical and vocational education and training (TVET) research information === 2001-2009 === Many changes to the database and website occurred during this period with a focus on continuous improvement to meet the needs of users and utilise emerging technologies. 2006 - materials produced for two adult literacy and learning programs funded by the Australian Department of Education, Employment and Workplace Relations (DEEWR) - the Workplace English Language and Learning (WELL) Programme and the Adult Literacy National Project (ALNP) included in VOCED 2007 - the Australian clearinghouse network transferred most of the hardcopy collections to NCVER, to form a centralised repository of resources 2009 - materials produced by Reframing the Future (RTF) a vocational education and training workforce development initiative of the Australian, State and Territory Governments included in VOCED === 2009-2014 === A major rebuild of the database and website was undertaken during this period to take advantage of the potential of new technologies to provide improved services and incorporate Web 2.0 technologies (RSS feeds, and share and bookmark tools). 2009 - scope expanded to more fully encompass the higher education sector 2011 - launch of VOCEDplus with the name change representing the enhanced features and extended focus 2012 - a major retrospective digitisation project commenced and by the end of the 2012-2013 financial year a total of 9,328 publications (593,534 pages/microfiche frames) had been digitised, ensuring these publications are available electronically for free === 2014-2019 === A number of significant curated content products were released during this period. 2015 - release of a refreshed look to adopt the new NCVER branding plus a number of search enhancements (Guided search, Expert search, and Glossary search) were added 2015 - first in the series of 'Focus on...' pages released 2016 - launch of the 'Pod Network', a convenient and efficient platform that allows instant access to research and a multitude of resources on a range of subjects 2017 - completion of the 'Pod Network', consisting of 20 Pods (on broad subjects including Apprenticeships and traineeships, Foundation skills, Teaching and learning, Career development, and Students) and 74 Podlets (on narrow topics including Online learning, Social media, VET in schools, STEM skills, and Adult literacy) 2018 - launch of the 'Timeline of Australian VET Policy Initiatives' and the 'VET Knowledge Bank' which contains a suite of products capturing Australia's diverse, complex and ever-changing VET system 2019 - after an internal review, a refreshed, streamlined version of the 'Pod Network' was released, consisting of 13 Pods and 20 Podlets 2019 - launch of the 'VET Practitioner Resource' which contains a range of information to support VET practitioners in their work and is organised into three sections: (1) Teaching, training and assessment: standards, guidance, research and good practice resources to inform daily work; (2) Practitioners as researchers: information for undertaking practitioner-led research; and (3) The VET workforce: information about VET teachers and trainers, and the professional development needs of the VET workforce 2019 - VOCEDplus celebrated 30 years of providing information to the tertiary education sector and the homepage was refreshed to make it more modern and easier to use === 2020- === VOCEDplus continued to be accessible throughout the COVID-19 pandemic. 2020-2021 - the VET Knowledge Bank added a dedicated page, 'COVID-19 announcements', that showcases the measures introduced by the Australian, state and territory governments to mitigate the impact of the pandemic and promote economic recovery 2020-2024 - published research about the effects of the pandemic on education and training, providers, students, labour markets, employment and employees was collected and made permanently available in the database 2024 - VOCEDplus celebrated 35 years of providing information to the tertiary education sector. The homepage was refreshed and a number of enhancements and new features were implemented including a new My Profile feature, improvements to My Selection, accessible search history and saved searches, enhanced search functionality, and improved navigation.

    Read more →
  • List of cryptography journals

    List of cryptography journals

    List of cryptography journals includes notable peer-reviewed academic journals that focus on cryptography, cryptanalysis, information security, and related areas in computer science and mathematics. == Notable journals == Cryptologia Designs, Codes and Cryptography IEEE Transactions on Information Theory International Journal of Information Security Journal of Cryptology Journal of Computer Security ACM Transactions on Privacy and Security Information Processing Letters Information and Computation

    Read more →
  • ESign (India)

    ESign (India)

    Aadhaar eSign is an online electronic signature service in India to facilitate an Aadhaar holder to digitally sign a document. The signature service is facilitated by authenticating the Aadhaar holder via the Aadhaar-based e-KYC (electronic Know Your Customer) service. To eSign a document, one has to have an Aadhaar card and a mobile number registered with Aadhaar. With these two things, an Indian citizen can sign a document remotely without being physically present. == Procedure == The notification issued by Government of India in this regard stipulates the following procedure for the e-authentication using Aadhaar e-KYC services. Authentication of an electronic record by e-authentication technique, which shall be done by the applicable use of e-authentication, hash function, and asymmetric cryptosystem techniques, leading to issuance of digital signature certificate by Certifying Authority, a trusted third party service by subscriber's key pair generation, storing of the key pairs on hardware security module and creation of digital signature provided that the trusted third party shall be offered by the certifying authority (the trusted third party shall send application form and certificate signing request to the Certifying Authority for issuing a digital signature certificate to the subscriber), issuance of digital signature certificate by Certifying Authority shall be based on e-authentication, particulars given in the prescribed format, digitally signed verified information from Aadhaar e-KYC services and electronic consent of digital signature certificate applicant, the manner and requirements for e-authentication shall be as issued by the Controller from time to time, the security procedure for creating the subscriber's key pair shall be in accordance with the e-authentication guidelines issued by the Controller, the standards referred to in rule 6 of the Information Technology (Certifying Authorities) Rules, 2000 shall be complied with, in so far as they relate to the certification function of public key of Digital Signature Certificate applicant, and the manner in which information is authenticated by means of digital signature shall comply with the standards specified in rule 6 of the Information Technology (Certifying Authorities) Rules, 2000 in so far as they relate to the creation, storage and transmission of Digital Signature. == eSign Service Providers == Organisations and individuals seeking to obtain the eSigning Service can utilize the services of various service providers. There are empanelled service providers with whom organisations can register as an Application Service Prover after submitting the requisite documents, getting UAT access, building the application around the service and going through an IT Audit by an CERT-IN empanelled auditor. However, the process of registering as an Application Service Provider is cumbersome, and requires huge investments of time, money and resources in complying with the regulations and building a suitable application. Most organisations prefer using services of plug-n-play gateway providers who take the responsibility of complying with the regulations, hence simplifying the process for the market.

    Read more →
  • Control break

    Control break

    In computer programming, a control break is a change in the value of one of the keys on which a file is sorted, which requires some extra processing. For example, with an input file sorted by post code, the number of items found in each postal district might need to be printed on a report, and a heading shown for the next district. Quite often there is a hierarchy of nested control breaks in a program, such as streets within districts within areas, with the need for a grand total at the end. Structured programming techniques have been developed to ensure correct processing of control breaks in languages such as COBOL and to ensure that conditions such as empty input files and sequence errors are handled properly. With fourth-generation languages such as SQL, the programming language should handle most of the details of control breaks automatically.

    Read more →
  • KoalaPad

    KoalaPad

    The KoalaPad is a graphics tablet, released in 1983 by US company Koala Technologies Corporation, for the Apple II, TRS-80 Color Computer (as the TRS-80 Touch Pad), Atari 8-bit computers, Commodore 64, and IBM PC compatibles. Originally designed by Dr. David Thornburg as a low-cost computer drawing tool for schools, the Koala Pad and the bundled drawing program, KoalaPainter, was popular with home users as well. KoalaPainter was called KoalaPaint in some versions for the Apple II, and PC Design for the IBM PC. A program called Graphics Exhibitor was included for creating slideshow presentations from KoalaPainter drawings. == Description == The pad was four inches square (i.e. roughly 10×10 cm) and mounted on a slightly inclined base with the back of the pad higher than the front. At the top, "behind" the pad, were two buttons. The pad hooked into the computer using the analog signals of the joystick ports (the so-called paddle inputs), which meant that it had a low resolution and tended to jostle the cursor if moved during use. As an alternative to the drawing stylus, the pad could as easily be operated by the user's fingers for tasks that demanded less precision, such as selecting between menu items (thus using the pad as a kind of "indirect touch screen"). The top-mounted buttons tended to be somewhat frustrating to use, as the user had to "reach around" the stylus to push the buttons in order to start or stop drawing. A similar tablet from Atari, the Atari CX77 Touch Tablet, addressed this with a built-in button on the stylus, which some enterprising users adapted for use with their KoalaPad. == KoalaPainter == The pad shipped with a simple bitmap graphics editor developed by Audio Light called KoalaPainter, PC Design or Micro Illustrator depending on the target machine (see release history). Although bundled with the pad, KoalaPainter could also be operated using an ordinary digital joystick. One unique feature of the program, for its time, was that it held two pictures in the computer's memory, allowing the user to flip from one to the other—a function commonly used in order to study the differences between an original and a modified picture, and to copy and paste between two different pictures. Some third-party bitmap editors could also be used with the KoalaPad, such as Broderbund's Dazzle Draw for the Apple II. === Release history === KoalaPainter for Commodore 64 (1983) and Atari 8-bit computers (1983) PC Design for the IBM PC (1983) Micro Illustrator for the Apple II (1983), Atari 8-bit computers (1983) and Commodore Plus/4 (1984) KoalaPainter II for Commodore 64 (1984) === Reception === Ahoy! called KoalaPainter "a very powerful and effective color drawing package", and concluded that it and the KoalaPad were "excellent in ease of use, a fine choice for a beginner as well as young children". BYTE's reviewer stated in December 1984 that he made far fewer errors when using an Apple Mouse with MousePaint than with a KoalaPad and its software. He found that MousePaint was easier to use and more efficient, predicting that the mouse would receive more software support than the pad. Cassie Stahl in InfoWorld's Essential Guide to Atari Computers praised the tablet and its documentation, rating it "Excellent" among all categories and stating that "Playing with the KoalaPad becomes addictive. It does everything it claims to, and it does it well". She also liked Micro Illustrator, rating it "Excellent" except for "Good" for Performance. While criticizing the limited erase function, Stahl reported an undocumented feature enabling exporting pictures to other software. === File format === The Commodore 64 version of KoalaPainter used a fairly simple file format corresponding directly to the way bitmapped graphics are handled on the computer: A two-byte load address, followed immediately by 8,000 bytes of raw bitmap data, 1,000 bytes of raw "Video Matrix" data, 1,000 bytes of raw "Color RAM" data, and a one-byte Background Color field. == KoalaWare == Koala Technologies offered more software beyond the bundled KoalaPainter and Graphics Exhibitor for use with the pad. Among these applications, marketed under the moniker KoalaWare (like KoalaPainter itself), was educational software for use with customized keypads and overlays, such as spelling tools, music programs, and mathematics instruction software, as well as software for "translating" graphical designs into Logo programs.

    Read more →
  • SIGINT Activity Designator

    SIGINT Activity Designator

    A SIGINT Activity Designator (or SIGAD) identifies a signals intelligence (SIGINT) line of collection activity associated with a signals collection station, such as a base or a ship. For example, the SIGAD for Menwith Hill in the UK is USD1000. SIGADs are used by the signals intelligence agencies of Australia, Canada, New Zealand, the United Kingdom, and the United States (the Five Eyes). There are several thousand SIGADs including the substation SIGADs denoted with a trailing alpha character. Several dozen of these are significant. The leaked Boundless Informant reporting screenshot showed that it summarized 504 active SIGADs during a 30-day period in March 2013. == General format == A SIGAD consists of five to eight case insensitive alphanumeric characters. It takes the general form of an alphanumeric designator normally composed of a two- or three-letter prefix followed by one to three numbers. Often a dash is used to separate the alphabetic and numeric characters in the primary part of the designator, but less frequently a space is used as a separator or the alphabetic and numeric characters are concatenated together. An additional alphabetic character can be added to denote a sub-designator for a subset of the primary unit, such as a detachment. Lastly, a numeric character can be added after the aforementioned alphabetic to provide for a sub-sub-designator. In the examples below an X represents an alphabetic character and an N represents a numeric character that are part of the primary designator. Likewise, an x represents an alphabetic character and an n represents a numeric character that are part of a sub-designator. Here are valid generalized examples of SIGADs: The first two characters show which country operates the particular SIGINT facility, which can be US for the United States, UK for the United Kingdom, CA for Canada, AU for Australia and NZ for New Zealand. A third letter shows what sort of staff runs the station. SIGADs beginning with US without a third letter are used for intercept facilities run by the NSA. == PRISM SIGAD == One prominent SIGAD as of April 2013 is US-984XN, with an unclassified codename of PRISM. It is "the number one source of raw intelligence used for NSA analytic reports" according to National Security Agency sources in a document leaked by Edward Snowden. The President's Daily Brief, an all-source intelligence product, cited SIGAD US-984XN as a source in 1,477 items in 2012. The U.S. government operates the PRISM electronic surveillance collection program through NSA's Special Source Operations, an alliance with trusted telecommunications providers. == SIGADs for spy ships == The declassified SIGAD for the USS Liberty (AGTR-5) was USN-855. The USS Liberty incident occurred on 8 June 1967, during the Six-Day War, when Israeli Air Force jet fighter aircraft and Israeli Navy motor torpedo boats attacked the USS Liberty in international waters. The USS Pueblo (AGER-2) was a technical research ship, which was boarded and captured by North Korean forces on 23 January 1968, in what is known as the Pueblo incident. The declassified SIGAD for the NSA Direct Support Unit (DSU) from the Naval Security Group (NSG) on the USS Pueblo patrol involved in the incident was USN-467Y. The USS Pueblo, which officially remains a commissioned vessel of the United States Navy, is the only ship of the U.S. Navy currently being held captive. == Vietnam War SIGADs == The following are the Vietnam War-era declassified SIGADs from inside South Vietnam during the period of 1969 to 1975: Some locations have multiple SIGADs due to different types of collection activities and/or collection at different times during the period. The SIGADs beginning with USA were operated by the United States Air Force's United States Air Force Security Service (USAFSS). The SIGADs beginning with USM were operated by the United States Army's Army Security Agency (ASA). Lastly, the SIGADs beginning with USN were operated by the United States Navy's Naval Security Group (NAVSECGRU). All three of these units have been merged into other units or inactivated. The above list consists of the higher-echelon SIGADs. It does not include the numerous miscellaneous and temporary detachments, or direction finding stations belonging to major units or sites unless that detachment or site was the only one stationed in South Vietnam. Many of the "dets" were short-lived, often formed to support ongoing MACV operations or forward deployments of combat operational or maneuver units. These detachments usually were designated by a letter suffix attached to the higher-echelon SIGAD such as "USM-633J," which was a detachment of the 372d Radio Research Company, USM-633, supporting the United States Army's 25th Infantry Division. === Supporting Southeast Asia SIGADs === The following declassified SIGADs were highly relevant to the Vietnam Campaign, but were located in areas outside of South Vietnam in Southeast Asia. Again, detachments are not listed separately. In the case of the USS Maddox, naval Direct Support Units (DSUs) used the SIGAD USN-467 as a generic designator for their missions. Each specific patrol received a letter suffix for its duration. The subsequent mission would receive the next letter in an alphabetic sequence. Thus, SIGAD USN-467N specifically designates the USS Maddox patrol involved with the Gulf of Tonkin incident. == Joint Base SIGADs == In November 2005, the US Congress performed a fifth round of Base Realignment and Closure. This 2005 law also created twelve joint bases by merging adjacent installations belonging to different services in an effort to reduce costs and improve efficiencies. Joint bases with a primarily SIGINT mission have SIGADs that begin with USJ. A joint base would have a primary SIGAD in the general form of USJ-NNN, where NNN are numeric characters. An actual example is not given, since these units are currently active.

    Read more →
  • Social media mining

    Social media mining

    Social media mining is the process of obtaining data from user-generated content on social media in order to extract actionable patterns, form conclusions about users, and act upon the information. Mining supports targeting advertising to users or academic research. The term is an analogy to the process of mining for minerals. Mining companies sift through raw ore to find the valuable minerals; likewise, social media mining sifts through social media data in order to discern patterns and trends about matters such as social media usage, online behaviour, content sharing, connections between individuals, buying behaviour. These patterns and trends are of interest to companies, governments and not-for-profit organizations, as such organizations can use the analyses for tasks such as design strategies, introduce programs, products, processes or services. Social media mining uses concepts from computer science, data mining, machine learning, and statistics. Mining is based on social network analysis, network science, sociology, ethnography, optimization and mathematics. It attempts to formally represent, measure and model patterns from social media data. In the 2010s, major corporations, governments and not-for-profit organizations began mining to learn about customers, clients and others. Platforms such as Google, Facebook (partnered with Datalogix and BlueKai) conduct mining to target users with advertising. Scientists and machine learning researchers extract insights and design product features. Users may not understand how platforms use their data. Users tend to click through Terms of Use agreements without reading them, leading to ethical questions about whether platforms adequately protect users' privacy. During the 2016 United States presidential election, Facebook allowed Cambridge Analytica, a political consulting firm linked to the Trump campaign, to analyze the data of an estimated 87 million Facebook users to profile voters, creating controversy when this was revealed. == Background == As defined by Kaplan and Haenlein, social media is the "group of internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content." There are many categories of social media including, but not limited to, social networking (Facebook or LinkedIn), microblogging (Twitter), photo sharing (Flickr, Instagram, Photobucket, or Picasa), news aggregation (Google Reader, StumbleUpon, or Feedburner), video sharing (YouTube, MetaCafe), livecasting (Ustream or Twitch), virtual worlds (Kaneva), social gaming (World of Warcraft), social search (Google, Bing, or Ask.com), and instant messaging (Google Talk, Skype, or Yahoo! messenger). The first social media website was introduced by GeoCities in 1994. It enabled users to create their own homepages without having a sophisticated knowledge of HTML coding. The first social networking site, SixDegrees.com, was introduced in 1997. Since then, many other social media sites have been introduced, each providing service to millions of people. These individuals form a virtual world in which individuals (social atoms), entities (content, sites, etc.) and interactions (between individuals, between entities, between individuals and entities) coexist. Social norms and human behavior govern this virtual world. By understanding these social norms and models of human behavior and combining them with the observations and measurements of this virtual world, one can systematically analyze and mine social media. Social media mining is the process of representing, analyzing, and extracting meaningful patterns from data in social media, resulting from social interactions. It is an interdisciplinary field encompassing techniques from computer science, data mining, machine learning, social network analysis, network science, sociology, ethnography, statistics, optimization, and mathematics. Social media mining faces grand challenges such as the big data paradox, obtaining sufficient samples, the noise removal fallacy, and evaluation dilemma. Social media mining represents the virtual world of social media in a computable way, measures it, and designs models that can help us understand its interactions. In addition, social media mining provides necessary tools to mine this world for interesting patterns, analyze information diffusion, study influence and homophily, provide effective recommendations, and analyze novel social behavior in social media. == Uses == Social media mining is used across several industries including business development, social science research, health services, and educational purposes. Once the data received goes through social media analytics, it can then be applied to these various fields. Often, companies use the patterns of connectivity that pervade social networks, such as assortativity—the social similarity between users that are induced by influence, homophily, and reciprocity and transitivity. These forces are then measured via statistical analysis of the nodes and connections between these nodes. Social analytics also uses sentiment analysis, because social media users often relay positive or negative sentiment in their posts. This provides important social information about users' emotions on specific topics. These three patterns have several uses beyond pure analysis. For example, influence can be used to determine the most influential user in a particular network. Companies would be interested in this information in order to decide who they may hire for influencer marketing. These influencers are determined by recognition, activity generation, and novelty—three requirements that can be measured through the data mined from these sites. Analysts also value measures of homophily: the tendency of two similar individuals to become friends. Users have begun to rely on information of other users' opinions in order to understand diverse subject matter. These analyses can also help create recommendations for individuals in a tailored capacity. By measuring influence and homophily, online and offline companies are able to suggest specific products for individuals consumers, and groups of consumers. Social media networks can use this information themselves to suggest to their users possible friends to add, pages to follow, and accounts to interact with. == Perception == Modern social media mining is a controversial practice that has led to exponential gains in user growth for tech giants such as Facebook, Inc., Twitter, and Google. Companies such as these, considered "Big Tech" are companies that build algorithms that take advantage of user input to understand their preferences, and keep them on the platform as much as possible. These inputs, that can be as simple as time spent on a given screen, provide the data being mined, and lead to companies profiting heavily from using that data to capitalize on extremely accurate predictions about user behavior. The growth of platforms accelerated rapidly once these strategies were put in place; Most of the largest platforms now average over 1 billion active users per month as of 2021. It has been claimed by a multitude of anti-algorithm personalities, like Tristan Harris or Chamath Palihapitiya, that certain companies (specifically Facebook) valued growth above all else, and ignored potential negative impacts from these growth engineering tactics. At the same time, users have now created their own data arbitrages with the help of their own data, through content monetization and becoming influencers. Users typically have access to a varied set of analytics specific to people that interact with them on social media, and can use these as building blocks for their own targeting and growth strategies through ads and posts that cater to their audiences. Influencers also commonly promote products and services for established brands, creating one of the largest digital industries: Influencer marketing. Instagram, Facebook, Twitter, YouTube, Google, and others have long given access to platform analytics, and allowed third parties to access that information as well, at times unbeknownst to even the user whose data is being viewed/bought. == Research == === Research areas === Social media event detection – Social networks enable users to freely communicate with each other and share their recent news, ongoing activities or views about different topics. As a result, they can be seen as a potentially viable source of information to understand the current emerging topics/events. Public health monitoring and surveillance - Using large-scale analysis of social media to study large cohorts of patients and the general public, e.g. to obtain early warning signals of drug-drug interactions and adverse drug reactions, or understand human reproduction and sexual interest. Community structure (Community Detection/Evolution/Evaluation) – Identifying communities on social networks, how t

    Read more →
  • Data lineage

    Data lineage

    Data lineage refers to the process of tracking how data is generated, transformed, transmitted and used across systems over time. It documents data's origins, transformations and movements, providing detailed visibility into its life cycle. This process simplifies the identification of errors in data analytics workflows, by enabling users to trace issues back to their root causes. Data lineage facilitates the ability to replay specific segments or inputs of the dataflow. This can be used in debugging or regenerating lost outputs. In database systems, this concept is closely related to data provenance, which involves maintaining records of inputs, entities, systems and processes that influence data. Data provenance provides a historical record of data origins and transformations. It supports forensic activities such as data-dependency analysis, error/compromise detection, recovery, auditing and compliance analysis: "Lineage is a simple type of why provenance." Data governance plays a critical role in managing metadata by establishing guidelines, strategies and policies. Enhancing data lineage with data quality measures and master data management adds business value. Although data lineage is typically represented through a graphical user interface (GUI), the methods for gathering and exposing metadata to this interface can vary. Based on the metadata collection approach, data lineage can be categorized into three types: Those involving software packages for structured data, programming languages and Big data systems. Data lineage information includes technical metadata about data transformations. Enriched data lineage may include additional elements such as data quality test results, reference data, data models, business terminology, data stewardship information, program management details and enterprise systems associated with data points and transformations. Data lineage visualization tools often include masking features that allow users to focus on information relevant to specific use cases. To unify representations across disparate systems, metadata normalization or standardization may be required. == Representation of data lineage == Representation broadly depends on the scope of the metadata management and reference point of interest. Data lineage provides sources of the data and intermediate data flow hops from the reference point with backward data lineage, leading to the final destination's data points and its intermediate data flows with forward data lineage. These views can be combined with end-to-end lineage for a reference point that provides a complete audit trail of that data point of interest from sources to their final destinations. As the data points or hops increase, the complexity of such representation becomes incomprehensible. Thus, the best feature of the data lineage view is the ability to simplify the view by temporarily masking unwanted peripheral data points. Tools with the masking feature enable scalability of the view and enhance analysis with the best user experience for both technical and business users. Data lineage also enables companies to trace sources of specific business data to track errors, implement changes in processes and implementing system migrations to save significant amounts of time and resources. Data lineage can improve efficiency in business intelligence BI processes. Data lineage can be represented visually to discover the data flow and movement from its source to destination via various changes and hops on its way in the enterprise environment. This includes how the data is transformed along the way, how the representation and parameters change and how the data splits or converges after each hop. A simple representation of the Data Lineage can be shown with dots and lines, where dots represent data containers for data points, and lines connecting them represent transformations the data undergoes between the data containers. Data lineage can be visualized at various levels based on the granularity of the view. At a very high-level, data lineage is visualized as systems that the data interacts with before it reaches its destination. At its most granular, visualizations at the data point level can provide the details of the data point and its historical behavior, attribute properties and trends and data quality of the data passed through that specific data point in the data lineage. The scope of the data lineage determines the volume of metadata required to represent its data lineage. Usually, data governance and data management of an organization determine the scope of the data lineage based on their regulations, enterprise data management strategy, data impact, reporting attributes and critical data elements of the organization. == Rationale == Distributed systems like Google Map Reduce, Microsoft Dryad, Apache Hadoop (an open-source project) and Google Pregel provide such platforms for businesses and users. However, even with these systems, Big Data analytics can take several hours, days or weeks to run, simply due to the data volumes involved. For example, a ratings prediction algorithm for the Netflix Prize challenge took nearly 20 hours to execute on 50 cores, and a large-scale image processing task to estimate geographic information took 3 days to complete using 400 cores. "The Large Synoptic Survey Telescope is expected to generate terabytes of data every night and eventually store more than 50 petabytes, while in the bioinformatics sector, the 12 largest genome sequencing houses in the world now store petabytes of data apiece. It is very difficult for a data scientist to trace an unknown or an unanticipated result. === Big data debugging === Big data analytics is the process of examining large data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. Machine learning, among other algorithms, is used to transform and analyze the data. Due to the large size of the data, there could be unknown features in the data. The massive scale and unstructured nature of data, the complexity of these analytics pipelines, and long runtimes pose significant manageability and debugging challenges. Even a single error in these analytics can be extremely difficult to identify and remove. While one may debug them by re-running the entire analytics through a debugger for stepwise debugging, this can be expensive due to the amount of time and resources needed. Auditing and data validation are other major problems due to the growing ease of access to relevant data sources for use in experiments, the sharing of data between scientific communities and use of third-party data in business enterprises. As such, more cost-efficient ways of analyzing data intensive scale-able computing (DISC) are crucial to their continued effective use. === Challenges in Big Data debugging === ==== Massive scale ==== According to an EMC/IDC study, 2.8 ZB of data were created and replicated in 2012. Furthermore, the same study states that the digital universe will double every two years between now and 2020, and that there will be approximately 5.2 TB of data for every person in 2020. Based on current technology, the storage of this much data will mean greater energy usage by data centers. ==== Unstructured data ==== Unstructured data usually refers to information that doesn't reside in a traditional row-column database. Unstructured data files often include text and multimedia content, such as e-mail messages, word processing documents, videos, photos, audio files, presentations, web pages and many other kinds of business documents. While these types of files may have an internal structure, they are still considered "unstructured" because the data they contain doesn't fit neatly into a database. The amount of unstructured data in enterprises is growing many times faster than structured databases are growing. Big data can include both structured and unstructured data, but IDC estimates that 90 percent of Big Data is unstructured data. The fundamental challenge of unstructured data sources is that they are difficult for non-technical business users and data analysts alike to unbox, understand and prepare for analytic use. Beyond issues of structure, the sheer volume of this type of data contributes to such difficulty. Because of this, current data mining techniques often leave out valuable information and make analyzing unstructured data laborious and expensive. In today's competitive business environment, companies have to find and analyze the relevant data they need quickly. The challenge is going through the volumes of data and accessing the level of detail needed, all at a high speed. The challenge only grows as the degree of granularity increases. One possible solution is hardware. Some vendors are using increased memory and parallel processing to crunch large volumes of data quickly. Another method is putting data in-memory but using a grid

    Read more →
  • Seq2seq

    Seq2seq

    Seq2seq is a family of machine learning approaches used for natural language processing. Originally developed by Lê Viết Quốc, a Vietnamese computer scientist and a machine learning pioneer at Google Brain, this framework has become foundational in many modern AI systems. Applications include language translation, image captioning, conversational models, speech recognition, and text summarization. Seq2seq uses sequence transformation: it turns one sequence into another sequence. == History == One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: 'This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode. seq2seq is an approach to machine translation (or more generally, sequence transduction) with roots in information theory, where communication is understood as an encode-transmit-decode process, and machine translation can be studied as a special case of communication. This viewpoint was elaborated, for example, in the noisy channel model of machine translation. In practice, seq2seq maps an input sequence into a real-numerical vector by using a neural network (the encoder), and then maps it back to an output sequence using another neural network (the decoder). The idea of encoder-decoder sequence transduction had been developed in the early 2010s. The papers most commonly cited as the originators that produced seq2seq are two papers from 2014. In the seq2seq as proposed by them, both the encoder and the decoder were LSTMs. This had the "bottleneck" problem, since the encoding vector has a fixed size, so for long input sequences, information would tend to be lost, as they are difficult to fit into the fixed-length encoding vector. The attention mechanism, proposed in 2014, resolved the bottleneck problem. They called their model RNNsearch, as it "emulates searching through a source sentence during decoding a translation". A problem with seq2seq models at this point was that recurrent neural networks are difficult to parallelize. The 2017 publication of Transformers resolved the problem by replacing the encoding RNN with self-attention Transformer blocks ("encoder blocks"), and the decoding RNN with cross-attention causally-masked Transformer blocks ("decoder blocks"). === Priority dispute === One of the papers cited as the originator for seq2seq is (Sutskever et al 2014), published at Google Brain while they were on Google's machine translation project. The research allowed Google to overhaul Google Translate into Google Neural Machine Translation in 2016. Tomáš Mikolov claims to have developed the idea (before joining Google Brain) of using a "neural language model on pairs of sentences... and then [generating] translation after seeing the first sentence"—which he equates with seq2seq machine translation, and to have mentioned the idea to Ilya Sutskever and Quoc Le (while at Google Brain), who failed to acknowledge him in their paper. Mikolov had worked on RNNLM (using RNN for language modelling) for his PhD thesis, and is more notable for developing word2vec. == Architecture == The main reference for this section is. === Encoder === The encoder is responsible for processing the input sequence and capturing its essential information, which is stored as the hidden state of the network and, in a model with attention mechanism, a context vector. The context vector is the weighted sum of the input hidden states and is generated for every time instance in the output sequences. === Decoder === The decoder takes the context vector and hidden states from the encoder and generates the final output sequence. The decoder operates in an autoregressive manner, producing one element of the output sequence at a time. At each step, it considers the previously generated elements, the context vector, and the input sequence information to make predictions for the next element in the output sequence. Specifically, in a model with attention mechanism, the context vector and the hidden state are concatenated together to form an attention hidden vector, which is used as an input for the decoder. The seq2seq method developed in the early 2010s uses two neural networks: an encoder network converts an input sentence into numerical vectors, and a decoder network converts those vectors to sentences in the target language. The Attention mechanism was grafted onto this structure in 2014 and is shown below. Later it was refined into the encoder-decoder Transformer architecture of 2017. === Training vs prediction === There is a subtle difference between training and prediction. During training time, both the input and the output sequences are known. During prediction time, only the input sequence is known, and the output sequence must be decoded by the network itself. Specifically, consider an input sequence x 1 : n {\displaystyle x_{1:n}} and output sequence y 1 : m {\displaystyle y_{1:m}} . The encoder would process the input x 1 : n {\displaystyle x_{1:n}} step by step. After that, the decoder would take the output from the encoder, as well as the as input, and produce a prediction y ^ 1 {\displaystyle {\hat {y}}_{1}} . Now, the question is: what should be input to the decoder in the next step? A standard method for training is "teacher forcing". In teacher forcing, no matter what is output by the decoder, the next input to the decoder is always the reference. That is, even if y ^ 1 ≠ y 1 {\displaystyle {\hat {y}}_{1}\neq y_{1}} , the next input to the decoder is still y 1 {\displaystyle y_{1}} , and so on. During prediction time, the "teacher" y 1 : m {\displaystyle y_{1:m}} would be unavailable. Therefore, the input to the decoder must be y ^ 1 {\displaystyle {\hat {y}}_{1}} , then y ^ 2 {\displaystyle {\hat {y}}_{2}} , and so on. It is found that if a model is trained purely by teacher forcing, its performance would degrade during prediction time, since generation based on the model's own output is different from generation based on the teacher's output. This is called exposure bias or a train/test distribution shift. A 2015 paper recommends that, during training, randomly switch between teacher forcing and no teacher forcing. === Attention for seq2seq === The attention mechanism is an enhancement introduced by Bahdanau et al. in 2014 to address limitations in the basic Seq2Seq architecture where a longer input sequence results in the hidden state output of the encoder becoming irrelevant for the decoder. It enables the model to selectively focus on different parts of the input sequence during the decoding process. At each decoder step, an alignment model calculates the attention score using the current decoder state and all of the attention hidden vectors as input. An alignment model is another neural network model that is trained jointly with the seq2seq model used to calculate how well an input, represented by the hidden state, matches with the previous output, represented by attention hidden state. A softmax function is then applied to the attention score to get the attention weight. In some models, the encoder states are directly fed into an activation function, removing the need for alignment model. An activation function receives one decoder state and one encoder state and returns a scalar value of their relevance. Consider the seq2seq language English-to-French translation task. To be concrete, let us consider the translation of "the zone of international control ", which should translate to "la zone de contrôle international ". Here, we use the special token as a control character to delimit the end of input for both the encoder and the decoder. An input sequence of text x 0 , x 1 , … {\displaystyle x_{0},x_{1},\dots } is processed by a neural network (which can be an LSTM, a Transformer encoder, or some other network) into a sequence of real-valued vectors h 0 , h 1 , … {\displaystyle h_{0},h_{1},\dots } , where h {\displaystyle h} stands for "hidden vector". After the encoder has finished processing, the decoder starts operating over the hidden vectors, to produce an output sequence y 0 , y 1 , … {\displaystyle y_{0},y_{1},\dots } , autoregressively. That is, it always takes as input both the hidden vectors produced by the encoder, and what the decoder itself has produced before, to produce the next output word: ( h 0 , h 1 , … {\displaystyle h_{0},h_{1},\dots } , "") → "la" ( h 0 , h 1 , … {\displaystyle h_{0},h_{1},\dots } , " la") → "la zone" ( h 0 , h 1 , … {\displaystyle h_{0},h_{1},\dots } , " la zone") → "la zone de" ... ( h 0 , h 1 , … {\displaystyle h_{0},h_{1},\dots } , " la zone de contrôle international") → "la zone de contrôle international " Here, we use the special token as a control character to delimit the start of input for the decoder. The decoding terminates as soon as "" appears in the decoder output. ==

    Read more →
  • Tropical cryptography

    Tropical cryptography

    In tropical analysis, tropical cryptography refers to the study of a class of cryptographic protocols built upon tropical algebras. In many cases, tropical cryptographic schemes have arisen from adapting classical (non-tropical) schemes to instead rely on tropical algebras. The case for the use of tropical algebras in cryptography rests on at least two key features of tropical mathematics: in the tropical world, there is no classical multiplication (a computationally expensive operation), and the problem of solving systems of tropical polynomial equations has been shown to be NP-hard. == Basic Definitions == The key mathematical object at the heart of tropical cryptography is the tropical semiring ( R ∪ { ∞ } , ⊕ , ⊗ ) {\displaystyle (\mathbb {R} \cup \{\infty \},\oplus ,\otimes )} (also known as the min-plus algebra), or a generalization thereof. The operations are defined as follows for x , y ∈ R ∪ { ∞ } {\displaystyle x,y\in \mathbb {R} \cup \{\infty \}} : x ⊕ y = min { x , y } {\displaystyle x\oplus y=\min\{x,y\}} x ⊗ y = x + y {\displaystyle x\otimes y=x+y} It is easily verified that with ∞ {\displaystyle \infty } as the additive identity, these binary operations on R ∪ { ∞ } {\displaystyle \mathbb {R} \cup \{\infty \}} form a semiring.

    Read more →
  • Cipher device

    Cipher device

    A cipher device was a term used by the US military in the first half of the 20th century to describe a manually operated cipher equipment that converted the plaintext into ciphertext or vice versa. A similar term, cipher machine, was used to describe the cipher equipment that required external power for operation. Cipher box or crypto box is a physical cryptographic device used to encrypt and decrypt messages between plaintext (unencrypted) and ciphertext (encrypted or secret) forms. The ciphertext is suitable for transmission over a channel, such as radio, that might be observed by an adversary the communicating parties wish to conceal the plaintext from.

    Read more →