Tf–idf

In information retrieval, tf–idf (term frequency–inverse document frequency, also written TFIDF, TF–IDF, or Tf–idf) is a measure of the importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general. Like the bag-of-words model, it models a document as a multiset of words, without word order. It refines the simple bag-of-words model by allowing the weight of words to depend on the rest of the corpus.

It was often used as a weighting factor in information retrieval searches, text mining, and user modeling. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries used tf–idf. Variations of the tf–idf weighting scheme were often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.

== Motivations ==

Karen Spärck Jones (1972) conceived a statistical interpretation of term-specificity called Inverse Document Frequency (idf), which became a cornerstone of term weighting: The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.

For example, consider the df (document frequency) and idf of some words in Shakespeare's 37 plays. "Romeo", "Falstaff", and "salad" appear in very few plays, so on seeing one of these words, one could get a good idea as to which play it might be. In contrast, "good" and "sweet" appear in every play and are completely uninformative as to which play it is.

== Definition ==

The tf–idf is the product of two statistics, term frequency and inverse document frequency. There are various ways for determining the exact values of both statistics.
Tf–idf is a formula that aims to define the importance of a keyword or phrase within a document or a web page.

=== Term frequency ===

Term frequency, tf(t,d), is the relative frequency of term t within document d,

$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}},$$

where $f_{t,d}$ is the raw count of a term in a document, i.e., the number of times that term t occurs in document d. Note that the denominator is simply the total number of terms in document d (counting each occurrence of the same term separately). There are various other ways to define term frequency:

* the raw count itself: $\mathrm{tf}(t,d) = f_{t,d}$;
* Boolean "frequencies": $\mathrm{tf}(t,d) = 1$ if t occurs in d and 0 otherwise;
* logarithmically scaled frequency: $\mathrm{tf}(t,d) = \log(1 + f_{t,d})$;
* augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the raw frequency of the most frequently occurring term in the document: $\mathrm{tf}(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{f_{t',d} : t' \in d\}}$.

=== Inverse document frequency ===

The inverse document frequency is a measure of how much information the word provides, i.e., how common or rare it is across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

$$\mathrm{idf}(t,D) = \log \frac{N}{n_t}$$

with

* $D$: the set of all documents in the corpus;
* $N = |D|$: total number of documents in the corpus;
* $n_t = |\{d \in D : t \in d\}|$: number of documents where the term $t$ appears (i.e., $\mathrm{tf}(t,d) \neq 0$).
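The term-frequency variants and the basic idf defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the toy corpus, the whitespace tokenization, and all function names are hypothetical and do not come from the text.

```python
import math
from collections import Counter

# Hypothetical toy corpus (an assumption for illustration):
# each document is a list of whitespace-separated tokens.
corpus = {
    "d1": "this is a a sample".split(),
    "d2": "this is another another example example example".split(),
}

def raw_count(term, doc):
    """f_{t,d}: number of times the term occurs in the document."""
    return Counter(doc)[term]

def tf_relative(term, doc):
    """Raw count divided by the total number of terms in the document."""
    return raw_count(term, doc) / len(doc)

def tf_boolean(term, doc):
    """1 if the term occurs in the document, 0 otherwise."""
    return 1 if term in doc else 0

def tf_log(term, doc):
    """Logarithmically scaled frequency, log(1 + f_{t,d})."""
    return math.log(1 + raw_count(term, doc))

def tf_augmented(term, doc):
    """0.5 + 0.5 * f_{t,d} / (count of the most frequent term in the document)."""
    max_count = max(Counter(doc).values())
    return 0.5 + 0.5 * raw_count(term, doc) / max_count

def idf(term, docs):
    """log(N / n_t); raises ZeroDivisionError if no document contains the term."""
    n_t = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_t)

docs = list(corpus.values())
print(tf_relative("a", corpus["d1"]))      # 2 occurrences out of 5 tokens
print(tf_augmented("this", corpus["d1"]))  # 0.5 + 0.5 * 1/2
print(idf("example", docs))                # log(2/1): appears in one of two documents
print(idf("this", docs))                   # log(2/2) = 0: appears in every document
```

The natural logarithm is used here; a different base only rescales every idf value by the same constant factor.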
If the term is not in the corpus, this will lead to a division by zero. It is therefore common to adjust the numerator to $1 + N$ and the denominator to $1 + |\{d \in D : t \in d\}|$.

=== Term frequency–inverse document frequency ===

Then tf–idf is calculated as

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D).$$

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf–idf closer to 0.

== Justification of idf ==

Idf was introduced as "term specificity" by Karen Spärck Jones in a 1972 paper. Although it has worked well as a heuristic, its theoretical foundations have been troublesome for at least three decades afterward, with many researchers trying to find information theoretic justifications for it.

Spärck Jones's own explanation did not propose much theory, aside from a connection to Zipf's law. Attempts have been made to put idf on a probabilistic footing, by estimating the probability that a given document d contains a term t as the relative document frequency,

$$P(t|D) = \frac{|\{d \in D : t \in d\}|}{N},$$

so that we can define idf as

$$\begin{aligned}\mathrm{idf} &= -\log P(t|D) \\ &= \log \frac{1}{P(t|D)} \\ &= \log \frac{N}{|\{d \in D : t \in d\}|}\end{aligned}$$

Namely, the inverse document frequency is the logarithm of "inverse" relative document frequency.
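The product formula, the smoothed idf, and the probabilistic reading of idf can be checked numerically with a short sketch. The three-document corpus, the function names, and the `idf_fn` parameter below are assumptions made for illustration only; the smoothed variant simply applies the $1 + N$ and $1 + n_t$ adjustment described above.

```python
import math
from collections import Counter

# Hypothetical three-document corpus (an assumption for illustration).
docs = [
    "the good romeo".split(),
    "the good falstaff".split(),
    "the sweet salad salad".split(),
]

def tf(term, doc):
    """Relative term frequency: raw count over document length."""
    return Counter(doc)[term] / len(doc)

def doc_count(term, docs):
    """n_t: number of documents containing the term."""
    return sum(1 for d in docs if term in d)

def idf(term, docs):
    """Plain idf, log(N / n_t); undefined for terms absent from the corpus."""
    return math.log(len(docs) / doc_count(term, docs))

def idf_smoothed(term, docs):
    """Adjusted idf, log((1 + N) / (1 + n_t)); finite even for unseen terms."""
    return math.log((1 + len(docs)) / (1 + doc_count(term, docs)))

def tfidf(term, doc, docs, idf_fn=idf):
    """tfidf(t, d, D) = tf(t, d) * idf(t, D)."""
    return tf(term, doc) * idf_fn(term, docs)

def doc_probability(term, docs):
    """P(t|D): relative document frequency of the term."""
    return doc_count(term, docs) / len(docs)

# A term found in every document gets weight 0; a rare, repeated term scores high.
print(tfidf("the", docs[0], docs))
print(tfidf("salad", docs[2], docs))

# idf equals -log P(t|D), up to floating-point rounding.
print(math.isclose(idf("romeo", docs), -math.log(doc_probability("romeo", docs))))

# The smoothed variant stays finite for a term that occurs nowhere in the corpus.
print(idf_smoothed("hamlet", docs))
```

Passing the idf variant as a parameter keeps the product formula unchanged while letting the two definitions be compared on the same corpus.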
This probabilistic interpretation in turn takes the same form as that of self-information. However, applying such information-theoretic notions to problems in information retrieval leads to difficulties when trying to define the appropriate event spaces for the required probability distributions: not only documents but also queries and terms need to be taken into account.

== Link with information theory ==

Both term frequency and inverse document frequency can be formulated in terms of information theory; this helps to understand why their product has a meaning in terms of the joint informational content of a document. A characteristic assumption about the distribution $p(d,t)$ is that

$$p(d|t) = \frac{1}{|\{d \in D : t \in d\}|}.$$

This assumption and its implications, according to Aizawa, "represent the heuristic that tf–idf employs."

The conditional entropy of a "randomly chosen" document in the corpus $D$, conditional on the fact that it contains a specific term $t$ (and assuming that all documents have equal probability of being chosen), is

$$H(\mathcal{D} \mid \mathcal{T} = t) = -\sum_{d} p_{d|t} \log p_{d|t} = -\log \frac{1}{|\{d \in D : t \in d\}|} = \log \frac{|\{d \in D : t \in d\}|}{|D|} + \log |D| = -\mathrm{idf}(t) + \log |D|.$$

In terms of notation, $\mathcal{D}$ and $\mathcal{T}$ are "random variables" corresponding, respectively, to drawing a document or a term.
The mutual information can be expressed as

$$M(\mathcal{T};\mathcal{D}) = H(\mathcal{D}) - H(\mathcal{D} \mid \mathcal{T}) = \sum_{t} p_t \cdot \bigl(H(\mathcal{D}) - H(\mathcal{D} \mid \mathcal{T} = t)\bigr) = \sum_{t} p_t \cdot \mathrm{idf}(t).$$

The last step is to expand $p_t$, the unconditional probability of drawing a term, with respect to the (random) choice of a document, to obtain

$$M(\mathcal{T};\mathcal{D}) = \sum_{t,d} p_{t|d} \cdot p_d \cdot \mathrm{idf}(t) = \sum_{t,d} \mathrm{tf}(t,d) \cdot \frac{1}{|D|} \cdot \mathrm{idf}(t) = \frac{1}{|D|} \sum_{t,d} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t).$$

This expression shows that summing the tf–idf of all possible terms and documents recovers the mutual information between documents and terms, taking into account all the specificities of their joint distribution. Each tf–idf hence carries the "bit of information" attached to a term-document pair.

== Link with statistical theory ==

Tf–idf is closely related to the negative logarithmically transformed p-value from a one-tailed formulation of Fisher's exact test when the underlying corpus documents satisfy certain idealized assumptions. More recently, tf–idf variants were shown to arise as components in the test st

Semantic analytics

Semantic analytics, also termed semantic relatedness, is the use of ontologies to analyze content in web resources. This field of research combines text analytics and Semantic Web technologies like RDF. Semantic analytics measures the relatedness of different ontological concepts. Academic research groups with active projects in this area include the Kno.e.sis Center at Wright State University, among others.

== History ==

An important milestone in the beginning of semantic analytics occurred in 1996, although the historical progression of these algorithms is largely subjective. In his seminal publication, Philip Resnik established that computers have the capacity to emulate human judgement. Across publications in multiple journals, improvements to the accuracy of general semantic analytic computations all claimed to revolutionize the field. However, the lack of a standard terminology throughout the late 1990s caused much miscommunication. This prompted Budanitsky & Hirst to standardize the subject in 2006 with a summary that also set a framework for modern spelling and grammar analysis.

In the early days of semantic analytics, obtaining sufficiently large and reliable knowledge bases was difficult. In 2006, Strube & Ponzetto demonstrated that Wikipedia could be used in semantic analytic calculations. The use of a large knowledge base like Wikipedia allows for an increase in both the accuracy and applicability of semantic analytics.

== Methods ==

Given the subjective nature of the field, the methods used in semantic analytics depend on the domain of application. No single method is considered correct; however, one of the most generally effective and applicable methods is explicit semantic analysis (ESA). ESA was developed by Evgeniy Gabrilovich and Shaul Markovitch in the late 2000s. It uses machine learning techniques to create a semantic interpreter, which extracts text fragments from articles into a sorted list.
The fragments are sorted by how related they are to the surrounding text. Latent semantic analysis (LSA) is another common method that does not use ontologies, only considering the text in the input space.

== Applications ==

* Entity linking
* Ontology building / knowledge base population
* Search and query tasks
* Natural language processing
* Spoken dialog systems (e.g., Amazon Alexa, Google Assistant, Microsoft's Cortana)
* Artificial intelligence
* Knowledge management

The application of semantic analysis methods generally streamlines organizational processes of any knowledge management system. Academic libraries often use a domain-specific application to create a more efficient organizational system. By classifying scientific publications using semantics and Wikipedia, researchers are helping people find resources faster. Search engines like Semantic Scholar provide organized access to millions of articles.

Supercomputer operating system

A supercomputer operating system is an operating system intended for supercomputers. Since the end of the 20th century, supercomputer operating systems have undergone major transformations, as fundamental changes have occurred in supercomputer architecture. While early operating systems were custom tailored to each supercomputer to gain speed, the trend has been moving away from in-house operating systems and toward some form of Linux, which ran all the supercomputers on the TOP500 list in November 2017. In 2021, the top 10 computers ran, for instance, Red Hat Enterprise Linux (RHEL) or a variant of it, or another Linux distribution such as Ubuntu.

Given that modern massively parallel supercomputers typically separate computations from other services by using multiple types of nodes, they usually run different operating systems on different nodes, e.g., using a small and efficient lightweight kernel such as Compute Node Kernel (CNK) or Compute Node Linux (CNL) on compute nodes, but a larger system such as a Linux distribution on server and input/output (I/O) nodes.

While in a traditional multi-user computer system job scheduling is in effect a tasking problem for processing and peripheral resources, in a massively parallel system the job management system needs to manage the allocation of both computational and communication resources, and to deal gracefully with the inevitable hardware failures that occur when tens of thousands of processors are present.

Although most modern supercomputers use the Linux operating system, each manufacturer has made its own specific changes to the Linux distribution it uses, and no industry standard exists, partly because the differences in hardware architectures require changes to optimize the operating system for each hardware design.

== Context and overview ==

In the early days of supercomputing, the basic architectural concepts were evolving rapidly, and system software had to follow hardware innovations that usually took rapid turns.
In the early systems, operating systems were custom tailored to each supercomputer to gain speed, yet in the rush to develop them, serious software quality challenges surfaced, and in many cases the cost and complexity of system software development became as much an issue as that of hardware. In the 1980s, the cost of software development at Cray came to equal what the company spent on hardware, and that trend was partly responsible for a move away from in-house operating systems to the adaptation of generic software. The first wave of operating system changes came in the mid-1980s, as vendor-specific operating systems were abandoned in favor of Unix. Despite early skepticism, this transition proved successful.

By the early 1990s, major changes were occurring in supercomputing system software. By this time, the growing use of Unix had begun to change the way system software was viewed. The use of a high-level language (C) to implement the operating system, and the reliance on standardized interfaces, were in contrast to the assembly-language-oriented approaches of the past. As hardware vendors adapted Unix to their systems, new and useful features were added to Unix, e.g., fast file systems and tunable process schedulers. However, all the companies that adapted Unix made unique changes to it, rather than collaborating on an industry standard to create "Unix for supercomputers". This was partly because differences in their architectures required these changes to optimize Unix for each architecture.

As general-purpose operating systems became stable, supercomputers began to borrow and adapt critical system code from them, and relied on the rich set of secondary functions that came with them. However, at the same time the size of the code for general-purpose operating systems was growing rapidly. By the time Unix-based code had reached 500,000 lines, its maintenance and use were a challenge.
This resulted in the move to use microkernels which used a minimal set of the operating system functions. Systems such as Mach at Carnegie Mellon University and ChorusOS at INRIA were examples of early microkernels. The separation of the operating system into separate components became necessary as supercomputers developed different types of nodes, e.g., compute nodes versus I/O nodes. Thus modern supercomputers usually run different operating systems on different nodes, e.g., using a small and efficient lightweight kernel such as CNK or CNL on compute nodes, but a larger system such as a Linux-derivative on server and I/O nodes. == Early systems == The CDC 6600, generally considered the first supercomputer in the world, ran the Chippewa Operating System, which was then deployed on various other CDC 6000 series computers. The Chippewa was a rather simple job control oriented system derived from the earlier CDC 3000, but it influenced the later KRONOS and SCOPE systems. The first Cray-1 was delivered to the Los Alamos Lab with no operating system, or any other software. Los Alamos developed the application software for it, and the operating system. The main timesharing system for the Cray 1, the Cray Time Sharing System (CTSS), was then developed at the Livermore Labs as a direct descendant of the Livermore Time Sharing System (LTSS) for the CDC 6600 operating system from twenty years earlier. In developing supercomputers, rising software costs soon became dominant, as evidenced by the 1980s cost for software development at Cray growing to equal their cost for hardware. That trend was partly responsible for a move away from the in-house Cray Operating System to UNICOS system based on Unix. In 1985, the Cray-2 was the first system to ship with the UNICOS operating system. Around the same time, the EOS operating system was developed by ETA Systems for use in their ETA10 supercomputers. 
Written in Cybil, a Pascal-like language from Control Data Corporation, EOS highlighted the difficulty of developing stable operating systems for supercomputers, and eventually a Unix-like system was offered on the same machine. The lessons learned from developing ETA system software included the high level of risk associated with developing a new supercomputer operating system, and the advantages of using Unix with its large extant base of system software libraries.

By the middle 1990s, despite the extant investment in older operating systems, the trend was toward the use of Unix-based systems, which also facilitated the use of interactive graphical user interfaces (GUIs) for scientific computing across multiple platforms. The move toward a commodity OS had opponents, who cited the fast pace and focus of Linux development as a major obstacle to adoption. As one author wrote, "Linux will likely catch up, but we have large-scale systems now". Nevertheless, that trend continued to gain momentum, and by 2005 virtually all supercomputers used some Unix-like OS. These variants of Unix included IBM AIX, the open source Linux system, and other adaptations such as UNICOS from Cray. By the end of the 20th century, Linux was estimated to command the highest share of the supercomputing pie.

== Modern approaches ==

The IBM Blue Gene supercomputer uses the CNK operating system on the compute nodes, but uses a modified Linux-based kernel called I/O Node Kernel (INK) on the I/O nodes. CNK is a lightweight kernel that runs on each node and supports a single application running for a single user on that node. For the sake of efficient operation, the design of CNK was kept simple and minimal, with physical memory being statically mapped and the CNK neither needing nor providing scheduling or context switching. CNK does not even implement file I/O on the compute node, but delegates that to dedicated I/O nodes.
However, given that on the Blue Gene multiple compute nodes share a single I/O node, the I/O node operating system does require multi-tasking, hence the selection of the Linux-based operating system. While in traditional multi-user computer systems and early supercomputers, job scheduling was in effect a task scheduling problem for processing and peripheral resources, in a massively parallel system, the job management system needs to manage the allocation of both computational and communication resources. It is essential to tune task scheduling, and the operating system, in different configurations of a supercomputer. A typical parallel job scheduler has a master scheduler which instructs some number of slave schedulers to launch, monitor, and control parallel jobs, and periodically receives reports from them about the status of job progress. Some, but not all supercomputer schedulers attempt to maintain locality of job execution. The PBS Pro scheduler used on the Cray XT3 and Cray XT4 systems does not attempt to optimize locality on its three-dimensional torus interconnect, but simply uses the first available processor. On the other hand, IBM's scheduler on the Blue Gene supercomputers aims to exploit locality a

Blue check

A blue check is used on social media platforms, notably X (formerly known as Twitter), to indicate the authenticity of an account. Since November 2022, Twitter users whose accounts are at least 90 days old and have a verified phone number receive verification upon subscribing to X Premium or Verified Organizations; this status persists as long as the subscription remains active. When introduced in June 2009, the system provided the site's readers with a means to distinguish genuine notable account holders, such as celebrities and organizations, from impostors or parodies. Until November 2022, a blue checkmark displayed against an account name indicated that Twitter had taken steps to ensure that the account was actually owned by the person or organization whom it claimed to represent. The checkmark does not imply endorsement from Twitter, and does not mean that tweets from a verified account are necessarily accurate or truthful in any way. People with verified accounts on Twitter are often colloquially referred to as "blue checks" on social media and by reporters. In November 2022, the verification program was modified heavily by new owner Elon Musk, extending verification to any account with a verified phone number and an active subscription to an eligible X Premium (formerly Twitter Blue) plan. These changes faced criticism from users and the media, who believed that the changes would ease impersonation, and allow accounts spreading misleading information to feign credibility. In a related change, Twitter introduced additional gold and gray checkmarks, used by Verified Organizations and government-affiliated accounts, respectively. Twitter claims that the changes to verification are required to "reduce fraudulent accounts and bots". Twitter users who had been verified through the previous system were known as "legacy verified" accounts; legacy verification was deprecated in April 2023, and stripped from accounts who do not meet the new payment requirements. 
Musk later implied that he had been personally paying for the X Premium subscriptions of several notable celebrities. == Until November 2022 == In June 2009, after being criticized by Kanye West and sued by Tony La Russa over unauthorized accounts run by impersonators, the company launched their "Verified Accounts" program. Twitter stated that an account with a "blue tick" verification badge indicates "we've been in contact with the person or entity the account is representing and verified that it is approved". After the beta period, the company stated in their FAQ that it "proactively verifies accounts on an ongoing basis to make it easier for users to find who they're looking for" and that they "do not accept requests for verification from the general public". Originally, Twitter took on the responsibility of reaching out to celebrities and other notable people to confirm their identities in order to establish a verified account. In July 2016, Twitter announced a public application process to grant verified status to an account "if it is determined to be of public interest" and that verification "does not imply an endorsement". In 2016, the company began accepting requests for verification, but it was discontinued the same year. Twitter explained that the volume of requests for verified accounts had exceeded its ability to cope; rather, Twitter determines on its own whom to approach about verified accounts, limiting verification to accounts which are "authentic, notable, and active". In November 2020, Twitter announced a relaunch of its verification system in 2021. According to the new policy, Twitter verifies six different types of accounts; for three of them (companies, brands, and influential individuals like activists), the existence of a Wikipedia page will be one criterion for showing that the account has "Off Twitter Notability". 
=== Controversy === On June 21, 2014, actor William Shatner raised an issue with several Engadget editorial staff and their verification status on Twitter. Besides the site's social media editor, John Colucci, Shatner also targeted several junior members of the staff for being "nobodies", unlike some of his actor colleagues who did not bear such distinction. Shatner claimed Colucci and the team were bullying him when giving a text interview to Mashable. Over a month later, Shatner continued to discuss the issue on his Tumblr page, to which Engadget replied by defending its team and discussing the controversy surrounding the social media verification. Twitter's practice and process for verifying accounts came under scrutiny again in 2017 after the company verified the account of white supremacist and far-right political activist, Jason Kessler. Many who criticized Twitter's decision to verify Kessler's account saw this as a political act on the company's behalf. In response, Twitter put its verification process on hold. The company tweeted, "Verification was meant to authenticate identity & voice but it is interpreted as an endorsement or an indicator of importance. We recognize that we have created this confusion and need to resolve it. We have paused all general verifications while we work and will report back soon." As of November 2017, Twitter continued to deny verification of Julian Assange's account following his requests. In November 2019, Dalit activists of India alleged that higher-caste people get Twitter verification easily and trended hashtags #CancelAllBlueTicksInIndia and #CasteistTwitter. Critics have said that the company's verification process is not transparent and causes digital marginalisation of already marginalised communities. Twitter India rejected the allegations, calling them "impartial" and working on a "case-by-case" policy. 
== Since November 2022 == On April 20, 2023, Twitter (known as X since July 2023) began removing verification status for users of public interest, causing a controversy among Twitter users. The website's system was altered, allowing any individual to receive verification for a monthly fee, an act which saw significant criticism. Following the acquisition of Twitter by Elon Musk on October 28, 2022, Musk told Twitter employees to introduce paid verification by November 7 through Twitter Blue. The Verge reported that the updated Blue subscription would cost $19.99 per month, and users would lose their verification status if they did not join within 90 days. Following backlash, Musk tweeted, in response to author Stephen King, a lowered $8 price on November 1, 2022. Twitter confirmed the new price of $7.99 per month on November 5, 2022. The new verification system began rollout on November 9, 2022, a day after the 2022 United States elections. The decision to delay its rollout was to address concerns about users potentially spreading misinformation about voting results by posing as news outlets and lawmakers. At the same time, Twitter introduced a secondary gray "Official" label on some high-profile accounts, but removed them hours after launch. Less than 48 hours later, Twitter reinstated the gray "Official" label, after multiple users were suspended for deliberately impersonating reporters and high-profile athletes like LeBron James. A viral tweet from an account purporting to be the pharmaceutical company Eli Lilly and Company caused the company's stock to fall after announcing "insulin is free now". As a result, Twitter disabled new Blue subscriptions on November 11, 2022. === Announcement === In October 2022, Casey Newton of Platformer reported that executives at Twitter began discussing the possibility of users being forced to pay for Twitter Blue in order to keep their verification status. 
Musk publicly announced that verification was "being revamped right now" after Newton's article; according to The Verge, Twitter planned to increase the price of Twitter Blue from US$4.99 per month to US$19.99 per month. Users would have had 90 days to subscribe or face losing their verification status, and employees were told to implement paid verification by November 9 or risk getting fired. Upon the news that Twitter Blue would cost US$19.99 per month, author Stephen King expressed displeasure towards Twitter and stated that he would leave. Musk, replying to King's tweet, proposed that the service should cost US$7.99 instead. In a separate tweet, Musk wrote that Twitter Blue subscribers would receive priority in replies, mentions, and search, fewer advertisements, and longer audio and video. Although paid verification was expected to be launched by November 7, the reintroduction of Twitter Blue was delayed until after the 2022 United States elections on November 9, according to a memo obtained by The New York Times. The announcement of paid verification resulted in several accounts facetiously impersonating Musk, such as those of comedians Kathy Griffin and Sarah Silverman, being suspended. In response, Musk announced that impersonators using Twitter Blue "will be permanently suspended". An "official

Digital first

Digital first is a communication theory that publishers should release content into new media channels in preference to old media. The premise behind the theory is that after the advent of the Internet, most established media organizations continued to give priority to traditional media. Over time, those organizations faced a choice to publish first either in digital media or in traditional media. A "digital first" decision occurs when a publisher chooses to distribute information online in preference to or at the expense of traditional media like print publishing.

Many employers and employees find it challenging to imagine using digital first practices. Distributing content digital first introduces new practices, including a need to manage the data which tracks readership. Many paper print publishers feel intimidated by the idea of publishing content online before publishing it in paper media. Comedian John Oliver, in the show Last Week Tonight, criticized digital first practices as a cause of lower standards in journalism.

== Digital-First Transformation in Business and Education ==

The classical perspective of an information system is that it represents and reflects physical reality. However, it is increasingly evident that digital technologies not only represent reality but also actively shape it, as, in many instances, the digital version is created first, and the physical version follows. Gradually, digital infrastructures are integrated into people's work and life, shaping a digital environment through technologies such as 5G, sensors, and blockchain. The Digital First Framework, developed by Professor Youngjin Yoo, is a conceptual approach that helps physical companies integrate digital technologies into the core of product and service design.
The shift from traditional cars, where the physical vehicle precedes its digital representation on Google maps, to autonomous vehicles, where the digital representation (the blue dot) is created first, emphasizes the digital-first mindset in the design and operation of systems. In today's business environment, it's critical for organizations to embrace a digital-first strategy. Companies built on digital platforms will significantly diverge from traditional, hierarchical business structures that typically focus on a single product or market. These digitally-centered enterprises will offer products and services that are tailored to individual requirements, utilizing algorithms to assess needs based on specific situations, and relying on external partners to provide these solutions. This highlights the need to transform traditional R&D practices. It's essential for R&D teams to move beyond their laboratories and immerse themselves in the environments of their users. Understanding the context of use is fundamental to creating a relevant platform. As an illustration, the concept of Digital-first, as defined by Rohm et al. (2019), involves the integration of digital projects within educational courses, exemplified by institutions like M-School. The program adopts a programmatic approach, where successive courses progressively build upon one another, adopting an all-encompassing perspective that regards all aspects of marketing as inherently digital. Students actively participate in real-world projects, including campaigns for community improvement, and are tasked with generating content for diverse platforms. Through hands-on collaboration with live clients and the utilization of tools such as Google AdWords and Facebook Advertising, students acquire practical experience in the realms of digital marketing and analytics. == vBook == A vBook is an eBook that is digital first media with embedded video, images, graphs, tables, text, and other media.

Symbaloo

Symbaloo is a cloud-based site that allows users to organize and categorize web links in the form of buttons. Symbaloo works from a web browser and can be configured as a homepage, allowing users to create a personalized virtual desktop accessible from any device with an Internet connection. Symbaloo users, who must register first, have a page with a grid of buttons, each of which can be configured to link to a specific page. The site allows users to assign different colors to the buttons for easy visual classification. Symbaloo allows a single user to create different pages or screens with buttons. These screens, called webmixes, are useful for separating topics, and they can be shared with other users by making them public and sending the link via email. As of 2015, Symbaloo had 6 million users worldwide and was mainly used as an online education resource. Symbaloo's slogan is "Start Simple".

BitClout

BitClout was an open source blockchain-based social media platform. On the platform, users could post short-form writings and photos, award money to posts they particularly liked by clicking a diamond icon, and buy and sell "creator coins" (personalized tokens whose value depends on people's reputations). BitClout ran on a custom proof-of-work blockchain, and was a prototype of what can be built on DeSo (short for "Decentralized Social"). BitClout's founder and primary leader is Nader al-Naji, known pseudonymously as "Diamondhands". Under development since 2019, BitClout's blockchain created its first block in January 2021, and BitClout itself launched publicly in March 2021. The platform launched with 15,000 "reserved" accounts, a move intended to prevent impersonation that backfired when some people with reserved accounts tried to actively distance themselves from the platform. Later, in September 2021, BitClout was revealed to be the flagship product of the DeSo blockchain. == History == === Origins (2019 - March 2021) === In early 2019, Nader al-Naji became interested in "mixing investing and social media". He started creating a custom blockchain in May 2019, but did not tell anyone else until November 2020. However, in the fall of 2020, al-Naji pitched BitClout's own investors under his real name and began posting job listings for a "new operation". Although BitClout was not originally intended to launch until mid-2021, its development was sped up due to "zeitgeist about decentralized social media" in January 2021. BitClout's first block was mined on 18 January 2021. Its next block was mined on 1 March 2021. === As BitClout (March - September 2021) === In early March 2021, about fifty investors received links to a password-protected website with the BitClout white paper. They were encouraged to explore the site and send the same link to "two or three other 'trusted contacts'". Within weeks, users were spending millions of dollars per day on the platform.
The platform's founders said they were "completely unprepared", having planned to have a "soft-launch". The leader went by the name "diamondhands" on the platform. On 24 March 2021, BitClout launched out of private beta. Investors include Sequoia Capital, Andreessen Horowitz, the venture capital firm Social Capital, Coinbase Ventures, Winklevoss Capital Management, Alexis Ohanian, Polychain, Pantera, and Digital Currency Group (CoinDesk's parent company). During its initial launch, BitClout's currency could be bought with bitcoin, but not sold except on Discord servers or Twitter threads. A single bitcoin wallet related to BitClout received more than $165M worth of deposits. In March 2021, law firm Anderson Kill P.C. sent Nader al-Naji, the presumed leader of the BitClout platform, a cease-and-desist letter, demanding the removal of Brandon Curtis's account and alleging that BitClout violated sections 1798 and 3344 of the California Civil Code by using Curtis's name and likeness without his consent. Curtis also tweeted, "Adopting Bitcoin's aesthetic to raise VC funding to carry out unethical and blatantly illegal schemes like BitClout: not cool". (However, Curtis's coin, despite not being listed on the official website, can still be bought by users searching for the original username.) Additionally, in April 2021, Lee Hsien Loong asked for his name and photograph to be removed from the site, stating that he has "nothing to do with the platform" and that "it is misleading and done without [his] permission". On 18 May 2021, diamondhands announced that 100% of the BitClout code went public. On 12 June 2021, the supply of BitClout was capped at around 11 million coins. On 18 July 2021, BitClout added the ability for users to mint and purchase NFTs within the platform. === As part of DeSo (September 2021 - July 2024) === On 21 September 2021, it was revealed that BitClout was a prototype built on DeSo, short for "Decentralized Social". 
As a part of this revelation, diamondhands confirmed his identity as Nader al-Naji. (As early as April 2021, it had been believed that diamondhands was indeed that person.) The BitClout project raised $200M in funding, which went to setting up the DeSo Foundation. === End and aftermath (July 2024 - present) === In July 2024, al-Naji was arrested by the FBI and charged with wire fraud involving BitClout. He also faced civil charges of securities fraud and unregistered offers and sales of securities from the Securities and Exchange Commission. In response, the official "deso" account posted that al-Naji was "safe and at home" and "that this experience has only reinforced [his] commitment to DeSo". In February 2025, the Justice Department dropped its case against al-Naji. In March 2026, the SEC voluntarily dismissed the enforcement case with prejudice. == Design == BitClout is a social media platform. Its users can post short-form writings and photos (similarly to Twitter). They can award money to posts they particularly like by clicking a diamond icon (similarly to Twitch Bits). The price of each account's "creator coin" goes up and down with the popularity of the celebrity behind it. For example, if someone says something negative, the value of their corresponding coin may go down. This price is computed automatically according to the formula $price\_in\_bitclout = 0.003 \cdot creator\_coins\_in\_circulation^{2}$. At launch time, BitClout scraped 15,000 profiles of celebrities from Twitter to create "reserved" accounts in their names. To claim a reserved account, the account holder would need to tweet about it (which also served as a marketing strategy). At least 80 such reserved profiles have been claimed. Proof of stake was introduced in March 2024.
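As a rough illustration of the creator coin price formula given in the Design section, the sketch below evaluates price = 0.003 × (coins in circulation)². The function name `creator_coin_price` is hypothetical and chosen only for this example; the sketch assumes the formula applies exactly as stated and ignores any fees or other platform mechanics, which the text does not describe.

```python
def creator_coin_price(coins_in_circulation: float) -> float:
    """Return the price of one creator coin, denominated in BitClout.

    Assumption: this applies the quadratic formula quoted in the article,
    price = 0.003 * coins_in_circulation ** 2, and nothing else.
    """
    if coins_in_circulation < 0:
        raise ValueError("coins_in_circulation must be non-negative")
    return 0.003 * coins_in_circulation ** 2


if __name__ == "__main__":
    # Print the price at a few illustrative (made-up) supply levels.
    for supply in (1, 10, 100):
        print(supply, creator_coin_price(supply))
```

Because the price grows with the square of the supply, doubling the number of coins in circulation quadruples the price under this formula.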