AI Code Base

AI Code Base — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Bayesian programming

    Bayesian programming

    Bayesian programming is a formalism and a methodology for having a technique to specify probabilistic models and solve problems when less than the necessary information is available. Edwin T. Jaynes proposed that probability could be considered as an alternative and an extension of logic for rational reasoning with incomplete and uncertain information. In his founding book Probability Theory: The Logic of Science he developed this theory and proposed what he called "the robot," which was not a physical device, but an inference engine to automate probabilistic reasoning—a kind of Prolog for probability instead of logic. Bayesian programming is a formal and concrete implementation of this "robot". Bayesian programming may also be seen as an algebraic formalism to specify graphical models such as, for instance, Bayesian networks, dynamic Bayesian networks, Kalman filters or hidden Markov models. Indeed, Bayesian programming is more general than Bayesian networks and has a power of expression equivalent to probabilistic factor graphs. == Formalism == A Bayesian program is a means of specifying a family of probability distributions. The constituent elements of a Bayesian program are presented below: Program { Description { Specification ( π ) { Variables Decomposition Forms Identification (based on δ ) Question {\displaystyle {\text{Program}}{\begin{cases}{\text{Description}}{\begin{cases}{\text{Specification}}(\pi ){\begin{cases}{\text{Variables}}\\{\text{Decomposition}}\\{\text{Forms}}\\\end{cases}}\\{\text{Identification (based on }}\delta )\end{cases}}\\{\text{Question}}\end{cases}}} A program is constructed from a description and a question. A description is constructed using some specification ( π {\displaystyle \pi } ) as given by the programmer and an identification or learning process for the parameters not completely specified by the specification, using a data set ( δ {\displaystyle \delta } ). A specification is constructed from a set of pertinent variables, a decomposition and a set of forms. Forms are either parametric forms or questions to other Bayesian programs. A question specifies which probability distribution has to be computed. === Description === The purpose of a description is to specify an effective method of computing a joint probability distribution on a set of variables { X 1 , X 2 , ⋯ , X N } {\displaystyle \left\{X_{1},X_{2},\cdots ,X_{N}\right\}} given a set of experimental data δ {\displaystyle \delta } and some specification π {\displaystyle \pi } . This joint distribution is denoted as: P ( X 1 ∧ X 2 ∧ ⋯ ∧ X N ∣ δ ∧ π ) {\displaystyle P\left(X_{1}\wedge X_{2}\wedge \cdots \wedge X_{N}\mid \delta \wedge \pi \right)} . To specify preliminary knowledge π {\displaystyle \pi } , the programmer must undertake the following: Define the set of relevant variables { X 1 , X 2 , ⋯ , X N } {\displaystyle \left\{X_{1},X_{2},\cdots ,X_{N}\right\}} on which the joint distribution is defined. Decompose the joint distribution (break it into relevant independent or conditional probabilities). Define the forms of each of the distributions (e.g., for each variable, one of the list of probability distributions). ==== Decomposition ==== Given a partition of { X 1 , X 2 , … , X N } {\displaystyle \left\{X_{1},X_{2},\ldots ,X_{N}\right\}} containing K {\displaystyle K} subsets, K {\displaystyle K} variables are defined L 1 , ⋯ , L K {\displaystyle L_{1},\cdots ,L_{K}} , each corresponding to one of these subsets. Each variable L k {\displaystyle L_{k}} is obtained as the conjunction of the variables { X k 1 , X k 2 , ⋯ } {\displaystyle \left\{X_{k_{1}},X_{k_{2}},\cdots \right\}} belonging to the k t h {\displaystyle k^{th}} subset. Recursive application of Bayes' theorem leads to: P ( X 1 ∧ X 2 ∧ ⋯ ∧ X N ∣ δ ∧ π ) = P ( L 1 ∧ ⋯ ∧ L K ∣ δ ∧ π ) = P ( L 1 ∣ δ ∧ π ) × P ( L 2 ∣ L 1 ∧ δ ∧ π ) × ⋯ × P ( L K ∣ L K − 1 ∧ ⋯ ∧ L 1 ∧ δ ∧ π ) {\displaystyle {\begin{aligned}&P\left(X_{1}\wedge X_{2}\wedge \cdots \wedge X_{N}\mid \delta \wedge \pi \right)\\={}&P\left(L_{1}\wedge \cdots \wedge L_{K}\mid \delta \wedge \pi \right)\\={}&P\left(L_{1}\mid \delta \wedge \pi \right)\times P\left(L_{2}\mid L_{1}\wedge \delta \wedge \pi \right)\times \cdots \times P\left(L_{K}\mid L_{K-1}\wedge \cdots \wedge L_{1}\wedge \delta \wedge \pi \right)\end{aligned}}} Conditional independence hypotheses then allow further simplifications. A conditional independence hypothesis for variable L k {\displaystyle L_{k}} is defined by choosing some variable X n {\displaystyle X_{n}} among the variables appearing in the conjunction L k − 1 ∧ ⋯ ∧ L 2 ∧ L 1 {\displaystyle L_{k-1}\wedge \cdots \wedge L_{2}\wedge L_{1}} , labelling R k {\displaystyle R_{k}} as the conjunction of these chosen variables and setting: P ( L k ∣ L k − 1 ∧ ⋯ ∧ L 1 ∧ δ ∧ π ) = P ( L k ∣ R k ∧ δ ∧ π ) {\displaystyle P\left(L_{k}\mid L_{k-1}\wedge \cdots \wedge L_{1}\wedge \delta \wedge \pi \right)=P\left(L_{k}\mid R_{k}\wedge \delta \wedge \pi \right)} We then obtain: P ( X 1 ∧ X 2 ∧ ⋯ ∧ X N ∣ δ ∧ π ) = P ( L 1 ∣ δ ∧ π ) × P ( L 2 ∣ R 2 ∧ δ ∧ π ) × ⋯ × P ( L K ∣ R K ∧ δ ∧ π ) {\displaystyle {\begin{aligned}&P\left(X_{1}\wedge X_{2}\wedge \cdots \wedge X_{N}\mid \delta \wedge \pi \right)\\={}&P\left(L_{1}\mid \delta \wedge \pi \right)\times P\left(L_{2}\mid R_{2}\wedge \delta \wedge \pi \right)\times \cdots \times P\left(L_{K}\mid R_{K}\wedge \delta \wedge \pi \right)\end{aligned}}} Such a simplification of the joint distribution as a product of simpler distributions is called a decomposition, derived using the chain rule. This ensures that each variable appears at the most once on the left of a conditioning bar, which is the necessary and sufficient condition to write mathematically valid decompositions. ==== Forms ==== Each distribution P ( L k ∣ R k ∧ δ ∧ π ) {\displaystyle P\left(L_{k}\mid R_{k}\wedge \delta \wedge \pi \right)} appearing in the product is then associated with either a parametric form (i.e., a function f μ ( L k ) {\displaystyle f_{\mu }\left(L_{k}\right)} ) or a question to another Bayesian program P ( L k ∣ R k ∧ δ ∧ π ) = P ( L ∣ R ∧ δ ^ ∧ π ^ ) {\displaystyle P\left(L_{k}\mid R_{k}\wedge \delta \wedge \pi \right)=P\left(L\mid R\wedge {\widehat {\delta }}\wedge {\widehat {\pi }}\right)} . When it is a form f μ ( L k ) {\displaystyle f_{\mu }\left(L_{k}\right)} , in general, μ {\displaystyle \mu } is a vector of parameters that may depend on R k {\displaystyle R_{k}} or δ {\displaystyle \delta } or both. Learning takes place when some of these parameters are computed using the data set δ {\displaystyle \delta } . An important feature of Bayesian programming is this capacity to use questions to other Bayesian programs as components of the definition of a new Bayesian program. P ( L k ∣ R k ∧ δ ∧ π ) {\displaystyle P\left(L_{k}\mid R_{k}\wedge \delta \wedge \pi \right)} is obtained by some inferences done by another Bayesian program defined by the specifications π ^ {\displaystyle {\widehat {\pi }}} and the data δ ^ {\displaystyle {\widehat {\delta }}} . This is similar to calling a subroutine in classical programming and provides an easy way to build hierarchical models. === Question === Given a description (i.e., P ( X 1 ∧ X 2 ∧ ⋯ ∧ X N ∣ δ ∧ π ) {\displaystyle P\left(X_{1}\wedge X_{2}\wedge \cdots \wedge X_{N}\mid \delta \wedge \pi \right)} ), a question is obtained by partitioning { X 1 , X 2 , ⋯ , X N } {\displaystyle \left\{X_{1},X_{2},\cdots ,X_{N}\right\}} into three sets: the searched variables, the known variables and the free variables. The 3 variables S e a r c h e d {\displaystyle Searched} , K n o w n {\displaystyle Known} and F r e e {\displaystyle Free} are defined as the conjunction of the variables belonging to these sets. A question is defined as the set of distributions: P ( S e a r c h e d ∣ Known ∧ δ ∧ π ) {\displaystyle P\left(Searched\mid {\text{Known}}\wedge \delta \wedge \pi \right)} made of many "instantiated questions" as the cardinal of K n o w n {\displaystyle Known} , each instantiated question being the distribution: P ( Searched ∣ Known ∧ δ ∧ π ) {\displaystyle P\left({\text{Searched}}\mid {\text{Known}}\wedge \delta \wedge \pi \right)} === Inference === Given the joint distribution P ( X 1 ∧ X 2 ∧ ⋯ ∧ X N ∣ δ ∧ π ) {\displaystyle P\left(X_{1}\wedge X_{2}\wedge \cdots \wedge X_{N}\mid \delta \wedge \pi \right)} , it is always possible to compute any possible question using the following general inference: P ( Searched ∣ Known ∧ δ ∧ π ) = ∑ Free [ P ( Searched ∧ Free ∣ Known ∧ δ ∧ π ) ] = ∑ Free [ P ( Searched ∧ Free ∧ Known ∣ δ ∧ π ) ] P ( Known ∣ δ ∧ π ) = ∑ Free [ P ( Searched ∧ Free ∧ Known ∣ δ ∧ π ) ] ∑ Free ∧ Searched [ P ( Searched ∧ Free ∧ Known ∣ δ ∧ π ) ] = 1 Z × ∑ Free [ P ( Searched ∧ Free ∧ Known ∣ δ ∧ π ) ] {\displaystyle {\begin{aligned}&P\left({\text{Searched}}\mid {\text{Known}}\wedge \delta \wedge \pi \right)\\={}&\sum _{\text{Free}}\left[P\left({\text{Searched}}\wedge {\text{Free}}\mid {\text{Known}}\wedge \delta \wedge \

    Read more →
  • Digital zombie

    Digital zombie

    A digital zombie is a person so engaged with digital technology or social media they are unable to separate themselves from a persistent online presence. Writing in 2017, University of Sydney researcher Andrew Campbell expressed concerns over whether or not the individual can truly live a full and healthy life while they are preoccupied with the digital world. Other individuals have also begun referencing certain types of behaviour with being a digital zombie. Stefanie Valentic, managing editor of EHS Today, refers to it as people hunting digital creatures through their smartphones in public spaces, always fixed on their phones. The University of Warwick has used the term to argue that further research needs to be done with people who exist in digital form after death to help people grieve their loss. == Modern applications == === Distracted walking === The term digital zombie can refer to a person performing distracted walking, which has been labelled dangerous by the American Academy of Orthopaedic Surgeons. They created the "Digital Deadwalkers" campaign after physicians became aware of the risks associated with walking across intersections and sidewalks while paying attention only to smartphones and not one's surroundings. Also stating that the name is derived from the fact that "they're oblivious to everyone else, so it's like they're dead-walking, sleepwalking." === Living through media === The Department of Sociology, University of Warwick has also identified the term, digital zombie, to refer to an individual who has died but is digitally resurrected, reanimated and socially active. These digital zombies do things in death they did not do when they were alive as they "live" again through a digital self on a digital medium. Dead celebrities sometimes become digital zombies when they are reanimated to appear in commercial advertisements (such as Audrey Hepburn and Bob Monkhouse). Other accidental digital zombies include Tupac Shakur and Michael Jackson who were both digitally resurrected and recreated to perform "live" on stage years after their death. Researchers at the University of Warwick have carried out research into the area of human-computer interaction. in an effort to understand the affect these digital zombies have on grief and bereavement. === Mobile gaming === Writer for EHS Today, Stefanie Valentic, has made observations with the mobile phone video game Pokémon Go, which offers players the experience to hunt and collect digital creatures called Pokémon through their smartphone in real world. Players can be observed simultaneously gazing at their phone while also obliviously walking around their environments looking for Pokémon. Stefanie references these individuals as "digital zombies" since they walk around with no cognition of their surroundings while engaged with their phone. == Health risks == === Heavy use of technology === Research by the University of Sydney has begun looking at how new technology such as digital media and smartphones impact our lives and questioning whether they can create new compulsions and obsessions. The research demonstrates that increased heavy technological use can have negative health consequences similar to drugs, smoking, and alcohol. Marcel O'Gorman, an associate professor of English at the University of Waterloo, has commented on the body of research examining how technology impacts cognition, stating currently that there is no empirical evidence to support any theories that suggest that technology can damage memory and attention span. === Heightened risk to children === Manfred Spitzer, a German psychiatrist, has raised concerns with providing digital devices to children. During the early childhood stage while their brains are rapidly growing, increased exposure to digital devices may deprive them of necessary development required to facilitate brain growth. These concerns are also shared by Korean doctors who believe giving digital devices, like smartphones to children, limits their cognitive development.

    Read more →
  • LTE Advanced

    LTE Advanced

    LTE Advanced, also named or recognized as LTE+, LTE-A or 4G+, is a 4G mobile cellular communication standard developed by 3GPP as a major enhancement of the Long Term Evolution (LTE) standard. Three technologies from the LTE-Advanced tool-kit – carrier aggregation, 4x4 MIMO and 256QAM modulation in the downlink – if used together and with sufficient aggregated bandwidth, can deliver maximum peak downlink speeds approaching, or even exceeding, 1 Gbit/s. This is significantly more than the peak 300 Mbit/s rate offered by the preceding LTE standard. Later developments have resulted in LTE Advanced Pro (or 4.9G) which increases bandwidth even further. The first ever LTE Advanced network was deployed in 2013 by SK Telecom in South Korea. In August 2019, the Global mobile Suppliers Association (GSA) reported that there were 304 commercially launched LTE-Advanced networks in 134 countries. Overall, 335 operators are investing in LTE-Advanced (in the form of tests, trials, deployments or commercial service provision) in 141 countries. == Name == LTE Advanced is also named (indicated as) LTE+, LTE-A, or (on Samsung Galaxy and Xiaomi smartphones) as 4G+. Such networks have also often been described as ‘Gigabit LTE networks’ mirroring a term that is also used in the fixed broadband industry. == History == The mobile communication industry and standards organizations have therefore started work on 4G access technologies, such as LTE Advanced. At a workshop in April 2008 in China, 3GPP agreed the plans for work on Long Term Evolution (LTE). A first set of specifications were approved in June 2008. Besides the peak data rate 1 Gb/s as defined by the ITU-R, it also targets faster switching between power states and improved performance at the cell edge. Detailed proposals are being studied within the working groups. The LTE+ format was first proposed by NTT DoCoMo of Japan and has been adopted as the international standard. It was formally submitted as a candidate 4G to ITU-T in late 2009 as meeting the requirements of the IMT-Advanced standard, and was standardized by the 3rd Generation Partnership Project (3GPP) in March 2011 as 3GPP Release 10. The work by 3GPP to define a 4G candidate radio interface technology started in Release 9 with the study phase for LTE-Advanced. Being described as a 3.9G (beyond 3G but pre-4G), the first release of LTE did not meet the requirements for 4G (also called IMT Advanced as defined by the International Telecommunication Union) such as peak data rates up to 1 Gb/s. The ITU has invited the submission of candidate Radio Interface Technologies (RITs) following their requirements in a circular letter, 3GPP Technical Report (TR) 36.913, "Requirements for Further Advancements for E-UTRA (LTE-Advanced)." These are based on ITU's requirements for 4G and on operators’ own requirements for advanced LTE. Major technical considerations include the following: Continual improvement to the LTE radio technology and architecture Scenarios and performance requirements for working with legacy radio technologies Backward compatibility of LTE-Advanced with LTE. An LTE terminal should be able to work in an LTE-Advanced network and vice versa. Any exceptions will be considered by 3GPP. Consideration of recent World Radiocommunication Conference (WRC-07) decisions regarding frequency bands to ensure that LTE-Advanced accommodates the geographically available spectrum for channels above 20 MHz. Also, specifications must recognize those parts of the world in which wideband channels are not available. Likewise, 'WiMAX 2', 802.16m, has been approved by ITU as the IMT Advanced family. WiMAX 2 is designed to be backward compatible with WiMAX 1 devices. Most vendors now support conversion of 'pre-4G', pre-advanced versions and some support software upgrades of base station equipment from 3G. == Proposals == The target of 3GPP LTE Advanced is to reach and surpass the ITU requirements. LTE Advanced should be compatible with first release LTE equipment, and should share frequency bands with first release LTE. In the feasibility study for LTE Advanced, 3GPP determined that LTE Advanced would meet the ITU-R requirements for 4G. The results of the study are published in 3GPP Technical Report (TR) 36.912. One of the important LTE Advanced benefits is the ability to take advantage of advanced topology networks; optimized heterogeneous networks with a mix of macrocells with low power nodes such as picocells, femtocells and new relay nodes. The next significant performance leap in wireless networks will come from making the most of topology, and brings the network closer to the user by adding many of these low power nodes – LTE Advanced further improves the capacity and coverage, and ensures user fairness. LTE Advanced also introduces multicarrier to be able to use ultra wide bandwidth, up to 100 MHz of spectrum supporting very high data rates. In the research phase many proposals have been studied as candidates for LTE Advanced (LTE-A) technologies. The proposals could roughly be categorized into: Support for relay node base stations Coordinated multipoint (CoMP) transmission and reception UE Dual TX antenna solutions for SU-MIMO and diversity MIMO, commonly referred to as 2x2 MIMO Scalable system bandwidth exceeding 20 MHz, up to 100 MHz Carrier aggregation of contiguous and non-contiguous spectrum allocations Local area optimization of air interface Nomadic / Local Area network and mobility solutions Flexible spectrum usage Cognitive radio Automatic and autonomous network configuration and operation Support of autonomous network and device test, measurement tied to network management and optimization Enhanced precoding and forward error correction Interference management and suppression Asymmetric bandwidth assignment for FDD Hybrid OFDMA and SC-FDMA in uplink UL/DL inter eNB coordinated MIMO SONs, Self Organizing Networks methodologies Within the range of system development, LTE-Advanced and WiMAX 2 can use up to 8x8 MIMO and 128-QAM in downlink direction. Example performance: 100 MHz aggregated bandwidth, LTE-Advanced provides almost 3.3 Gbit peak download rates per sector of the base station under ideal conditions. Advanced network architectures combined with distributed and collaborative smart antenna technologies provide several years road map of commercial enhancements. The 3GPP standards Release 12 added support for 256-QAM. A summary of a study carried out in 3GPP can be found in TR36.912. == Timeframe and introduction of additional features == Original standardization work for LTE-Advanced was done as part of 3GPP Release 10, which was frozen in April 2011. Trials were based on pre-release equipment. Major vendors support software upgrades to later versions and ongoing improvements. In order to improve the quality of service for users in hotspots and on cell edges, heterogeneous networks (HetNets) are formed of a mixture of macro-, pico- and femto base stations serving corresponding-size areas. Frozen in December 2012, 3GPP Release 11 concentrates on better support of HetNet. Coordinated Multi-Point operation (CoMP) is a key feature of Release 11 in order to support such network structures. Whereas users located at a cell edge in homogenous networks suffer from decreasing signal strength compounded by neighbor cell interference, CoMP is designed to enable use of a neighboring cell to also transmit the same signal as the serving cell, enhancing quality of service on the perimeter of a serving cell. In-device Co-existence (IDC) is another topic addressed in Release 11. IDC features are designed to ameliorate disturbances within the user equipment caused between LTE/LTE-A and the various other radio subsystems such as WiFi, Bluetooth, and the GPS receiver. Further enhancements for MIMO such as 4x4 configuration for the uplink were standardized. The higher number of cells in HetNet results in user equipment changing the serving cell more frequently when in motion. The ongoing work on LTE-Advanced in Release 12, amongst other areas, concentrates on addressing issues that come about when users move through HetNet, such as frequent hand-overs between cells. It also included use of 256-QAM. == First technology demonstrations and field trials == This list covers technology demonstrations and field trials up to the year 2014, paving the way for a wider commercial deployment of the VoLTE technology worldwide. From 2014 onwards various further operators trialled and demonstrated the technology for future deployment on their respective networks. These are not covered here. Instead a coverage of commercial deployments can be found in the section below. == LTE Advanced Pro == LTE Advanced Pro (LTE-A Pro, also known as 4.5G, 4.5G Pro, 4.9G, Pre-5G, 5G Project) is a name for 3GPP release 13 and 14. It is an evolution of LTE Advanced (LTE-A) cellular standard supporting data rates in excess of 3 Gbit/s using 32-carrier aggregation. It also introduces th

    Read more →
  • Information element

    Information element

    An information element, sometimes informally referred to as a field, is an item in Q.931 and Q.2931 messages, IEEE 802.11 management frames, and cellular network messages sent between a base transceiver station and a mobile phone or similar piece of user equipment. An information element is often a type–length–value item, containing 1) a type (which corresponds to the label of a field), a length indicator, and a value, although any combination of one or more of those parts is possible. A single message may contain multiple information elements. The abbreviation IE is found in many technical specification documents from 3GPP. It is not uncommon for a single specification document to contain thousands of references to IEs.

    Read more →
  • Psychology in cybersecurity

    Psychology in cybersecurity

    The psychology of cybersecurity (often intersecting with usable security and cyberpsychology) is an interdisciplinary field studying how human behavior, cognitive biases, and social dynamics influence information security. While traditional cybersecurity focuses on hardware and software vulnerabilities, this discipline addresses the "human factor," which is exploited in cyberattacks. Psychology in cybersecurity draws from cognitive psychology and human–computer interaction. == History and evolution == The challenge of human behavior in computing was noted as early as the 1960s with multi-user mainframes like the Compatible Time-Sharing System (CTSS). In 1966, a software error on CTSS caused the system's master password file to be displayed to every user upon login—one of the earliest documented security incidents attributable to a combination of system design and human factors. These behaviors gained broader significance in the 1990s as the Internet became widely accessible. High-profile incidents involving figures like Kevin Mitnick demonstrated how human trust could be exploited through social engineering such as pretexting over the phone. == Cognitive and behavioral factors == Much of the psychology of cybersecurity focuses on decision-making under stress or uncertainty. Researchers apply frameworks like dual process theory to explain why humans fall for phishing or business email compromise. Threat actors design malicious communications to trigger fast, emotional "System 1" thinking—using urgency, authority, or panic, which prompts users to click a link or wire funds before their analytical "System 2" can assess the situation's legitimacy. Industry research has consistently documented the effectiveness of these techniques at scale, pointing to several recurring psychological phenomena that influence daily security practices: Cognitive biases: The optimism bias leads users to believe they are unlikely to be targeted by cybercriminals, resulting in lax password practices or delayed software updates. The availability heuristic causes individuals to focus on highly publicized, sophisticated threats while ignoring common, statistically probable risks like credential reuse. Social influence: Attackers leverage established principles of persuasion, such as those categorized by Robert Cialdini. Impersonating a CEO leverages the psychological trigger of authority, while fake tech support scams use reciprocity (offering to fix a problem before asking for network credentials). == Neurological and pre-cognitive factors == Functional magnetic resonance imaging (fMRI) studies show that neural activation in visual and attentional regions decreases with repeated exposure to the same stimulus, a phenomenon termed repetition suppression. Experiments have confirmed this effect in the context of security warnings: static warning designs produce declines in user attention and adherence. Information processing research on phishing indicates that affective cues, such as artificial urgency or fear, increase cognitive load and elicit automatic heuristic processing, reducing the likelihood of analytical evaluation and facilitating compliance with malicious requests. == Security fatigue and organizational dynamics == Aggressive cybersecurity postures can sometimes lead to mental and emotional exhaustion, a phenomenon known as security fatigue. === Alert fatigue === One example is alert fatigue, which most frequently affects both end-users and security operations center analysts. Continuous exposure to browser warnings or antivirus pop-ups, particularly those that are false positives, conditions users to dismiss alerts automatically due to the volume of notifications rather than their repetitive appearance (see § Neurological and pre-cognitive factors). The scale of this problem is significant in enterprise: SOC teams in large organizations receive thousands of alerts daily, and a survey published in ACM Computer Surveys found that analysts spend over 25% of their time handling false positives, meaning that malicious indicators can be buried in the noise. === Password fatigue === Similarly, password fatigue is the feeling experienced by many people who are required to remember an excessive number of passwords as part of their daily routine, such as to log in to a computer at work. Users cope with the memory burden by making predictable, iterative changes to their passwords (such as updating "Password01!" to "Password02!"), which decreases password security.

    Read more →
  • Zero-overhead looping

    Zero-overhead looping

    In computer architecture, zero-overhead looping is a hardware feature found in some processors that enables loops to execute without the performance cost of traditional loop control instructions. Instead of software managing loop iterations, the processor's hardware handles repetition automatically, saving clock cycles and improving efficiency. This technique is commonly employed in digital signal processors (DSPs) and certain complex instruction set computer (CISC) architectures. == Background == In many instruction sets, a loop must be implemented by using instructions to increment or decrement a counter, check whether the end of the loop has been reached, and if not jump to the beginning of the loop so it can be repeated. Although this typically only represents around 3–16 bytes of space for each loop, even that small amount could be significant depending on the size of the CPU caches. More significant is that those instructions each take time to execute, time which is not spent doing useful work. The overhead of such a loop is apparent compared to a completely unrolled loop, in which the body of the loop is duplicated exactly as many times as it will execute. In that case, no space or execution time is wasted on instructions to repeat the body of the loop. However, the duplication caused by loop unrolling can significantly increase code size, and the larger size can even impact execution time due to cache misses. (For this reason, it's common to only partially unroll loops, such as transforming it into a loop which performs the work of four iterations in one step before repeating. This balances the advantages of unrolling with the overhead of repeating the loop.) Moreover, completely unrolling a loop is only possible for a limited number of loops: those whose number of iterations is known at compile time. For example, the following C code could be compiled and optimized into the following x86 assembly code: == Implementation == Processors with zero-overhead looping have machine instructions and registers to automatically repeat one or more instructions. Depending on the instructions available, these may only be suitable for count-controlled loops ("for loops") in which the number of iterations can be calculated in advance, or only for condition-controlled loops ("while loops") such as operations on null-terminated strings. === Examples === ==== PIC ==== In the PIC instruction set, the REPEAT and DO instructions implement zero-overhead loops. REPEAT only repeats a single instruction, while DO repeats a specified number of following instructions. ==== Blackfin ==== Blackfin offers two zero-overhead loops. The loops can be nested; if both hardware loops are configured with the same "loop end" address, loop 1 will behave as the inner loop and repeat, and loop 0 will behave as the outer loop and repeat only if loop 1 would not repeat. Loops are controlled using the LTx and LBx registers (x either 0 to 1) to set the top and bottom of the loop — that is, the first and last instructions to be executed, which can be the same for a loop with only one instruction — and LCx for the loop count. The loop repeats if LCx is nonzero at the end of the loop, in which case LCx is decremented. The loop registers can be set manually, but this would typically consume 6 bytes to load the registers, and 8–16 bytes to set up the values to be loaded. More common is to use the loop setup instruction (represented in assembly as either LOOP with pseudo-instruction LOOP_BEGIN and LOOP_END, or in a single line as LSETUP), which optionally initializes LCx and sets LTx and LBx to the desired values. This only requires 4–6 bytes, but can only set LTx and LBx within a limited range relative to where the loop setup instruction is located. ==== x86 ==== The x86 assembly language REP prefixes implement zero-overhead loops for a few instructions (namely MOVS/STOS/CMPS/LODS/SCAS). Depending on the prefix and the instruction, the instruction will be repeated a number of times with (E)CX holding the repeat count, or until a match (or non-match) is found with AL/AX/EAX or with DS:[(E)SI]. This can be used to implement some types of searches and operations on null-terminated strings.

    Read more →
  • Industry Dive

    Industry Dive

    Industry Dive is a United States-based business-to-business news organization with an estimated 18 million readers in more than 25 industries, such as banking and waste management. Since 2022, it has been owned by Informa plc. Industry Dive aims to serve business executives who read news on their mobile phones. The company had an estimated revenue of more than of more than $110 million in 2023. As of 2020, it has more than 300 employees, including 80 journalists and 12 engineers. Its headquarters is in Washington, D.C. == History == Industry Dive was formed in 2012 by Sean Griffey (president), Eli Dickinson (chief technology officer), and Ryan Willumson (chief revenue officer). It was funded with $900,000 from private investors in 2012 and 2013. The company covered five industries: construction, education, marketing, utility, and waste. In 2016, it began its Dive Awards. Industry Dive's revenues quadrupled from 2015 to 2018, putting it in the top half of the Deloitte Technology Fast 500 and the top 20 percent of the Inc. Top 5000 list. In 2019, Falfurrias Capital Partners acquired a majority stake in the company. ID's content marketing clients included IBM, Siemens, and UPS. In 2020, DCA Live named Industry Dive to its "Red Hot Companies" list, which recognizes the D.C. area's 'fastest-growing' companies. In the same year, Industry Dive acquired CFO. In 2021, Industry Dive acquired PharmaVOICE. In 2022, it was purchased by Informa plc, which bought its majority stake from Falfurrias Capital Partners for about $530 million. == Publications == Industry Dive provides news coverage of a variety of industries including agriculture, banking, construction, education, fashion, healthcare, and manufacturing, each using a different website: == Awards == Industry Dive publications have received several national and regional Awards of Excellence from the American Society of Business Publication Editors, including for a series of 2020 articles about Big Pharma and the race for the coronavirus vaccine. The Washington Post recognized Industry Dive as a top place to work for four consecutive years, from 2016 to 2020.

    Read more →
  • Open Sound Control

    Open Sound Control

    Open Sound Control (OSC) is a protocol for networking sound synthesizers, computers, and other multimedia devices for purposes such as musical performance or show control. OSC's advantages include interoperability, accuracy, flexibility and enhanced organization and documentation. Its disadvantages include higher bandwidth requirements, increased load on embedded processors, and lack of standardized messages/interoperability. The first specification was released in March 2002. == Motivation == OSC is a content format developed at CNMAT by Adrian Freed and Matt Wright comparable to XML, WDDX, or JSON. It was originally intended for sharing music performance data (gestures, parameters and note sequences) between musical instruments (especially electronic musical instruments such as synthesizers), computers, and other multimedia devices. OSC is sometimes used as an alternative to the 1983 MIDI standard, when higher resolution and a richer parameter space is desired. OSC messages are transported across the internet and within local subnets using UDP/IP and Ethernet. OSC messages between gestural controllers are usually transmitted over serial endpoints of USB wrapped in the SLIP protocol. == Features == OSC's main features, compared to MIDI, include: Open-ended, dynamic, URI-style symbolic naming scheme Symbolic and high-resolution numeric data Pattern matching language to specify multiple recipients of a single message High resolution time tags "Bundles" of messages whose effects must occur simultaneously == Applications == There are dozens of OSC applications, including real-time sound and media processing environments, web interactivity tools, software synthesizers, programming languages and hardware devices. OSC has achieved wide use in fields including musical expression, robotics, video performance interfaces, distributed music systems and inter-process communication. The TUIO community standard for tangible interfaces such as multitouch is built on top of OSC. Similarly the GDIF system for representing gestures integrates OSC. OSC is used extensively in experimental musical controllers, and has been built into several open source and commercial products. The Open Sound World (OSW) music programming language is designed around OSC messaging. OSC is the heart of the DSSI plugin API, an evolution of the LADSPA API, in order to make the eventual GUI interact with the core of the plugin via messaging the plugin host. LADSPA and DSSI are APIs dedicated to audio effects and synthesizers. In 2007, a standardized namespace within OSC called SYN, for communication between controllers, synthesizers and hosts, was proposed. == Design == OSC messages consist of an address pattern (such as /oscillator/4/frequency), a type tag string (such as ,fi for a float32 argument followed by an int32 argument), and the arguments themselves (which may include a time tag). Address patterns form a hierarchical name space, reminiscent of a Unix filesystem path, or a URL, and refer to "Methods" inside the server, which are invoked with the attached arguments. Type tag strings are a compact string representation of the argument types. Arguments are represented in binary form with four-byte alignment. The core types supported are 32-bit two's complement signed integers 32-bit IEEE floating point numbers Null-terminated arrays of eight-bit encoded data (C-style strings) arbitrary sized blob (e.g. audio data, or a video frame) An example message is included in the spec (with null padding bytes represented by ␀): /oscillator/4/frequency␀,f␀␀, Followed by the 4-byte float32 representation of 440.0: 0x43dc0000. Messages may be combined into bundles, which themselves may be combined into bundles, etc. Each bundle contains a timestamp, which determines whether the server should respond immediately or at some point in the future. Applications commonly employ extensions to this core set. More recently some of these extensions such as a compact Boolean type were integrated into the required core types of OSC 1.1. The advantages of OSC over MIDI are primarily internet connectivity; data type resolution; and the comparative ease of specifying a symbolic path, as opposed to specifying all connections as seven-bit numbers with seven-bit or fourteen-bit data types. This human-readability has the disadvantage of being inefficient to transmit and more difficult to parse by embedded firmware, however. The spec does not define any particular OSC Methods or OSC Containers. All messages are implementation-defined and vary from server to server.

    Read more →
  • Expectation propagation

    Expectation propagation

    Expectation propagation (EP) is a technique in Bayesian machine learning. EP finds approximations to a probability distribution. It uses an iterative approach that uses the factorization structure of the target distribution. It differs from other Bayesian approximation approaches such as variational Bayesian methods. More specifically, suppose we wish to approximate an intractable probability distribution p ( x ) {\displaystyle p(\mathbf {x} )} with a tractable distribution q ( x ) {\displaystyle q(\mathbf {x} )} . Expectation propagation achieves this approximation by minimizing the Kullback–Leibler divergence K L ( p | | q ) {\displaystyle \mathrm {KL} (p||q)} . Variational Bayesian methods minimize K L ( q | | p ) {\displaystyle \mathrm {KL} (q||p)} instead. If q ( x ) {\displaystyle q(\mathbf {x} )} is a Gaussian N ( x | μ , Σ ) {\displaystyle {\mathcal {N}}(\mathbf {x} |\mu ,\Sigma )} , then K L ( p | | q ) {\displaystyle \mathrm {KL} (p||q)} is minimized with μ {\displaystyle \mu } and Σ {\displaystyle \Sigma } being equal to the mean of p ( x ) {\displaystyle p(\mathbf {x} )} and the covariance of p ( x ) {\displaystyle p(\mathbf {x} )} , respectively; this is called moment matching. == Applications == Expectation propagation via moment matching plays a vital role in approximation for indicator functions that appear when deriving the message passing equations for TrueSkill.

    Read more →
  • Alt TikTok

    Alt TikTok

    Alt TikTok (or 2020 Alt) was an online youth subculture and internet community that emerged on TikTok in 2020. Alt TikTok users (also known as alt girls, alt boys, or alt kids) emerged as primarily LGBTQ+ individuals who were in contrast to "Straight TikTok" which was seen as the mainstream and heteronormative side of the platform. The subculture became closely associated with music surrounding the hyperpop scene, particularly 100 gecs and also led to a short-lived fashion style and Internet aesthetic adopted by Generation Z during the COVID-19 lockdowns. Notable artists associated with the movement included Girl in Red, Freddie Dredd, David Shawty, WHOKILLEDXIX, and 645AR. While "alt kid" might imply a general association with traditional alternative fashion, the subculture was more an offshoot of e-girls and e-boys. In 2023, the hashtag #altfashion on TikTok amassed over 1.8 billion views. == History == Around mid-2020, users on TikTok began to group different content on the site into labels like "elite TikTok", "deep TikTok", and "floptok". These categories acted as different "sides of TikTok", deviating from mainstream lip syncing, online trends, and dance videos. Alt TikTok became one of the many subcultural communities to emerge during this period, initially referred to interchangeably with "elite TikTok". The movement quickly identified itself with alternative and queer users, in contrast to "Straight TikTok", also known as the "straight side of TikTok", which was seen as the mainstream and heteronormative side of the platform. Alt TikTok was accompanied by memes with surrealist or supernatural themes (sometimes being described as cursed), such as videos with heavy saturation and humanoid animals. One of the popular videos from Alt TikTok, gaining 18 million likes, shows a llama dancing to a cover of a song from a Russian commercial by the cereal brand Miel Pops, later becoming a viral audio. Some Alt TikTok users personified brands and products in what was referred to as Retail TikTok. In 2020, Rolling Stone described Alt TikTok as "one of the primary countercultures on the app." In 2020, American journalist Taylor Lorenz stated in an article of The New York Times, "Every pop sensation needs its ironic counterpoints. Alt Tiktok gets it done. [...] alt TikTok stars like Mooptopia are mainstays on the more indie side of the app. They aren't the popular crowd, but their cool, quirky content still attracts millions." === Trump rally trolling === In June 2020, alt TikTok and K-pop twitter users coordinated a strategy to ruin a Trump rally in Tulsa, Oklahoma. American politician and activist Alexandria Ocasio-Cortez later saluted the individuals for their "Trump troll". == Alt subculture == In 2020, Alt TikTok was one of many subcultural communities to emerge on TikTok, alongside Deep TikTok (aka DeepTok) and Flop TikTok (aka Floptok). The alt kid subculture emerged from Alt TikTok primarily among young Gen Z women, influenced by online fashion and aesthetics shaped by e-girls and e-boys. The movement was accelerated by the COVID-19 lockdowns, while the subculture itself stood in opposition to mainstream "Straight TikTok" and the VSCO girl movement, primarily adopting aspects of queer and alternative culture. While the phrase might imply a general association with alternative fashion or alternative culture, it is more accurately understood as a specific internet-driven outgrowth of online aesthetic youth subcultures like e-girls and e-boys. The alt subculture's visual style blended influences from goth, punk, emo, and grunge, often expressed through fashion, music taste, and online presence. === Style and music === The style of alt-girls is reminiscent of a myriad of previous alternative fashion trends, often blending these influences with online aesthetics. In 2020, TikTok alt-girls were teens ranging from ages 13 to 16, who tended to wear friendship bracelets, goth boots, Dr. Martens, bunny and frog hats, piercings, and split-dyed hair, as well as iconography lifted from Monster Energy and Hello Kitty. Some alt-girls displayed a love of cosplay, while drawing from Japanese anime and manga, particularly Danganronpa and Haikyu!!, which originally gained traction on the app through Anime TikTok (aka Anitok). Alt TikTok has been noted for being primarily influenced by queer and alternative culture, positioning itself in contrast to "Straight TikTok", which focused on mainstream dances and music. Alt kids frequently intersected with the e-girls and e-boys subculture, in terms of music, style, visual media, and aesthetics. Several musicians and artists were closely associated with the alt subculture, particularly those in the hyperpop scene, while alt tiktok users became important in the wider popularization of artists like 100 gecs. Notable prominent artists associated with Alt Tiktok included Girl in Red, Freddie Dredd, David Shawty, WHOKILLEDXIX, and 645AR, alongside music by YouTubers turned musicians such as Wilbur Soot's "I'm in Love With an E‐Girl" and Corpse Husband's "E-Girls Are Ruining My Life!". == Legacy == In 2020, Pitchfork claimed Alt TikTok as having an influence on wider music trends, stating: "Alt TikTok's music is now a hot zone for major record labels, pushing it even further into the mainstream". After the COVID-19 lockdowns, Alt TikTok, alongside its subculture, fell out of prominence and was taken over by other Gen Z-related internet aesthetics, developments, and online trends.

    Read more →
  • AMiner (database)

    AMiner (database)

    AMiner (formerly ArnetMiner) is a free online service used to index, search, and mine big scientific data. == Overview == AMiner (ArnetMiner) is designed to search and perform data mining operations against academic publications on the Internet, using social network analysis to identify connections between researchers, conferences, and publications. This allows it to provide services such as expert finding, geographic search, trend analysis, reviewer recommendation, association search, course search, academic performance evaluation, and topic modeling. AMiner was created as a research project in social influence analysis, social network ranking, and social network extraction. A number of peer-reviewed papers have been published arising from the development of the system. It has been in operation for more than three years, and has indexed 130,000,000 researchers and more than 265 million publications. The research was funded by the Chinese National High-tech R&D Program and the National Science Foundation of China. AMiner is commonly used in academia to identify relationships between and draw statistical correlations about research and researchers. It has attracted more than 10 million independent IP accesses from 220 countries and regions. The product has been used in Elsevier's SciVerse platform, and academic conferences such as SIGKDD, ICDM, PKDD, WSDM. == Operation == AMiner automatically extracts the researcher profile from the web. It collects and identifies the relevant pages, then uses a unified approach to extract data from the identified documents. It also extracts publications from online digital libraries using heuristic rules. It integrates the extracted researchers’ profiles and the extracted publications. It employs the researcher name as the identifier. A probabilistic framework has been proposed to deal with the name ambiguity problem in the integration. The integrated data is stored into a researcher network knowledge base (RNKB). The principal other product in the area are Google Scholar, Elsevier's Scirus, and the open source project CiteSeer. == History == It was initiated and created by professor Jie Tang from Tsinghua University, China. It was first launched in March 2006. The following provide a list of updates in the past years: March 2006, Version 0.1, Functions include researcher profiling, expert search, conference search, and publication search. The system was developed in Perl; August 2006, Version 1.0, The system was re-implemented in Java; July 2007, Version 2.0, New functions include researcher interest mining, association search, survey paper finding (unavailable now); April 2008, Version 3.0, New functions include query understanding, new GUI, and search log analysis; November 2008, Version 4.0, New functions include graph search, topic modeling, NSF/NSFC funding information extraction; April 2009, Version 5.0, New functions include Profile edition, open API service, Bole search, course search (unavailable now); December 2009, Version 6.0, New functions include academic performance evaluation, user feedback, conference analysis; May 2010, Version 7.0, New functions include name disambiguation, paper-reviewer recommendation, ArnetPage creation; March 2012, Version II, renamed as AMiner, rewrote all the codes and redesign the GUI. New functions include: geographic search, ArnetAPP platform. June 2014, Version II, renamed as AMiner, rewrote all the codes and redesign the GUI. New functions include: geographic search, ArnetAPP platform. December 2015, a completely new version got online. May 2017, professional version got online. April 2018, New functions include Trend Analysis, a deep learning based Name Disambiguation == Resources == AMiner published several datasets for academic research purpose, including Open Academic Graph, DBLP+citation (a data set augmenting citations into the DBLP data from Digital Bibliography & Library Project), Name Disambiguation, Social Tie Analysis. For more available datasets and source codes for research, please refer to.

    Read more →
  • FreePBX Distro

    FreePBX Distro

    The FreePBX Distro was a freeware unified communications software system that consisted of FreePBX, a graphical user interface (GUI) for configuring, controlling and managing Asterisk PBX software. The FreePBX Distro included packages that offer VoIP, PBX, Fax, IVR, voice-mail and email functions. The FreePBX Distro Linux distribution was based on CentOS, which maintains binary compatibility with Red Hat Enterprise Linux. FreePBX has contributed to the popularity of Asterisk. As a result of CentOS Linux being discontinued and the last version of CentOS 7 going out of support on June 30, 2024, FreePBX 17 has moved over to and is supported on Debian Linux. FreePBX will no longer be providing a pre-configured FreePBX Distro, but will provide a script to install FreePBX on a fresh install of Debian Linux. In-place migration will not be possible, but will be possible by restoring a backup on the new version from the previous version. As FreePBX 16 will be supported until the release of FreePBX 18, FreePBX on this distribution will still work and be supported, however, there will be no further support for the underlying operating system. == Installation == The Official FreePBX Distro is installed from a ISO image available by web download, that includes the system CentOS, Asterisk, FreePBX GUI and assorted dependencies. This can then either be burned to DVD or written to a USB stick for installation == Support for telephony hardware == The FreePBX Distro has built-in support for cards from multiple vendors, including Digium, OpenVox, Alto, Rhino Equipment, Xorcom and Sangoma. The FreePBX Distro supports a large number of phone models via open-source modules. Supported VoIP phone manufacturers include Algo, AND, AudioCodes, Cisco, Cyberdata, Digium, Grandstream, Mitel/Aastra, Nortel/Avaya, Panasonic, Polycom, Sangoma, Snom, Xorcom and Yealink. == Development == FreePBX made its debut in 2004 as the AMP project (Asterisk Management Portal). The FreePBX Distro was released in 2011 as an turnkey solution for building a PBX using Asterisk, CentOS and FreePBX. FreePBX has over 1 million active production PBXs and over 20,000 new systems added each month. The core telephony engine is Asterisk, as configured by the Open Source FreePBX GUI. The last stable release is FreePBX Distro Stable SNG7-PBX16-64bit-2302-1 based on these main components: FreePBX 16 CentOS 7.8 Asterisk 16, 18, 19 (20 supported by upgrade once installed)

    Read more →
  • Large language model

    Large language model

    A large language model (LLM) is a neural network trained on a vast amount of text for natural language processing tasks, especially language generation. LLMs can typically generate, summarize, translate and analyze text in many contexts, and are a foundational technology behind modern chatbots. Biased or inaccurate training data can make an LLM's output less reliable. As of 2026, the most capable LLMs are based on transformer architectures, which, according to the 2017 paper "Attention Is All You Need", can be more efficient and parallelizable than earlier statistical and recurrent neural network models. Benchmark evaluations for LLMs attempt to measure model reasoning, factual accuracy, alignment, and safety. == History == Before the emergence of transformer-based models in 2017, some language models were considered large relative to the computational and data constraints of their time. In the early 1990s, IBM's statistical models pioneered word alignment techniques for machine translation, laying the groundwork for corpus-based language modeling. In 2001, a smoothed n-gram model, such as those employing Kneser–Ney smoothing, trained on 300 million words, achieved state-of-the-art perplexity on benchmark tests. During the 2000s, with the rise of widespread internet access, researchers began compiling massive text datasets from the web ("web as corpus") to train statistical language models. Moving beyond n-gram models, researchers started in 2000 to use neural networks as language models. Following the breakthrough of deep neural networks in image classification around 2012, similar architectures were adapted for language tasks. This shift was marked by the development of word embeddings (e.g., Word2Vec by Mikolov in 2013) and sequence-to-sequence (seq2seq) models using LSTM. In 2016, Google transitioned its translation service to neural machine translation (NMT), replacing statistical phrase-based models with deep recurrent neural networks. These early NMT systems used LSTM-based encoder-decoder architectures, as they preceded the invention of transformers. At the 2017 NeurIPS conference, Google researchers introduced the transformer architecture in their landmark paper "Attention Is All You Need". This paper's goal was to improve upon 2014 seq2seq technology, and was based mainly on the attention mechanism developed by Bahdanau et al. in 2014. The following year in 2018, BERT was introduced and quickly became "ubiquitous". Though the original transformer has both encoder and decoder blocks, BERT is an encoder-only model. Academic and research usage of BERT began to decline in 2023, following rapid improvements in the abilities of decoder-only models (such as GPT) to solve tasks via prompting. Although decoder-only GPT-1 was introduced in 2018, it was GPT-2 in 2019 that caught widespread attention because OpenAI claimed to have initially deemed it too powerful to release publicly, out of fear of malicious use. GPT-3 in 2020 went a step further and as of 2025 is available only via API with no offering of downloading the model to execute locally. But it was the consumer-facing chatbot ChatGPT in late 2022 that received extensive media coverage and public attention by 2023. The 2023 GPT-4 was praised for its increased accuracy and as a "holy grail" for its multimodal capabilities. OpenAI did not reveal the high-level architecture and the number of parameters of GPT-4. The release of ChatGPT led to an uptick in LLM usage across several research subfields of computer science, including robotics, software engineering, and societal impact work. In 2024, OpenAI released the reasoning model OpenAI o1, which generates long chains of thought before returning a final answer. Many LLMs with parameter counts comparable to those of OpenAI's GPT series have been developed. Since 2022, weights-available models have been gaining popularity, especially at first with BLOOM and LLaMA, though both have restrictions on usage and deployment. Mistral AI's open-weight models Mistral 7B and Mixtral 8x7B have a more permissive Apache License. In January 2025, DeepSeek released DeepSeek R1, a 671-billion-parameter open-weight model that performs comparably to OpenAI o1 but at a much lower price per token for users. Since 2023, many LLMs have been trained to be multimodal, having the ability to also process or generate other types of data, such as images, audio, or 3D meshes. Open-weight LLMs have become more influential since 2023. Per Vake et al. (2025), community-driven contributions to open-weight models improve their efficiency and performance via collaborative platforms such as Hugging Face. == Dataset preprocessing == === Tokenization === As machine learning algorithms process numbers rather than text, the text must be converted to numbers. In the first step, a vocabulary is decided upon, then integer indices are arbitrarily but uniquely assigned to each vocabulary entry, and finally, an embedding is associated with the integer index. Algorithms include byte-pair encoding (BPE) and WordPiece. There are also special tokens serving as control characters, such as [MASK] for masked-out token (as used in BERT), and [UNK] ("unknown") for characters not appearing in the vocabulary. Also, some special symbols are used to denote special text formatting. For example, "Ġ" denotes a preceding whitespace in RoBERTa and GPT and "##" denotes continuation of a preceding word in BERT. For example, the BPE tokenizer used by the legacy version of GPT-3 would split tokenizer: texts -> series of numerical "tokens" as Tokenization also compresses the datasets. Because LLMs generally require input to be an array that is not jagged, the shorter texts must be "padded" until they match the length of the longest one. ==== Byte-pair encoding ==== As an example, consider a tokenizer based on byte-pair encoding. In the first step, all unique characters (including blanks and punctuation marks) are treated as an initial set of n-grams (i.e. initial set of uni-grams). Successively the most frequent pair of adjacent characters is merged into a bi-gram and all instances of the pair are replaced by it. All occurrences of adjacent pairs of (previously merged) n-grams that most frequently occur together are then again merged into even lengthier n-gram, until a vocabulary of prescribed size is obtained. After a tokenizer is trained, any text can be tokenized by it, as long as it does not contain characters not appearing in the initial-set of uni-grams. === Dataset cleaning === In the context of training LLMs, datasets are typically cleaned by removing low-quality, duplicated, or toxic data. Cleaned datasets can increase training efficiency and lead to improved downstream performance. A trained LLM can be used to clean datasets for training a further LLM. With the increasing proportion of LLM-generated content on the web, data cleaning in the future may include filtering out such content. LLM-generated content can pose a problem if the content is similar to human text (making filtering difficult) but of lower quality (degrading performance of models trained on it). === Synthetic data === Training of largest language models might need more linguistic data than naturally available, or that the naturally occurring data is of insufficient quality. In these cases, synthetic data might be used. == Training == An LLM is a type of foundation model (large X model) trained on language. LLMs can be trained in different ways. In particular, GPT models are first pretrained to predict the next word on a large amount of data, before being fine-tuned. === Cost === Substantial infrastructure is necessary for training the largest models. The tendency towards larger models is visible in the list of large language models. For example, the training of GPT-2 (i.e. a 1.5-billion-parameter model) in 2019 cost $50,000, while training of the PaLM (i.e. a 540-billion-parameter model) in 2022 cost $8 million, and Megatron-Turing NLG 530B (in 2021) cost around $11 million. The qualifier "large" in "large language model" is inherently vague, as there is no definitive threshold for the number of parameters required to qualify as "large". === Fine-tuning === Before being fine-tuned, most LLMs are next-token predictors. The fine-tuning shapes the LLM's behavior via techniques like reinforcement learning from human feedback (RLHF) or constitutional AI. Instruction fine-tuning is a form of supervised learning used to teach LLMs to follow user instructions. In 2022, OpenAI demonstrated InstructGPT, a version of GPT-3 similarly fine-tuned to follow instructions. Reinforcement learning from human feedback (RLHF) involves training a reward model to predict which text humans prefer. Then, the LLM can be fine-tuned through reinforcement learning to better satisfy this reward model. Since humans typically prefer truthful, helpful and harmless answers, RLHF favors such answers. == Architecture == LLMs are generally based on the tra

    Read more →
  • Content strategy

    Content strategy

    Content strategy guides the planning, development, and management of content. It is a recognized field in user experience design, and it also draws from adjacent disciplines such as information architecture, content management, business analysis, digital marketing, and technical communication. == Definitions == Content strategy has been described as planning for "the creation, publication, and governance of useful, usable content." It has also been called "a repeatable system that defines the entire editorial content development process for a website development project." In a 2007 article titled "Content Strategy: The Philosophy of Data," Rachel Lovinger describes the goal of content strategy as using "words and data to create unambiguous content that supports meaningful, interactive experiences." Here, she also provided the analogy that "content strategy is to copywriting as information architecture is to design." She encourages content strategists and collaborators to engage in early discussions about content meaning, models, and tools, to make sure strategy is integrated from the start rather than as an afterthought. The Content Strategy Alliance combines Kevin Nichols' definition with Kristina Halvorson's and defines content strategy as "getting the right content to the right user at the right time through strategic planning of content creation, delivery, and governance." == Practitioners == Content strategists are often familiar with a wide range of approaches, techniques, and tools. The perspectives that content strategists bring also depend heavily on their professional training and education. For instance, some specialize in "front-end strategy," which includes developing personas, journey mapping the user experience, aligning business strategy and user needs, developing a brand strategy, exploring different channels, and creating style guidelines and search engine optimization (SEO) guidelines. Others specialize in "back-end strategy," which includes creating content models, planning taxonomies and metadata, structuring content management systems, and building systems to support content reuse. Both roles involve addressing workflow and governance issues. Many organizations and individuals tend to confuse content strategists with editors. However, content strategy is "about more than just the written word," according to Washington State University associate professor Brett Atwood. For example, Atwood indicates that a practitioner needs to also "consider how content might be re-distributed and/or re-purposed in other channels of delivery." It has also been proposed that the content strategist performs the role of a curator. Just as a museum curator sifts through a collection of content and identifies key pieces that can be juxtaposed against each other to create meaning and spur excitement, a content strategist "must approach a business’s content as a medium that needs to be strategically selected and placed to engage the audience, convey a message, and inspire action."

    Read more →
  • IAmAnas

    IAmAnas

    #IAmAnas (I Am Anas) is a Twitter hashtag and social media campaign that started in 2015. Users tweeted to express support for the undercover investigative works of Ghanaian journalist Anas Aremeyaw Anas. The campaign restarted in 2018 when the Ghanaian MP and financier of the New Patriotic Party, Kennedy Agyapong, announced his intention to reveal the identity of Anas following the journalist's exposé of corruption at the Ghana Football Association. Anas maintains that "being anonymous has always been his secret weapon." Pictures purported to be of Anas were first released by a TV station owned by Agyapong, and were quickly picked up by other media houses. At least one person, a Dutch-Brazilian model, has claimed ownership of one picture that was released, and has threatened legal action against Agyapong for possibly putting his life in danger. In response to Agyapong, social media users retweeted photos of themselves, random people, or even comic images of entities that resemble the trademark covered face of Anas. When the hashtag first began in 2015, along with other popular uses of the journalist's name, Elizabeth Ohene wrote an article about Ghanaians use of humour in response to dealing with the expose of government corruption. "I do not know when these words will make it into Wikipedia or the Oxford English Dictionary but for the moment you can take it from me that: To go undercover is to anas, to make secret recordings is to anas-anas, to wear disguises is to do an anas, to be caught in the act is to be anased. To have someone exposed taking bribes is to have that person being given the full Anas Aremeyaw Anas."

    Read more →