CloudSim

CloudSim is a framework for modeling and simulating cloud computing infrastructures and services. Originally built primarily at the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia, CloudSim has become one of the most popular open-source cloud simulators in research and academia. CloudSim is written entirely in Java. The latest version of CloudSim is v6.0.0-beta, available on GitHub. CloudSim is suitable for implementing simulation scenarios based on Infrastructure as a Service and, as of the latest version, Platform as a Service. == CloudSim extensions == Initially developed as a stand-alone cloud simulator, CloudSim has since been extended by independent researchers. GPUCloudSim is an enhanced CloudSim tool for modeling GPU-based cloud infrastructures and data centers. It offers simulations for multi-GPU setups, customizable GPU policies, GPU remoting, etc. It also examines performance impacts and interactions within virtualized GPU environments. CloudSim Plus is a fully re-engineered CloudSim fork providing general-purpose cloud computing simulation and exclusive features such as multi-cloud simulations, vertical and horizontal VM scaling, host fault injection and recovery, and joint power- and network-aware simulations. Though CloudSim itself does not have a graphical user interface, extensions such as CloudReports offer a GUI for CloudSim simulations. CloudSimEx extends CloudSim by adding MapReduce simulation capabilities and parallel simulations. Cloud2Sim extends CloudSim to execute on multiple distributed servers by leveraging the Hazelcast distributed execution framework. RECAP DES extends the CloudSim Plus framework to model synchronous hierarchical architectures (such as ElasticSearch). ThermoSim extends the CloudSim toolkit by incorporating thermal characteristics, and uses a deep learning-based temperature predictor for cloud nodes.

Organoid intelligence

Organoid intelligence (OI) is an emerging field of study in computer science and biology that develops and studies biological wetware computing using 3D cultures of human brain cells (or brain organoids) and brain-machine interface technologies. Such technologies may be referred to as OIs or the nervous filesystem. Organoid intelligent computer systems can be an example of biohybrid systems. == Differences with non-organic computing == As opposed to traditional non-organic silicon-based approaches, OI seeks to use lab-grown cerebral organoids to serve as "biological hardware". While these structures are still far from being able to think like a regular human brain and do not yet possess strong computing capabilities, OI research currently offers the potential to improve the understanding of brain development, learning and memory, potentially finding treatments for neurological disorders such as dementia. Thomas Hartung, a professor at Johns Hopkins University, argued in 2023 that "while silicon-based computers are certainly better with numbers, brains are better at learning." He noted that transistor density in computer chips may be approaching its limits, whereas brains, being wired differently, are more energy-efficient and can store large amounts of information. Some researchers claim that even though human brains are slower than machines at processing simple information, they are far better at processing complex information: brains can deal with fewer and more uncertain data, perform both sequential and parallel processing, are highly heterogeneous, can use incomplete datasets, and are said to outperform non-organic machines in decision-making. Training OIs involves the process of biological learning (BL), as opposed to machine learning (ML) for AIs. == Bioinformatics in OI == OI generates complex biological data, necessitating sophisticated methods for processing and analysis. 
Bioinformatics provides the tools and techniques to decipher raw data, uncovering the patterns and insights. Researchers have developed a platform named Neuroplatform for experimenting remotely with brain organoids via an API. == Intended functions == Brain-inspired computing hardware aims to emulate the structure and working principles of the brain and could be used to address current limitations in AI technologies. However, brain-inspired silicon chips are still limited in their ability to fully mimic brain function, as most examples are built on digital electronic principles. One study performed OI computation (which they termed Brainoware) by sending and receiving information from the brain organoid using a high-density multielectrode array. By applying spatiotemporal electrical stimulation, nonlinear dynamics, and fading memory properties, as well as unsupervised learning from training data by reshaping the organoid functional connectivity, the study showed the potential of this technology by using it for speech recognition and nonlinear equation prediction in a reservoir computing framework. == Ethical concerns == While researchers are hoping to use OI and biological computing to complement traditional silicon-based computing, there are also questions about the ethics of such an approach. Concerns include the possibility that an organoid could develop sentience or consciousness, and the question of the relationship between a stem cell donor (for growing the organoid) and the respective OI system.

Comet (programming)

Comet is a web application model in which a long-held HTTP request allows a web server to push data to a browser, without the browser explicitly requesting it. Comet is an umbrella term, encompassing multiple techniques for achieving this interaction. All these methods rely on features included by default in browsers, such as JavaScript, rather than on non-default plugins. The Comet approach differs from the original model of the web, in which a browser requests a complete web page at a time. The use of Comet techniques in web development predates the use of the word Comet as a neologism for the collective techniques. Comet is known by several other names, including Ajax Push, Reverse Ajax, Two-way-web, HTTP Streaming, and HTTP server push, among others. The term Comet is not an acronym; it was coined by Alex Russell in a 2006 blog post. In recent years, the standardisation and widespread support of WebSocket and Server-sent events have rendered the Comet model obsolete. == History == === Early Java applets === The ability to embed Java applets into browsers (starting with Netscape Navigator 2.0 in March 1996) made two-way sustained communications possible, using a raw TCP socket to communicate between the browser and the server. This socket can remain open as long as the browser is at the document hosting the applet. Event notifications can be sent in any format – text or binary – and decoded by the applet. === The first browser-to-browser communication framework === The very first application using browser-to-browser communications was Tango Interactive, implemented in 1996–98 at the Northeast Parallel Architectures Center (NPAC) at Syracuse University using DARPA funding. The TANGO architecture was patented by Syracuse University, and the TANGO framework has been used extensively as a distance education tool. 
The framework has been commercialized by CollabWorx and used in a dozen or so command-and-control and training applications in the United States Department of Defense. === First Comet applications === The first set of Comet implementations dates back to 2000, with the Pushlets, Lightstreamer, and KnowNow projects. Pushlets, a framework created by Just van den Broecke, was one of the first open source implementations. Pushlets were based on server-side Java servlets and a client-side JavaScript library. Bang Networks – a Silicon Valley start-up backed by Netscape co-founder Marc Andreessen – made a lavishly financed attempt to create a real-time push standard for the entire web. In April 2001, Chip Morningstar began developing a Java-based (J2SE) web server which used two HTTP sockets to keep open two communications channels between the custom HTTP server he designed and a client designed by Douglas Crockford; a functioning demo system existed as of June 2001. The server and client used a messaging format that the founders of State Software, Inc. agreed to name JSON, following Crockford's suggestion. The entire system (the client libraries, the JSON messaging format and the server) became the State Application Framework, parts of which were sold to and used by Sun Microsystems, Amazon.com, EDS and Volkswagen. In March 2006, software engineer Alex Russell coined the term Comet in a post on his personal blog. The new term was a play on Ajax (Ajax and Comet both being common household cleaners in the USA). In 2006, some applications exposed those techniques to a wider audience: Meebo’s multi-protocol web-based chat application enabled users to connect to AOL, Yahoo, and Microsoft chat platforms through the browser; Google added web-based chat to Gmail; JotSpot, a startup since acquired by Google, built Comet-based real-time collaborative document editing. 
New Comet variants were created, such as the Java-based ICEfaces JSF framework (although they prefer the term "Ajax Push"). Others that had previously used Java-applet based transports switched instead to pure-JavaScript implementations. == Implementations == Comet applications attempt to eliminate the limitations of the page-by-page web model and traditional polling by offering two-way sustained interaction, using a persistent or long-lasting HTTP connection between the server and the client. Since browsers and proxies are not designed with server events in mind, several techniques to achieve this have been developed, each with different benefits and drawbacks. The biggest hurdle is the HTTP 1.1 specification, which states "this specification... encourages clients to be conservative when opening multiple connections". Therefore, holding one connection open for real-time events has a negative impact on browser usability: the browser may be blocked from sending a new request while waiting for the results of a previous request, e.g., a series of images. This can be worked around by creating a distinct hostname for real-time information, which is an alias for the same physical server. This strategy is an application of domain sharding. Specific methods of implementing Comet fall into two major categories: streaming and long polling. === Streaming === An application using streaming Comet opens a single persistent connection from the client browser to the server for all Comet events. These events are incrementally handled and interpreted on the client side every time the server sends a new event, with neither side closing the connection. Specific techniques for accomplishing streaming Comet include the following: ==== Hidden iframe ==== A basic technique for dynamic web application is to use a hidden iframe HTML element (an inline frame, which allows a website to embed one HTML document inside another). 
This invisible iframe is sent as a chunked block, which implicitly declares it as infinitely long (sometimes called "forever frame"). As events occur, the iframe is gradually filled with script tags, containing JavaScript to be executed in the browser. Because browsers render HTML pages incrementally, each script tag is executed as it is received. Some browsers require a specific minimum document size before parsing and execution is started, which can be obtained by initially sending 1–2 kB of padding spaces. One benefit of the iframes method is that it works in every common browser. Two downsides of this technique are the lack of a reliable error handling method, and the impossibility of tracking the state of the request calling process. ==== XMLHttpRequest ==== The XMLHttpRequest (XHR) object, a tool used by Ajax applications for browser–server communication, can also be pressed into service for server–browser Comet messaging by generating a custom data format for an XHR response, and parsing out each event using browser-side JavaScript; relying only on the browser firing the onreadystatechange callback each time it receives new data. === Ajax with long polling === None of the above streaming transports work across all modern browsers without negative side-effects. This forces Comet developers to implement several complex streaming transports, switching between them depending on the browser. Consequently, many Comet applications use long polling, which is easier to implement on the browser side, and works, at minimum, in every browser that supports XHR. As the name suggests, long polling requires the client to poll the server for an event (or set of events). The browser makes an Ajax-style request to the server, which is kept open until the server has new data to send to the browser, which is sent to the browser in a complete response. The browser initiates a new long polling request in order to obtain subsequent events. 
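The long-polling cycle described above can be sketched without a network stack. The following is a minimal in-process model, not a real HTTP implementation: `LongPollServer`, `long_poll_client` and `demo` are hypothetical names introduced for illustration, and a blocking queue stands in for the held-open request.

```python
import queue
import threading


class LongPollServer:
    """Hypothetical in-process stand-in for a Comet server (no real HTTP)."""

    def __init__(self):
        self._events = queue.Queue()

    def publish(self, event):
        # Called by the application when something happens on the server.
        self._events.put(event)

    def poll(self, timeout):
        # Hold the "request" open until an event arrives or the timeout expires.
        try:
            first = self._events.get(timeout=timeout)
        except queue.Empty:
            return []  # empty response; the client simply polls again
        batch = [first]
        # Drain anything already queued so one response can carry several events.
        while True:
            try:
                batch.append(self._events.get_nowait())
            except queue.Empty:
                return batch


def long_poll_client(server, expected, timeout=1.0):
    """Re-issue a poll after every response until `expected` events are seen."""
    received = []
    while len(received) < expected:
        received.extend(server.poll(timeout))
    return received


def demo():
    server = LongPollServer()
    # Publish three events from another thread, as a server-side producer would.
    producer = threading.Thread(
        target=lambda: [server.publish(f"event-{n}") for n in range(3)]
    )
    producer.start()
    events = long_poll_client(server, expected=3)
    producer.join()
    return events
```

Each call to `poll` corresponds to one complete request/response pair; the client loop always keeps one such call outstanding, which is the defining property of long polling.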
IETF RFC 6202 "Known Issues and Best Practices for the Use of Long Polling and Streaming in Bidirectional HTTP" compares long polling and HTTP streaming. Specific technologies for accomplishing long-polling include the following: ==== XMLHttpRequest long polling ==== For the most part, XMLHttpRequest long polling works like any standard use of XHR. The browser makes an asynchronous request of the server, which may wait for data to be available before responding. The response can contain encoded data (typically XML or JSON) or Javascript to be executed by the client. At the end of the processing of the response, the browser creates and sends another XHR, to await the next event. Thus the browser always keeps a request outstanding with the server, to be answered as each event occurs. ==== Script tag long polling ==== While any Comet transport can be made to work across subdomains, none of the above transports can be used across different second-level domains (SLDs), due to browser security policies designed to prevent cross-site scripting attacks. That is, if the main web page is served from one SLD, and the Comet server is located at another SLD (which does not have cross-origin resource sharing enabled), Comet events cannot be used to modify the HTML and DOM of the main page, using those transports. This problem can be sidestepped by creating a proxy server in

Motion picture film scanner

A motion picture film scanner is a device used in digital filmmaking to scan original film for storage as high-resolution digital intermediate files. A film scanner scans original film stock: negative or positive print or reversal/IP. Units may scan gauges from 8 mm to 70 mm (8 mm, Super 8, 9.5 mm, 16 mm, Super 16, 35 mm, Super 35, 65 mm and 70 mm) at very high resolutions of 2K, 4K, 8K, or 16K (2K is approximately 2048×1080 pixels and 4K is approximately 4096×2160 pixels). Some models of film scanner are intermittent pull-down film scanners, which scan each frame individually while it is locked down in a pin-registered film gate, taking roughly a second per frame. Continuous-scan film scanners, in which the film frames are scanned as the film moves continuously past the imaging pickup device, typically evolved from earlier telecine mechanisms and can act as telecines at lower resolutions. The scanner scans the film frames into a file sequence (using high-end computer data storage devices) in which each file contains a digital scan of one still frame; the preferred output image file formats are usually Cineon, DPX or TIFF, because they can store color information as raw data, preserving the optical characteristics of the film stock. These systems take a lot of storage area network (SAN) disk space. The files can be played back one after another on a high-end workstation non-linear editing system (NLE) or a virtual telecine system. Playback is at the normal rate of 24 frames per second (or at the original projection frame rate: 25, 30 or another speed). Each year hard disks get larger and are able to hold more hours of movies on SAN systems. The challenge is to archive this massive amount of data onto data storage devices. The scanned footage is edited and composited on workstations, then mastered back to film (see film-out and digital intermediate). Scanned film frames may also be used in digital film restoration. 
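To give a sense of scale for the storage demands mentioned above, the following rough estimate uses the 2K and 4K frame sizes quoted in the text. It assumes 10-bit RGB samples packed into one 32-bit word per pixel (a common DPX layout) and ignores file headers; the function names are hypothetical and the figures are approximations, not vendor specifications.

```python
def frame_bytes(width, height, bytes_per_pixel=4):
    """Approximate uncompressed image payload of one scanned frame.

    Assumes 10-bit RGB packed into one 32-bit word per pixel; headers ignored.
    """
    return width * height * bytes_per_pixel


def footage_bytes(width, height, fps, seconds):
    """Approximate payload for `seconds` of footage at `fps` frames per second."""
    return frame_bytes(width, height) * fps * seconds


TWO_K = (2048, 1080)
FOUR_K = (4096, 2160)

if __name__ == "__main__":
    for name, (w, h) in (("2K", TWO_K), ("4K", FOUR_K)):
        per_frame = frame_bytes(w, h) / 1e6
        per_hour = footage_bytes(w, h, fps=24, seconds=3600) / 1e12
        print(f"{name}: {per_frame:.1f} MB per frame, {per_hour:.2f} TB per hour at 24 fps")
```

Under these assumptions a 4K frame is about 35 MB, so one hour of 4K footage at 24 frames per second needs roughly 3 TB before any audio or metadata is counted.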
The film may also be projected directly on a digital projector in the theater. The film data files may be converted for SDTV (NTSC or PAL) video systems. Film recorders are the opposite of film scanners, copying content from a computer system onto film stock, for preservation or for display using film projectors. Telecines are similar to film scanners. == Imaging device == The front end of a motion picture film scanner is similar to a telecine. The imaging system may be a charge-coupled device (CCD), a complementary metal–oxide–semiconductor (CMOS) sensor, or a photomultiplier imaging pickup. A lamp is used as the light source in a CCD imaging front end. The CCDs convert the light to video signals. In a cathode-ray tube (CRT) imaging system, the CRT (also called a flying spot tube) is used as the light source and as part of the scanning system. Photomultipliers or avalanche photodiodes are used to convert the light to electrical video signals. A prism and/or dichroic mirrors or color filters are used to separate the light into red, green and blue for the three imaging pickup devices. == Image processing == The three color signals (RGB) are electronically processed and color graded. A 3D look-up table (3D LUT) is usually applied to the RGB values before they are coded into the DPX output files. The DPX files are usually output over a network cable or an optical fiber port: HIPPI, Fibre Channel or newer systems like gigabit Ethernet. A computer then stores the files on the hard drives of a storage area network for later processing and use. Modern motion picture film scanners may have an option for an infrared CCD channel for dirt mapping, which can be used to remove dirt and dust spots on the film automatically, or manually in post-production. The IR channel can be used with an IR dirt and scratch removal system, or output as a fourth (IR) channel for downstream dirt and scratch removal systems. 
Popular downstream dirt and scratch removal systems are PF Clean and Digital ICE. High dynamic range (HDR) scanning is a newer technique that uses compositing and tone mapping of images to extend the dynamic range beyond the native capability of the capturing device. This may be done by taking a triple exposure of the film and then compositing the three exposures back together. Compositing can be done on a workstation in non-real time or in the scanner in real time.
== Models ==
Single frame intermittent pull-down:
* ARRI - Arriscan
* Cintel - diTTo
* Filmlight - Northlight 1 (up to 6K, 16mm to VistaVision), Northlight 2 (up to 8K, 16mm to VistaVision)
* Imagica scanner, single frame intermittent scanner.
* Kodak - Cineon, the first system designed for DI work; it included a scanner, tape drives, workstations and a film recorder.
* Lasergraphics Director (13.5K, 8mm to 70mm, IMAX & VistaVision)
Continuous motion scanning:
* Arri - ARRISCAN XT (up to 6K, S35 down to 16mm)
* Cintel's C-Reality/DSX and ITK - Millennium/dataMill. Under ownership of Blackmagic Design, the Cintel Scanner was released, with the current 3rd generation capable of up to 4K scans at 30 fps.
* DFT - Spirit Classic (up to 2K), Spirit 4k/2k/HD (up to 4K), POLAR HQ (up to 8K, 8mm to S35), OXScan 14K (up to 14K, 16mm to 70mm), Scanity HDR (up to 4K, 16mm to S35)
* Filmfabriek - HDS+ (up to 4k), Pictor Pro (up to 2.7K), Pictor (up to 1080p). Filmfabriek scanners can only scan 17.5mm or smaller film formats.
* GE4 - Golden Eye Four film scanner from Digital Vision, with a 38-megapixel camera, an LED light source and continuous capstan film transport.
* Lasergraphics ScanStation (6.5K, 8mm to 70mm, IMAX & VistaVision)
* Lasergraphics Archivist (up to 5K)
* MWA Nova Vario series, with patented laser-based, sprocket- and claw-free transport for 16/35mm, for real-time (24/25fps) scanning with sensors for either 2K+ (2236 x 1752) or 2.5K+ HDR (2560 x 2160), and direct optical and magnetic sound on 16 and 35mm.
* MWA Nova Choice 2K+, with patented laser-based, sprocket- and claw-free transport for 8/Super8, 9.5mm and 16mm, for real-time (24/25fps) scanning at 2K+ (2236 x 1752), with direct optical and magnetic sound on 16mm and magnetic sound from the main and balance stripes on 8 and Super8. Faster-than-real-time scanning is available at lower resolution.
* P+S Technik - SteadyFrame Universal Format Film Scanner
* Walde - FilmStar 4K UHD: 2K at 25fps, 4K UHD at 6fps; 35mm/16mm/8mm, archive quality, continuous motion, capstan driven.

Quality of experience

Quality of experience (QoE) is a measure of the delight or annoyance of a customer's experiences with a service (e.g., web browsing, phone call, TV broadcast). QoE focuses on the entire service experience; it is a holistic concept, similar to the field of user experience, but with its roots in telecommunication. QoE is an emerging multidisciplinary field based on social psychology, cognitive science, economics, and engineering science, focused on understanding overall human quality requirements. == Definition and concepts == In 2013, within the context of the COST Action QUALINET, QoE was defined as: "The degree of delight or annoyance of the user of an application or service. It results from the fulfillment of his or her expectations with respect to the utility and / or enjoyment of the application or service in the light of the user’s personality and current state." This definition was adopted in 2016 by the International Telecommunication Union in Recommendation ITU-T P.10/G.100. Various definitions of QoE existed in the domain before then; the definition above has now found wide acceptance in the community. QoE has historically emerged from Quality of Service (QoS), which attempts to objectively measure service parameters (such as packet loss rates or average throughput). QoS measurement is most of the time not related to a customer, but to the media or network itself. QoE, however, is a purely subjective measure, from the user's perspective, of the overall quality of the service provided, capturing people's aesthetic and hedonic needs. QoE looks at a vendor's or purveyor's offering from the standpoint of the customer or end user, and asks, "What mix of goods, services, and support, do you think will provide you with the perception that the total product is providing you with the experience you desired and/or expected?" It then asks, "Is this what the vendor/purveyor has actually provided?" 
If not, "What changes need to be made to enhance your total experience?" In short, QoE provides an assessment of human expectations, feelings, perceptions, cognition and satisfaction with respect to a particular product, service or application. QoE is a blueprint of all human subjective and objective quality needs and experiences arising from the interaction of a person with technology and with business entities in a particular context. Although QoE is perceived as subjective, it is an important measure that counts for customers of a service. Being able to measure it in a controlled manner helps operators understand what may be wrong with their services and how to improve them. == QoE factors == QoE aims at taking into consideration every factor that contributes to a user's perceived quality of a system or service. This includes system, human and contextual factors. The following so-called "influence factors" have been identified and classified by Reiter et al.:
* Human Influence Factors
** Low-level processing (visual and auditory acuity, gender, age, mood, …)
** Higher-level processing (cognitive processes, socio-cultural and economic background, expectations, needs and goals, other personality traits…)
* System Influence Factors
** Content-related
** Media-related (encoding, resolution, sample rate, …)
** Network-related (bandwidth, delay, jitter, …)
** Device-related (screen resolution, display size, …)
* Context Influence Factors
** Physical context (location and space)
** Temporal context (time of day, frequency of use, …)
** Social context (inter-personal relations during experience)
** Economic context
** Task context (multitasking, interruptions, task type)
** Technical and information context (relationship between systems)
Studies in the field of QoE have typically focused on system factors, primarily due to its origin in the QoS and network engineering domains. Through the use of dedicated test laboratories, the context is often sought to be kept constant. 
== QoE versus User Experience == QoE is strongly related to but different from the field of User Experience (UX), which also focuses on users' experiences with services. Historically, QoE has emerged from telecommunication research, while UX has its roots in Human–Computer Interaction. Both fields can be considered multi-disciplinary. In contrast to UX, the goal of improving QoE for users was more strongly motivated by economic needs. Wechsung and De Moor identify several key differences between the fields. == QoE measurement == As a measure of the end-to-end performance at the service level from the user's perspective, QoE is an important metric for the design of systems and engineering processes. This is particularly relevant for video services because, owing to their high traffic demands, poor network performance may strongly affect the user's experience. When designing systems, the expected output, i.e. the expected QoE, is therefore often taken into account, both as a system output metric and as an optimization goal. To measure this level of QoE, human ratings can be used. The mean opinion score (MOS) is a widely used measure for assessing the quality of media signals. It is a limited form of QoE measurement, relating to a specific media type, in a controlled environment and without explicitly taking into account user expectations. The MOS as an indicator of experienced quality has been used for audio and speech communication, as well as for the assessment of quality of Internet video, television and other multimedia signals, and web browsing. Due to inherent limitations in measuring QoE in a single scalar value, the usefulness of the MOS is often debated. Subjective quality evaluation requires a lot of human resources, which makes it a time-consuming process. Objective evaluation methods can provide quality results faster, but require dedicated computing resources. 
Since such instrumental video quality algorithms are often developed based on a limited set of subjective data, their QoE prediction accuracy may be low when compared to human ratings. QoE metrics are often measured at the end devices and can conceptually be seen as the remaining quality after the distortion introduced during the preparation of the content and the delivery through the network, until it reaches the decoder at the end device. There are several elements in the media preparation and delivery chain, and some of them may introduce distortion. This causes degradation of the content, and several elements in this chain can be considered as "QoE-relevant" for the offered services. The causes of degradation are applicable for any multimedia service, that is, not exclusive to video or speech. Typical degradations occur at the encoding system (compression degradation), transport network, access network (e.g., packet loss or packet delay), home network (e.g. WiFi performance) and end device (e.g. decoding performance). == QoE management == Several QoE-centric network management and bandwidth management solutions have been proposed, which aim to improve the QoE delivered to the end-users. When managing a network, QoE fairness may be taken into account in order to keep the users sufficiently satisfied (i.e., high QoE) in a fair manner. From a QoE perspective, network resources and multimedia services should be managed in order to guarantee specific QoE levels instead of classical QoS parameters, which are unable to reflect the actual delivered QoE. A pure QoE-centric management is challenged by the nature of the Internet itself, as the Internet protocols and architecture were not originally designed to support today's complex and high demanding multimedia services. As an example for an implementation of QoE management, network nodes can become QoE-aware by estimating the status of the multimedia service as perceived by the end-users. 
This information can then be used to improve the delivery of the multimedia service over the network and proactively improve the users' QoE. This can be achieved, for example, via traffic shaping. QoE management gives the service provider and network operator the capability to minimize storage and network resources by allocating only the resources that are sufficient to maintain a specific level of user satisfaction. As it may involve limiting resources for some users or services in order to increase the overall network performance and QoE, the practice of QoE management requires that net neutrality regulations are considered.
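The mean opinion score discussed under QoE measurement is, at its core, an arithmetic mean of individual ratings. The sketch below is illustrative only: it assumes a five-point rating scale (1 = bad, 5 = excellent), uses a normal approximation for the confidence interval, and the function name `mean_opinion_score` is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings, scale=(1, 5)):
    """Return (MOS, 95% confidence half-width) for a list of ratings.

    The interval uses a normal approximation (z = 1.96), which is a
    simplifying assumption for small rating panels.
    """
    low, high = scale
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < low or r > high for r in ratings):
        raise ValueError(f"ratings must lie in [{low}, {high}]")
    mos = statistics.fmean(ratings)
    if len(ratings) < 2:
        # A spread cannot be estimated from a single rating.
        return mos, float("nan")
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


if __name__ == "__main__":
    mos, ci = mean_opinion_score([4, 5, 3, 4, 4, 2, 5, 4])
    print(f"MOS = {mos:.3f} +/- {ci:.3f}")
```

Reporting the interval alongside the mean makes the limitation noted in the text visible: a single scalar hides how much the individual ratings disagree.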

PagedAttention

PagedAttention is an attention algorithm for efficient serving of large language models (LLMs). It was introduced in 2023 by Woosuk Kwon and colleagues in the paper Efficient Memory Management for Large Language Model Serving with PagedAttention, alongside the vLLM serving engine. The method stores the key–value cache used during autoregressive decoding in fixed-size blocks that can be mapped to non-contiguous physical memory, borrowing ideas from virtual memory, paging, and operating system design. == Background == In transformer inference, the key–value cache grows with sequence length and the number of concurrent requests. Kwon et al. argued that earlier serving systems typically reserved contiguous cache regions in advance, which caused reserved space, internal fragmentation, and external fragmentation. In their experiments, the paper reported that the effective memory utilization of previous systems could fall as low as 20.4%. == Description == PagedAttention partitions the cache of each sequence into fixed-size KV blocks. A request's cache is represented as a sequence of logical blocks, while a block table maps those logical blocks to physical GPU-memory blocks. As a result, neighboring logical blocks do not need to be contiguous in physical memory, and new blocks can be allocated on demand as generation proceeds. The design also makes it easier to share cache state across related decoding paths. In vLLM, physical blocks can be reference-counted and shared among requests or branches, with block-granularity copy-on-write used when a shared block must be modified. The original paper applied this design to parallel sampling, beam search, and prompts with shared prefixes. 
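The block table, on-demand allocation and copy-on-write sharing described above can be sketched as plain bookkeeping. This is a toy model under stated assumptions, not vLLM's implementation: `BlockAllocator` and `Sequence` are hypothetical names, blocks are represented only by integer ids, and the actual copying of key–value data is omitted.

```python
class BlockAllocator:
    """Hands out physical block ids and tracks reference counts."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        self.refcount[block] += 1

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class Sequence:
    """One request: block_table[i] is the physical id of logical block i."""

    def __init__(self, allocator, block_size):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.block_size == 0:
            # No block yet, or the last block is full: allocate on demand.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: the block is shared, so switch to a private one.
                # (A real system would also copy the block's KV data here.)
                self.allocator.release(last)
                self.block_table[-1] = self.allocator.allocate()
        self.num_tokens += 1

    def fork(self):
        # A new decoding branch shares every existing block with its parent.
        child = Sequence(self.allocator, self.block_size)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in child.block_table:
            self.allocator.share(block)
        return child
```

Forking a sequence costs no extra blocks; a new physical block is consumed only when one branch writes into a partially filled block that the other branch still references.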
== Mathematical formulation == For a query token $i$ in causal self-attention, the standard attention output can be written as

$$a_{ij} = \frac{\exp(\mathbf{q}_i^{\top}\mathbf{k}_j/\sqrt{d})}{\sum_{t=1}^{i}\exp(\mathbf{q}_i^{\top}\mathbf{k}_t/\sqrt{d})},\quad \mathbf{o}_i = \sum_{j=1}^{i} a_{ij}\mathbf{v}_j$$

where $\mathbf{q}_i$, $\mathbf{k}_j$, and $\mathbf{v}_j$ are the query, key, and value vectors, and $d$ is the attention dimension. If the cache is partitioned into blocks of size $B$, the key and value blocks may be written as

$$\mathbf{K}_j = (\mathbf{k}_{(j-1)B+1}, \ldots, \mathbf{k}_{jB}),\quad \mathbf{V}_j = (\mathbf{v}_{(j-1)B+1}, \ldots, \mathbf{v}_{jB})$$

PagedAttention then performs the computation blockwise:

$$\mathbf{A}_{ij} = \frac{\exp(\mathbf{q}_i^{\top}\mathbf{K}_j/\sqrt{d})}{\sum_{t=1}^{\lceil i/B\rceil}\exp(\mathbf{q}_i^{\top}\mathbf{K}_t/\sqrt{d})},\quad \mathbf{o}_i = \sum_{j=1}^{\lceil i/B\rceil}\mathbf{V}_j\mathbf{A}_{ij}^{\top}$$

where $\mathbf{A}_{ij}$ is the vector of attention scores for the $j$-th KV block. In the formulation given by Kwon et al., this preserves the causal attention calculation while allowing the key and value blocks to reside in non-contiguous physical memory. 
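The claim that the blockwise computation preserves the standard attention result can be checked numerically. The sketch below is a pure-Python illustration for a single query vector, assuming the softmax normaliser is the sum of exponentiated scores over every token in every block; the function names, the dictionary-based block storage and the example block ids are hypothetical, and the usual max-subtraction for numerical stability is omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def standard_attention(q, keys, values):
    """Softmax attention of one query over contiguous key/value lists."""
    d = len(q)
    weights = [math.exp(dot(q, k) / math.sqrt(d)) for k in keys]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for c, x in enumerate(v):
            out[c] += (w / total) * x
    return out


def paged_attention(q, block_table, key_blocks, value_blocks):
    """Same result, reading KV vectors block by block through a block table.

    key_blocks / value_blocks map a physical block id to the vectors stored in
    that block; block_table lists the physical ids in logical order.
    """
    d = len(q)
    first_block = value_blocks[block_table[0]]
    numerator = [0.0] * len(first_block[0])
    denominator = 0.0
    for physical_id in block_table:
        for k, v in zip(key_blocks[physical_id], value_blocks[physical_id]):
            w = math.exp(dot(q, k) / math.sqrt(d))
            denominator += w
            for c, x in enumerate(v):
                numerator[c] += w * x
    return [n / denominator for n in numerator]


if __name__ == "__main__":
    keys = [[0.1 * t, 0.5 - 0.2 * t] for t in range(5)]
    values = [[float(t), 1.0] for t in range(5)]
    q = [0.3, -0.7]
    # Block size 2; physical ids are deliberately out of order.
    table = [7, 2, 5]
    kb = {7: keys[0:2], 2: keys[2:4], 5: keys[4:5]}
    vb = {7: values[0:2], 2: values[2:4], 5: values[4:5]}
    print(standard_attention(q, keys, values))
    print(paged_attention(q, table, kb, vb))
```

Because the physical ids only affect where each block is read from, any permutation of physical placement gives the same output as long as the block table preserves the logical order.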
== Performance and use ==
The vLLM paper reported that, on its evaluated workloads, the use of PagedAttention and the associated memory-management design improved serving throughput by 2–4× over the compared baselines, including FasterTransformer and Orca, while preserving model outputs. In experiments on OPT-13B with the Alpaca trace, the paper also reported memory savings of 6.1–9.8% for parallel sampling and 37.6–55.2% for beam search through KV-block sharing. A 2024 survey of LLM serving systems described PagedAttention as having become an industry norm in LLM serving frameworks, citing support in TGI, vLLM, and TensorRT-LLM.

== Limitations and alternatives ==
Subsequent work has described trade-offs in the approach. The 2025 vAttention paper argued that PagedAttention requires attention kernels to be rewritten to support paging and increases software complexity, portability issues, redundancy, and execution overhead, proposing instead a memory manager that keeps the cache contiguous in virtual memory while relying on demand paging for physical allocation.

=== vAttention ===
Unlike PagedAttention, vAttention does not introduce a different attention rule; it retains the standard attention computation

$$\operatorname{Attention}(q_i, K, V) = \operatorname{softmax}\left(\frac{q_i K^\top}{\mathrm{scale}}\right) V.$$

In the notation of Prabhu et al., the key and value tensors for a request seen so far are $K, V \in \mathbb{R}^{L' \times (H \times D)}$, where $L'$ is the context length seen so far, $H$ is the number of KV heads on a worker, and $D$ is the dimension of each KV head.
In systems prior to PagedAttention, the K cache (or V cache) at each layer of a worker is typically allocated as a 4D tensor of shape $[B, L, H, D]$, where $B$ is the batch size (not the KV block size denoted $B$ in the PagedAttention formulation) and $L$ is the maximum context length supported by the model. vAttention preserves this contiguous virtual-memory view while deferring physical-memory allocation to runtime. A serving framework maintains separate K and V tensors for each layer, so vAttention reserves $2N$ virtual-memory buffers on a worker, where $N$ is the number of layers managed by that worker. The maximum size of one virtual-memory buffer is

$$BS = B \times S,$$

where $S$ is the maximum size of a single request's per-layer K cache (or V cache) on a worker. The paper defines

$$S = L \times H \times D \times P,$$

where $P$ is the number of bytes needed to store one element. In this formulation, vAttention keeps the KV cache contiguous in virtual memory and relies on demand paging for physical allocation, rather than modifying the attention kernel to operate over non-contiguous KV-cache blocks.
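The sizing formulas above can be checked with a short worked example in Python. All numeric values below are hypothetical and chosen only so that the arithmetic comes out in round units; they do not describe any particular model or the configurations evaluated by Prabhu et al. The total figure is simply the product of the $2N$ buffers and the maximum buffer size $BS$ given in the text.

```python
# Hypothetical configuration (assumed values, for illustration only).
L = 32768   # maximum context length supported by the model, in tokens
H = 8       # KV heads on this worker
D = 128     # dimension of each KV head
P = 2       # bytes per element (e.g. a 16-bit float)
B = 16      # batch size
N = 32      # layers managed by this worker

S = L * H * D * P            # max per-layer K (or V) cache of one request, bytes
BS = B * S                   # max size of one virtual-memory buffer, bytes
num_buffers = 2 * N          # one K buffer and one V buffer per layer
total_virtual = num_buffers * BS

MiB, GiB = 2**20, 2**30
print(f"S             = {S // MiB} MiB")
print(f"BS            = {BS // GiB} GiB")
print(f"buffers (2N)  = {num_buffers}")
print(f"total virtual = {total_virtual // GiB} GiB")
```

With these assumed values, $S$ is 64 MiB, each buffer can grow to 1 GiB, and the 64 buffers together reserve 64 GiB of virtual address space. Only the virtual range is reserved up front; under the scheme described above, physical memory is attached at runtime as requests actually grow.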

Verge3D

Verge3D is a real-time renderer and a toolkit used for creating interactive 3D experiences running on websites.

== Overview ==
Verge3D enables users to convert content from 3D modelling tools (Blender, 3ds Max, and Maya are currently supported) for viewing in a web browser. Verge3D was created by the same core group of software engineers that previously created the Blend4Web framework.

== Features ==
Verge3D uses WebGL for rendering. It incorporates components of the Three.js library and exposes its API to application developers.

=== Puzzles ===
Application functionality can be added via JavaScript, either by writing code directly or by using Puzzles, Verge3D’s visual programming environment based on Google Blockly. Puzzles is aimed primarily at non-programmers, allowing quick creation of interactive scenarios in a drag-and-drop fashion.

=== App Manager and web publishing ===
App Manager is a lightweight web-based tool for creating, managing and publishing Verge3D projects, running on top of the local development server. The Verge3D Network service integrated in the App Manager allows for publishing Verge3D applications via Amazon S3 and EC2 cloud services.

=== PBR ===
For purposes of authoring materials, a glTF 2.0-compliant physically based rendering pipeline is offered alongside the standard shader-based approach. PBR textures can be authored using external texturing software such as Substance Painter, for which Verge3D offers the corresponding export preset. Besides the glTF 2.0 model, Verge3D supports physical materials of 3ds Max and Maya (with Autodesk Arnold as reference), and Blender's real-time Eevee materials.

=== glTF and DCC software integration ===
Verge3D integrates directly with Blender, 3ds Max, and Maya, enabling users to create 3D geometry, materials, and animations inside the software, then export them in the JSON-based glTF format. The Sneak Peek feature allows for exporting and viewing scenes from the DCC tool environment.
=== Facebook 3D posts ===
For Facebook publishing, Verge3D offers a specific GLB export option. The exported GLB files are displayed and can be opened in the App Manager.

=== Asset compression ===
Exported files can optionally use LZMA compression, resulting in a reduction in file size of up to 6x.

=== UI and website layouts ===
Interface layouts, created using external WYSIWYG editors, can be linked with Puzzles to trigger changes to a 3D scene being rendered in the browser and vice versa.

=== Animation ===
Verge3D supports skeletal animation, including animation of bipeds and character rigs, and allows for animation of material parameters. Model parts can also be set up to be dragged by the user.

=== Physics ===
The physics module can be linked separately to enable collision detection, dynamically moving objects, support for characters and vehicles, springs, ropes and cloth simulation. As of version 2.11, simple physics simulations can be created and controlled without coding via Puzzles, the visual programming system used by Verge3D.

=== AR/VR ===
The 2.10 update added support for WebXR, an in-development open technology designed to enable virtual reality and augmented reality experiences to be displayed in web browsers. It works both with headsets that have controllers, like the HTC Vive and Oculus Rift, and with those that do not, like Google Cardboard. AR/VR experiences can be enabled via Puzzles or JavaScript.

== Workflow ==
Verge3D's workflow differs substantially from other mainstream WebGL frameworks. Development of a new Verge3D application usually starts with modeling, texturing and animating 3D objects. The models are assembled in the 3D authoring tool. The scene file is then used as a basis for a Verge3D project initialized from the App Manager. An interactive scenario is optionally added using the Puzzles editor. A Verge3D application can be previewed in the web browser at any development stage using the App Manager.
The finished web application can be deployed on the Verge3D Network, on Facebook or on the user's website.

== Notable uses ==
NASA's Jet Propulsion Laboratory used Verge3D to create an interactive 3D visualization of the Mars InSight lander. The web application allows for exploring and interacting with the real-time model of the spacecraft, with the possibility to move different parts and unfurl the solar panels.

NASA's older interactive web application Experience Curiosity was ported to Verge3D from Blend4Web. The application makes it possible to operate the rover, control its cameras and the robotic arm and reproduces some of the prominent events of the Mars Science Laboratory mission.

Route 66 Digital's Escape Room used Verge3D and Blender. This interactive short explores how users can navigate 3D spaces and interact with objects without the need for instruction.