Something Big Is Happening

Something Big Is Happening

"Something Big Is Happening" is an essay by Matt Shumer, an AI entrepreneur, about the impact of artificial intelligence, published in February 2026, that has since been reportedly viewed more than 80 million times and widely discussed. Shumer noted that the technology has crossed an important threshold, where AI has become capable of creating self-improving systems. Referring to one the most recent AI models, he wrote: "It was making intelligent decisions. It had something that felt, for the first time, like judgment. Like taste." Speaking to CNBC's Power Lunch, Shumer said that his "core message" is "people in the workforce should start to use and experiment with AI tools so they can understand what’s coming". Even as the essay was widely shared and discussed, the essay also elicited criticism. Paulo Carvao, in an essay published by the Forbes Magazine stated that some of his advice is sound, but added: "It reads at times like a sales pitch. He urges readers to subscribe to the most advanced AI tools. He implies that those with access to premium models will outpace those without. He frames paid AI subscriptions as a form of insurance against obsolescence." Writing in The Guardian, Dan Milmo and Aisha Down mentioned Shumer as having a history of AI hype and stated, "He previously excited the internet by announcing the release of the world's "top open-source model", which it was not". Many workers in the technology sector criticized the article in blog posts shared on Hacker News; Edward Zitron commented that "while coding LLMs can test products, or scan/fix some bugs, this suggests they A) do this autonomously without human input, B) they do this correctly every time (or ever!)." In an article alluding to Shumer's original post, Ari Colaprete wrote "the LLM is fundamentally a writing machine, it does everything via text, and if you make it produce writing that exists purely to serve some sort of mechanical function, and you train it to succeed in that task, then it will tend to do so, even with vast intricacy."

Data-driven model

Data-driven models are a class of computational models that primarily rely on historical data collected throughout a system's or process' lifetime to establish relationships between input, internal, and output variables. Commonly found in numerous articles and publications, data-driven models have evolved from earlier statistical models, overcoming limitations posed by strict assumptions about probability distributions. These models have gained prominence across various fields, particularly in the era of big data, artificial intelligence, and machine learning, where they offer valuable insights and predictions based on the available data. == Background == These models have evolved from earlier statistical models, which were based on certain assumptions about probability distributions that often proved to be overly restrictive. The emergence of data-driven models in the 1950s and 1960s coincided with the development of digital computers, advancements in artificial intelligence research, and the introduction of new approaches in non-behavioural modelling, such as pattern recognition and automatic classification. == Key Concepts == Data-driven models encompass a wide range of techniques and methodologies that aim to intelligently process and analyse large datasets. Examples include fuzzy logic, fuzzy and rough sets for handling uncertainty, neural networks for approximating functions, global optimization and evolutionary computing, statistical learning theory, and Bayesian methods. These models have found applications in various fields, including economics, customer relations management, financial services, medicine, and the military, among others. Machine learning, a subfield of artificial intelligence, is closely related to data-driven modelling as it also focuses on using historical data to create models that can make predictions and identify patterns. In fact, many data-driven models incorporate machine learning techniques, such as regression, classification, and clustering algorithms, to process and analyse data. In recent years, the concept of data-driven models has gained considerable attention in the field of water resources, with numerous applications, academic courses, and scientific publications using the term as a generalization for models that rely on data rather than physics. This classification has been featured in various publications and has even spurred the development of hybrid models in the past decade. Hybrid models attempt to quantify the degree of physically based information used in hydrological models and determine whether the process of building the model is primarily driven by physics or purely data-based. As a result, data-driven models have become an essential topic of discussion and exploration within water resources management and research. The term "data-driven modelling" (DDM) refers to the overarching paradigm of using historical data in conjunction with advanced computational techniques, including machine learning and artificial intelligence, to create models that can reveal underlying trends, patterns, and, in some cases, make predictions Data-driven models can be built with or without detailed knowledge of the underlying processes governing the system behavior, which makes them particularly useful when such knowledge is missing or fragmented.

AIOps

AIOps (Artificial Intelligence for IT Operations) refers to the use of artificial intelligence, machine learning, and big data analytics to automate and enhance data center management. It helps organizations manage complex IT environments by detecting, diagnosing, and resolving issues more efficiently than traditional methods. == History == AIOps was first defined by Gartner in 2016, combining "artificial intelligence" and "IT operations" to describe the application of AI and machine learning to enhance IT operations. This concept was introduced to address the increasing complexity and data volume in IT environments, aiming to automate processes such as event correlation, anomaly detection, and causality determination. == Definition == AIOps refers to multi-layered, complex technology platforms that enhance and automate IT operations by using machine learning and analytics to analyze the large amounts of data collected from various DevOps devices and tools, automatically identifying and responding to issues in real-time. AIOps represents a shift from isolated IT data to aggregated observational data (e.g., job logs and monitoring systems) and interaction data (such as ticketing, events, or incident records) within a big data platform. AIOps applies machine learning and analytics to this data, resulting in continuous visibility that, when combined with automation, can lead to ongoing improvements. AIOps connects three IT disciplines (automation, service management, and performance management) to achieve continuous visibility and improvement. This new approach in modern, accelerated, and hyper-scaled IT environments leverages advances in machine learning and big data to overcome previous limitations. == Components == AIOps includes, but is not limited to, the following processes and techniques: Anomaly Detection Log Analysis Root Cause Analysis Cohort Analysis Event Correlation Predictive Analytics Hardware Failure Prediction Automated Remediation Performance Prediction Incident Management Causality Determination Queue Management Resource Scheduling and Optimization Predictive Capacity Management Resource Allocation Service Quality Monitoring Deployment and Integration Testing System Configuration Auto-diagnosis and Problem Localization Efficient ML Training and Inferencing Using LLMs for Cloud Ops Auto Service Healing Data Center Management Customer Support Security and Privacy in Cloud Operations == Comparison with DevOps == AIOps is increasingly compared with DevOps in terms of impact on operational efficiency. While DevOps focuses on collaboration between development and operations teams to accelerate software delivery, AIOps integrates artificial intelligence to enhance monitoring, automation, and predictive capabilities. Various industry analyses have explored the similarities and differences between the two approaches, including discussions on how organizations can combine them to improve incident management and resource optimization. == Results == AI optimizes IT operations in five ways: First, intelligent monitoring powered by AI helps identify potential issues before they cause outages, improving metrics like Mean Time to Detect (MTTD) by 15-20%. Second, performance data analysis and insights enable quick decision-making by ingesting and analyzing large data sets in real time. Third, AI-driven automated infrastructure optimization efficiently allocates resources and thereby reducing cloud costs. Fourth, enhanced IT service management reduces critical incidents by over 50% through AI-driven end-to-end service management. Lastly, intelligent task automation accelerates problem resolution and automates remedial actions with minimal human intervention. In 2025, Atera Networks was identified as a leader in AIOps by the software review platform G2. == AIOps vs. MLOps == AIOps tools use big data analytics, machine learning algorithms, and predictive analytics to detect anomalies, correlate events, and provide proactive insights. This automation reduces the burden on IT teams, allowing them to focus on strategic tasks rather than routine operational issues. AIOps is widely used by IT operations teams, DevOps, network administrators, and IT service management (ITSM) teams to enhance visibility and enable quicker incident resolution in hybrid cloud environments, data centers, and other IT infrastructures. In contrast to MLOps (Machine Learning Operations), which focuses on the lifecycle management and operational aspects of machine learning models, AIOps focuses on optimizing IT operations using a variety of analytics and AI-driven techniques. While both disciplines rely on AI and data-driven methods, AIOps primarily targets IT operations, whereas MLOps is concerned with the deployment, monitoring, and maintenance of ML models. == Conferences == There are several conferences that are specific to AIOps: AIOps Summit AI Dev Summit IBM Think conference

EM algorithm and GMM model

In statistics, EM (expectation maximization) algorithm handles latent variables, while GMM is the Gaussian mixture model. == Background == In the picture below, are shown the red blood cell hemoglobin concentration and the red blood cell volume data of two groups of people, the Anemia group and the control group (i.e. the group of people without Anemia). As expected, people with Anemia have lower red blood cell volume and lower red blood cell hemoglobin concentration than those without Anemia. x {\displaystyle x} is a random vector such as x := ( red blood cell volume , red blood cell hemoglobin concentration ) {\displaystyle x:={\big (}{\text{red blood cell volume}},{\text{red blood cell hemoglobin concentration}}{\big )}} , and from medical studies it is known that x {\displaystyle x} are normally distributed in each group, i.e. x ∼ N ( μ , Σ ) {\displaystyle x\sim {\mathcal {N}}(\mu ,\Sigma )} . z {\displaystyle z} is denoted as the group where x {\displaystyle x} belongs, with z i = 0 {\displaystyle z_{i}=0} when x i {\displaystyle x_{i}} belongs to the Anemia group and z i = 1 {\displaystyle z_{i}=1} when x i {\displaystyle x_{i}} belongs to the control group. Also z ∼ Categorical ⁡ ( k , ϕ ) {\displaystyle z\sim \operatorname {Categorical} (k,\phi )} where k = 2 {\displaystyle k=2} , ϕ j ≥ 0 , {\displaystyle \phi _{j}\geq 0,} and ∑ j = 1 k ϕ j = 1 {\displaystyle \sum _{j=1}^{k}\phi _{j}=1} . See Categorical distribution. The following procedure can be used to estimate ϕ , μ , Σ {\displaystyle \phi ,\mu ,\Sigma } . A maximum likelihood estimation can be applied: ℓ ( ϕ , μ , Σ ) = ∑ i = 1 m log ⁡ ( p ( x ( i ) ; ϕ , μ , Σ ) ) = ∑ i = 1 m log ⁡ ∑ z ( i ) = 1 k p ( x ( i ) ∣ z ( i ) ; μ , Σ ) p ( z ( i ) ; ϕ ) {\displaystyle \ell (\phi ,\mu ,\Sigma )=\sum _{i=1}^{m}\log(p(x^{(i)};\phi ,\mu ,\Sigma ))=\sum _{i=1}^{m}\log \sum _{z^{(i)}=1}^{k}p\left(x^{(i)}\mid z^{(i)};\mu ,\Sigma \right)p(z^{(i)};\phi )} As the z i {\displaystyle z_{i}} for each x i {\displaystyle x_{i}} are known, the log likelihood function can be simplified as below: ℓ ( ϕ , μ , Σ ) = ∑ i = 1 m log ⁡ p ( x ( i ) ∣ z ( i ) ; μ , Σ ) + log ⁡ p ( z ( i ) ; ϕ ) {\displaystyle \ell (\phi ,\mu ,\Sigma )=\sum _{i=1}^{m}\log p\left(x^{(i)}\mid z^{(i)};\mu ,\Sigma \right)+\log p\left(z^{(i)};\phi \right)} Now the likelihood function can be maximized by making partial derivative over μ , Σ , ϕ {\displaystyle \mu ,\Sigma ,\phi } , obtaining: ϕ j = 1 m ∑ i = 1 m 1 { z ( i ) = j } {\displaystyle \phi _{j}={\frac {1}{m}}\sum _{i=1}^{m}1\{z^{(i)}=j\}} μ j = ∑ i = 1 m 1 { z ( i ) = j } x ( i ) ∑ i = 1 m 1 { z ( i ) = j } {\displaystyle \mu _{j}={\frac {\sum _{i=1}^{m}1\{z^{(i)}=j\}x^{(i)}}{\sum _{i=1}^{m}1\left\{z^{(i)}=j\right\}}}} Σ j = ∑ i = 1 m 1 { z ( i ) = j } ( x ( i ) − μ j ) ( x ( i ) − μ j ) T ∑ i = 1 m 1 { z ( i ) = j } {\displaystyle \Sigma _{j}={\frac {\sum _{i=1}^{m}1\{z^{(i)}=j\}(x^{(i)}-\mu _{j})(x^{(i)}-\mu _{j})^{T}}{\sum _{i=1}^{m}1\{z^{(i)}=j\}}}} If z i {\displaystyle z_{i}} is known, the estimation of the parameters results to be quite simple with maximum likelihood estimation. But if z i {\displaystyle z_{i}} is unknown it is much more complicated. Being z {\displaystyle z} a latent variable (i.e. not observed), with unlabeled scenario, the expectation maximization algorithm is needed to estimate z {\displaystyle z} as well as other parameters. Generally, this problem is set as a GMM since the data in each group is normally distributed. In machine learning, the latent variable z {\displaystyle z} is considered as a latent pattern lying under the data, which the observer is not able to see very directly. x i {\displaystyle x_{i}} is the known data, while ϕ , μ , Σ {\displaystyle \phi ,\mu ,\Sigma } are the parameter of the model. With the EM algorithm, some underlying pattern z {\displaystyle z} in the data x i {\displaystyle x_{i}} can be found, along with the estimation of the parameters. The wide application of this circumstance in machine learning is what makes EM algorithm so important. == EM algorithm in GMM == The EM algorithm consists of two steps: the E-step and the M-step. Firstly, the model parameters and the z ( i ) {\displaystyle z^{(i)}} can be randomly initialized. In the E-step, the algorithm tries to guess the value of z ( i ) {\displaystyle z^{(i)}} based on the parameters, while in the M-step, the algorithm updates the value of the model parameters based on the guess of z ( i ) {\displaystyle z^{(i)}} of the E-step. These two steps are repeated until convergence is reached. The algorithm in GMM is: Repeat until convergence: 1. (E-step) For each i , j {\displaystyle i,j} , set w j ( i ) := p ( z ( i ) = j | x ( i ) ; ϕ , μ , Σ ) {\displaystyle w_{j}^{(i)}:=p\left(z^{(i)}=j|x^{(i)};\phi ,\mu ,\Sigma \right)} 2. (M-step) Update the parameters ϕ j := 1 m ∑ i = 1 m w j ( i ) {\displaystyle \phi _{j}:={\frac {1}{m}}\sum _{i=1}^{m}w_{j}^{(i)}} μ j := ∑ i = 1 m w j ( i ) x ( i ) ∑ i = 1 m w j ( i ) {\displaystyle \mu _{j}:={\frac {\sum _{i=1}^{m}w_{j}^{(i)}x^{(i)}}{\sum _{i=1}^{m}w_{j}^{(i)}}}} Σ j := ∑ i = 1 m w j ( i ) ( x ( i ) − μ j ) ( x ( i ) − μ j ) T ∑ i = 1 m w j ( i ) {\displaystyle \Sigma _{j}:={\frac {\sum _{i=1}^{m}w_{j}^{(i)}\left(x^{(i)}-\mu _{j}\right)\left(x^{(i)}-\mu _{j}\right)^{T}}{\sum _{i=1}^{m}w_{j}^{(i)}}}} With Bayes' rule, the following result is obtained by the E-step: p ( z ( i ) = j | x ( i ) ; ϕ , μ , Σ ) = p ( x ( i ) | z ( i ) = j ; μ , Σ ) p ( z ( i ) = j ; ϕ ) ∑ l = 1 k p ( x ( i ) | z ( i ) = l ; μ , Σ ) p ( z ( i ) = l ; ϕ ) {\displaystyle p\left(z^{(i)}=j|x^{(i)};\phi ,\mu ,\Sigma \right)={\frac {p\left(x^{(i)}|z^{(i)}=j;\mu ,\Sigma \right)p\left(z^{(i)}=j;\phi \right)}{\sum _{l=1}^{k}p\left(x^{(i)}|z^{(i)}=l;\mu ,\Sigma \right)p\left(z^{(i)}=l;\phi \right)}}} According to GMM setting, these following formulas are obtained: p ( x ( i ) | z ( i ) = j ; μ , Σ ) = 1 ( 2 π ) n / 2 | Σ j | 1 / 2 exp ⁡ ( − 1 2 ( x ( i ) − μ j ) T Σ j − 1 ( x ( i ) − μ j ) ) {\displaystyle p\left(x^{(i)}|z^{(i)}=j;\mu ,\Sigma \right)={\frac {1}{(2\pi )^{n/2}\left|\Sigma _{j}\right|^{1/2}}}\exp \left(-{\frac {1}{2}}\left(x^{(i)}-\mu _{j}\right)^{T}\Sigma _{j}^{-1}\left(x^{(i)}-\mu _{j}\right)\right)} p ( z ( i ) = j ; ϕ ) = ϕ j {\displaystyle p\left(z^{(i)}=j;\phi \right)=\phi _{j}} In this way, a switch between the E-step and the M-step is possible, according to the randomly initialized parameters.

Attention (machine learning)

In machine learning, attention is a method that determines the importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size. Unlike "hard" weights, which are computed during the backwards training pass, "soft" weights exist only in the forward pass and therefore change with every step of the input. Earlier designs implemented the attention mechanism in a serial recurrent neural network (RNN) language translation system, but a more recent design, namely the transformer, removed the slower sequential RNN and relied more heavily on the faster parallel attention scheme. Inspired by ideas about attention in humans, the attention mechanism was developed to address the weaknesses of using information from the hidden layers of recurrent neural networks. Recurrent neural networks favor information contained in words at the end of a sentence and thus deemed more recent, thereby tending to attenuate the significance and associated predictive weight assigned to information earlier in the sentence. Attention allows a token equal access to any part of a sentence directly, rather than only through the previous state. == History == Additional surveys of the attention mechanism in deep learning are provided by Niu et al. and Soydaner. The major breakthrough came with self-attention, where each element in the input sequence attends to all others, enabling the model to capture global dependencies. This idea was central to the Transformer architecture, which replaced recurrence with attention mechanisms. As a result, Transformers became the foundation for models like BERT, T5 and generative pre-trained transformers (GPT). == Overview == The modern era of machine attention was revitalized by grafting an attention mechanism (Fig 1. orange) to an Encoder-Decoder. Figure 2 shows the internal step-by-step operation of the attention block (A) in Fig 1. === Interpreting attention weights === In translating between languages, alignment is the process of matching words from the source sentence to words of the translated sentence. Networks that perform verbatim translation without regard to word order would show the highest scores along the (dominant) diagonal of the matrix. The off-diagonal dominance shows that the attention mechanism is more nuanced. Consider an example of translating I love you to French. On the first pass through the decoder, 94% of the attention weight is on the first English word I, so the network offers the word je. On the second pass of the decoder, 88% of the attention weight is on the third English word you, so it offers t'. On the last pass, 95% of the attention weight is on the second English word love, so it offers aime. In the I love you example, the second word love is aligned with the third word aime. Stacking soft row vectors together for je, t', and aime yields an alignment matrix: Sometimes, alignment can be multiple-to-multiple. For example, the English phrase look it up corresponds to cherchez-le. Thus, "soft" attention weights work better than "hard" attention weights (setting one attention weight to 1, and the others to 0), as we would like the model to make a context vector consisting of a weighted sum of the hidden vectors, rather than "the best one", as there may not be a best hidden vector. == Variants == Many variants of attention implement soft weights, such as fast weight programmers, or fast weight controllers (1992). A "slow" neural network outputs the "fast" weights of another neural network through outer products. The slow network learns by gradient descent. It was later renamed as "linearized self-attention". Bahdanau-style attention, also referred to as additive attention, Luong-style attention, which is known as multiplicative attention, Early attention mechanisms similar to modern self-attention were proposed using recurrent neural networks. However, the highly parallelizable self-attention was introduced in 2017 and successfully used in the Transformer model, positional attention and factorized positional attention. For convolutional neural networks, attention mechanisms can be distinguished by the dimension on which they operate, namely: spatial attention, channel attention, or combinations. These variants recombine the encoder-side inputs to redistribute those effects to each target output. Often, a correlation-style matrix of dot products provides the re-weighting coefficients. In the figures below, W is the matrix of context attention weights, similar to the formula in Overview section above. == Optimizations == === Flash attention === The size of the attention matrix is proportional to the square of the number of input tokens. Therefore, when the input is long, calculating the attention matrix requires a lot of GPU memory. Flash attention is an implementation that reduces the memory needs and increases efficiency without sacrificing accuracy. It achieves this by partitioning the attention computation into smaller blocks that fit into the GPU's faster on-chip memory, reducing the need to store large intermediate matrices and thus lowering memory usage while increasing computational efficiency. === FlexAttention === FlexAttention is an attention kernel developed by Meta that allows users to modify attention scores prior to softmax and dynamically chooses the optimal attention algorithm. == Applications == Attention is widely used in natural language processing, computer vision, and speech recognition. In NLP, it improves context understanding in tasks like question answering and summarization. In vision, visual attention helps models focus on relevant image regions, enhancing object detection and image captioning. === Attention maps as explanations for vision transformers === From the original paper on vision transformers (ViT), visualizing attention scores as a heat map (called saliency maps or attention maps) has become an important and routine way to inspect the decision making process of ViT models. One can compute the attention maps with respect to any attention head at any layer, while the deeper layers tend to show more semantically meaningful visualization. Attention rollout is a recursive algorithm to combine attention scores across all layers, by computing the dot product of successive attention maps. Because vision transformers are typically trained in a self-supervised manner, attention maps are generally not class-sensitive. When a classification head is attached to the ViT backbone, class-discriminative attention maps (CDAM) combines attention maps and gradients with respect to the class [CLS] token. Some class-sensitive interpretability methods originally developed for convolutional neural networks can be also applied to ViT, such as GradCAM, which back-propagates the gradients to the outputs of the final attention layer. Using attention as basis of explanation for the transformers in language and vision is not without debate. While some pioneering papers analyzed and framed attention scores as explanations, higher attention scores do not always correlate with greater impact on model performances. == Mathematical representation == === Standard scaled dot-product attention === For matrices: Q ∈ R m × d k , K ∈ R n × d k {\displaystyle Q\in \mathbb {R} ^{m\times d_{k}},K\in \mathbb {R} ^{n\times d_{k}}} and V ∈ R n × d v {\displaystyle V\in \mathbb {R} ^{n\times d_{v}}} , the scaled dot-product, or QKV attention, is defined as: Attention ( Q , K , V ) = softmax ( Q K T d k ) V ∈ R m × d v {\displaystyle {\text{Attention}}(Q,K,V)={\text{softmax}}\left({\frac {QK^{T}}{\sqrt {d_{k}}}}\right)V\in \mathbb {R} ^{m\times d_{v}}} where T {\displaystyle {}^{T}} denotes transpose and the softmax function is applied independently to every row of its argument. The matrix Q {\displaystyle Q} contains m {\displaystyle m} queries, while matrices K , V {\displaystyle K,V} jointly contain an unordered set of n {\displaystyle n} key-value pairs. Value vectors in matrix V {\displaystyle V} are weighted using the weights resulting from the softmax operation, so that the rows of the m {\displaystyle m} -by- d v {\displaystyle d_{v}} output matrix are confined to the convex hull of the points in R d v {\displaystyle \mathbb {R} ^{d_{v}}} given by the rows of V {\displaystyle V} . To understand the permutation invariance and permutation equivariance properties of QKV attention, let A ∈ R m × m {\displaystyle A\in \mathbb {R} ^{m\times m}} and B ∈ R n × n {\displaystyle B\in \mathbb {R} ^{n\times n}} be permutation matrices; and D ∈ R m × n {\displaystyle D\in \mathbb {R} ^{m\times n}} an arbitrary matrix. The softmax function is permutation equivariant in the sense that: softmax ( A D B ) = A softmax ( D ) B {\displays

Deep tomographic reconstruction

Deep Tomographic Reconstruction is a set of methods for using deep learning methods to perform tomographic reconstruction of medical and industrial images. It uses artificial intelligence and machine learning, especially deep artificial neural networks or deep learning, to overcome challenges such as measurement noise, data sparsity, image artifacts, and computational inefficiency. This approach has been applied across various imaging modalities, including CT, MRI, PET, SPECT, ultrasound, and optical imaging == Historical background == Traditional tomographic reconstruction relies on analytic methods such as filtered back-projection, or iterative methods which incrementally compute inverse transformations from measurement data (e.g., Radon or Fourier transform data). However, these approaches are not sufficient for certain imaging techniques such as low-dose CT and fast MRI, or scenarios involving metal artifacts and patient motion. == Use in imaging modalities == === Computed tomography (CT) === In CT, deep learning models can be particularly effective in reducing radiation exposure while maintaining image quality. Deep neural networks can also be able to reconstruct images of fair quality from sparsely sampled data without sacrificing diagnostic performance. Deep learning-based generative AI models can reduce CT metal artifacts. === Magnetic resonance imaging (MRI) === In magnetic resonance imaging (MRI), deep learning can lead to reduced MRI motion artifacts, and increased acquisition speed, referred to as fast MRI. Despite suffering from disadvantages such as lower signal-to-noise ratio (SNR), deep learning can enhance image quality in low field MRI, making these systems clinically viable. === Positron emission tomography (PET) and single-photon emission CT (SPECT) === For PET imaging, deep learning models can provide substantial improvements in low-dose imaging and motion artifact correction. Also, deep learning can help SPECT for generation of attenuation background. A notable technique for PET denoising involves integrating MR data through multimodal networks, which use anatomical information from MRI to enhance PET image quality. === Ultrasound imaging === Deep learning can enhance ultrasound imaging by reducing speckle noise and motion blur. For ultrasound beamforming, deep neural networks can allow superior image quality with limited data at high speed. === Optical imaging and microscopy === Diffuse optical tomography, optical coherence tomography and microscopy can be improved by deep neural networks beyond traditional methods. Furthermore, deep learning can also enhance Photoacoustic imaging (see Deep learning in photoacoustic imaging), addressing challenges like high noise, low contrast, and limited resolution. Deep learning has also been applied to label-free live-cell imaging, where convolutional neural networks predict fluorescence labels from transmitted light images, a technique known as in silico labeling. This method can enable high-throughput, non-invasive cell analysis and phenotyping without the need for traditional fluorescent dyes.

PagedAttention

PagedAttention is an attention algorithm for efficient serving of large language models (LLMs). It was introduced in 2023 by Woosuk Kwon and colleagues in the paper Efficient Memory Management for Large Language Model Serving with PagedAttention, alongside the vLLM serving engine. The method stores the key–value cache used during autoregressive decoding in fixed-size blocks that can be mapped to non-contiguous physical memory, borrowing ideas from virtual memory, paging, and operating system design. == Background == In transformer inference, the key–value cache grows with sequence length and the number of concurrent requests. Kwon et al. argued that earlier serving systems typically reserved contiguous cache regions in advance, which caused reserved space, internal fragmentation, and external fragmentation. In their experiments, the paper reported that the effective memory utilization of previous systems could fall as low as 20.4%. == Description == PagedAttention partitions the cache of each sequence into fixed-size KV blocks. A request's cache is represented as a sequence of logical blocks, while a block table maps those logical blocks to physical GPU-memory blocks. As a result, neighboring logical blocks do not need to be contiguous in physical memory, and new blocks can be allocated on demand as generation proceeds. The design also makes it easier to share cache state across related decoding paths. In vLLM, physical blocks can be reference-counted and shared among requests or branches, with block-granularity copy-on-write used when a shared block must be modified. The original paper applied this design to parallel sampling, beam search, and prompts with shared prefixes. == Mathematical formulation == For a query token i {\displaystyle i} in causal self-attention, the standard attention output can be written as a i j = exp ⁡ ( q i ⊤ k j / d ) ∑ t = 1 i exp ⁡ ( q i ⊤ k t / d ) , o i = ∑ j = 1 i a i j v j {\displaystyle a_{ij}={\frac {\exp(\mathbf {q} _{i}^{\top }\mathbf {k} _{j}/{\sqrt {d}})}{\sum _{t=1}^{i}\exp(\mathbf {q} _{i}^{\top }\mathbf {k} _{t}/{\sqrt {d}})}},\;\mathbf {o} _{i}=\sum _{j=1}^{i}a_{ij}\mathbf {v} _{j}} where q i {\displaystyle \mathbf {q} _{i}} , k j {\displaystyle \mathbf {k} _{j}} , and v j {\displaystyle \mathbf {v} _{j}} are the query, key, and value vectors, and d {\displaystyle d} is the attention dimension. If the cache is partitioned into blocks of size B {\displaystyle B} , the key and value blocks may be written as K j = ( k ( j − 1 ) B + 1 , … , k j B ) , V j = ( v ( j − 1 ) B + 1 , … , v j B ) {\displaystyle \mathbf {K} _{j}=(\mathbf {k} _{(j-1)B+1},\ldots ,\mathbf {k} _{jB}),\;\mathbf {V} _{j}=(\mathbf {v} _{(j-1)B+1},\ldots ,\mathbf {v} _{jB})} PagedAttention then performs the computation blockwise: A i j = exp ⁡ ( q i ⊤ K j / d ) ∑ t = 1 ⌈ i / B ⌉ exp ⁡ ( q i ⊤ K t / d ) , o i = ∑ j = 1 ⌈ i / B ⌉ V j A i j ⊤ {\displaystyle \mathbf {A} _{ij}={\frac {\exp(\mathbf {q} _{i}^{\top }\mathbf {K} _{j}/{\sqrt {d}})}{\sum _{t=1}^{\lceil i/B\rceil }\exp(\mathbf {q} _{i}^{\top }\mathbf {K} _{t}/{\sqrt {d}})}},\;\mathbf {o} _{i}=\sum _{j=1}^{\lceil i/B\rceil }\mathbf {V} _{j}\mathbf {A} _{ij}^{\top }} where A i j {\displaystyle \mathbf {A} _{ij}} is the vector of attention scores for the j {\displaystyle j} -th KV block. In the formulation given by Kwon et al., this preserves the causal attention calculation while allowing the key and value blocks to reside in non-contiguous physical memory. == Performance and use == The vLLM paper reported that, on its evaluated workloads, the use of PagedAttention and the associated memory-management design improved serving throughput by 2–4× over the compared baselines, including FasterTransformer and Orca, while preserving model outputs. In experiments on OPT-13B with the Alpaca trace, the paper also reported memory savings of 6.1–9.8% for parallel sampling and 37.6–55.2% for beam search through KV-block sharing. A 2024 survey of LLM serving systems described PagedAttention as having become an industry norm in LLM serving frameworks, citing support in TGI, vLLM, and TensorRT-LLM. == Limitations and alternatives == Subsequent work has described trade-offs in the approach. The 2025 vAttention paper argued that PagedAttention requires attention kernels to be rewritten to support paging and increases software complexity, portability issues, redundancy, and execution overhead, proposing instead a memory manager that keeps the cache contiguous in virtual memory while relying on demand paging for physical allocation. === vAttention === Unlike PagedAttention, vAttention does not introduce a different attention rule; it retains the standard attention computation Attention ⁡ ( q i , K , V ) = softmax ⁡ ( q i K ⊤ s c a l e ) V . {\displaystyle \operatorname {Attention} (q_{i},K,V)=\operatorname {softmax} \left({\frac {q_{i}K^{\top }}{\mathrm {scale} }}\right)V.} In the notation of Prabhu et al., the key and value tensors for a request seen so far are K , V ∈ R L ′ × ( H × D ) {\displaystyle K,V\in \mathbb {R} ^{L'\times (H\times D)}} , where L ′ {\displaystyle L'} is the context length seen so far, H {\displaystyle H} is the number of KV heads on a worker, and D {\displaystyle D} is the dimension of each KV head. In systems prior to PagedAttention, the K cache (or V cache) at each layer of a worker is typically allocated as a 4D tensor of shape [ B , L , H , D ] , {\displaystyle [B,L,H,D],} where B {\displaystyle B} is batch size and L {\displaystyle L} is the maximum context length supported by the model. vAttention preserves this contiguous virtual-memory view while deferring physical-memory allocation to runtime. A serving framework maintains separate K and V tensors for each layer, so vAttention reserves 2 N {\displaystyle 2N} virtual-memory buffers on a worker, where N {\displaystyle N} is the number of layers managed by that worker. The maximum size of one virtual-memory buffer is B S = B × S , {\displaystyle BS=B\times S,} where S {\displaystyle S} is the maximum size of a single request's per-layer K cache (or V cache) on a worker. The paper defines S = L × H × D × P , {\displaystyle S=L\times H\times D\times P,} where P {\displaystyle P} is the number of bytes needed to store one element. In this formulation, vAttention keeps the KV cache contiguous in virtual memory and relies on demand paging for physical allocation, rather than modifying the attention kernel to operate over non-contiguous KV-cache blocks.