Computer Graphics International

Computer Graphics International

Computer Graphics International (CGI) is one of the oldest annual international conferences on computer graphics. It is organized by the Computer Graphics Society (CGS). Researchers across the whole world are invited to share their experiences and novel achievements in various fields - like computer graphics and human-computer interaction. Former conferences have been held recently in Hong Kong (China), Geneva (Switzerland), Shanghai (China), Geneva (virtually), Calgary (Canada), Bintan (Indonesia) and Yokohama (Japan). == Awards == Starting in the year of 2013, CGI has given yearly a Best Paper Award and a Career Achievement Award. == Venues ==

Exposure Notification

The (Google/Apple) Exposure Notification System (GAEN) is a framework and protocol specification developed by Apple Inc. and Google to facilitate digital contact tracing during the COVID-19 pandemic. When used by health authorities, it augments more traditional contact tracing techniques by automatically logging close approaches among notification system users using Android or iOS smartphones. Exposure Notification is a decentralized reporting protocol built on a combination of Bluetooth Low Energy technology and privacy-preserving cryptography. It is an opt-in feature within COVID-19 apps developed and published by authorized health authorities. Unveiled on April 10, 2020, it was made available on iOS on May 20, 2020, as part of the iOS 13.5 update and on December 14, 2020, as part of the iOS 12.5 update for older iPhones. On Android, it was added to devices via a Google Play Services update, supporting all versions since Android Marshmallow. The Apple/Google protocol is similar to the Decentralized Privacy-Preserving Proximity Tracing (DP-3T) protocol created by the European DP-3T consortium and the Temporary Contact Number (TCN) protocol by Covid Watch, but is implemented at the operating system level, which allows for more efficient operation as a background process. Since May 2020, a variant of the DP-3T protocol is supported by the Exposure Notification Interface. Other protocols are constrained in operation because they are not privileged over normal apps. This leads to issues, particularly on iOS devices where digital contact tracing apps running in the background experience significantly degraded performance. The joint approach is also designed to maintain interoperability between Android and iOS devices, which constitute nearly all of the market. The ACLU stated the approach "appears to mitigate the worst privacy and centralization risks, but there is still room for improvement". In late April, Google and Apple shifted the emphasis of the naming of the system, describing it as an "exposure notification service", rather than "contact tracing" system. == Technical specification == Digital contact tracing protocols typically have two major responsibilities: encounter logging and infection reporting. Exposure Notification only involves encounter logging which is a decentralized architecture. The majority of infection reporting is centralized in individual app implementations. To handle encounter logging, the system uses Bluetooth Low Energy to send tracking messages to nearby devices running the protocol to discover encounters with other people. The tracking messages contain unique identifiers that are encrypted with a secret daily key held by the sending device. These identifiers change every 15–20 minutes as well as Bluetooth MAC address in order to prevent tracking of clients by malicious third parties through observing static identifiers over time. The sender's daily encryption keys are generated using a random number generator. Devices record received messages, retaining them locally for 14 days. If a user tests positive for infection, the last 14 days of their daily encryption keys can be uploaded to a central server, where it is then broadcast to all devices on the network. The method through which daily encryption keys are transmitted to the central server and broadcast is defined by individual app developers. The Google-developed reference implementation calls for a health official to request a one-time verification code (VC) from a verification server, which the user enters into the encounter logging app. This causes the app to obtain a cryptographically signed certificate, which is used to authorize the submission of keys to the central reporting server. The received keys are then provided to the protocol, where each client individually searches for matches in their local encounter history. If a match meeting certain risk parameters is found, the app notifies the user of potential exposure to the infection. Google and Apple intend to use the received signal strength (RSSI) of the beacon messages as a source to infer proximity. RSSI and other signal metadata will also be encrypted to resist deanonymization attacks. === Version 1.0 === To generate encounter identifiers, first a persistent 32-byte private Tracing Key ( t k {\displaystyle tk} ) is generated by a client. From this a 16 byte Daily Tracing Key is derived using the algorithm d t k i = H K D F ( t k , N U L L , 'CT-DTK' | | D i , 16 ) {\displaystyle dtk_{i}=HKDF(tk,NULL,{\text{'CT-DTK'}}||D_{i},16)} , where H K D F ( Key, Salt, Data, OutputLength ) {\displaystyle HKDF({\text{Key, Salt, Data, OutputLength}})} is a HKDF function using SHA-256, and D i {\displaystyle D_{i}} is the day number for the 24-hour window the broadcast is in starting from Unix Epoch Time. These generated keys are later sent to the central reporting server should a user become infected. From the daily tracing key a 16-byte temporary Rolling Proximity Identifier is generated every 10 minutes with the algorithm R P I i , j = Truncate ( H M A C ( d t k i , 'CT-RPI' | | T I N j ) , 16 ) {\displaystyle RPI_{i,j}={\text{Truncate}}(HMAC(dtk_{i},{\text{'CT-RPI'}}||TIN_{j}),16)} , where H M A C ( Key, Data ) {\displaystyle HMAC({\text{Key, Data}})} is a HMAC function using SHA-256, and T I N j {\displaystyle TIN_{j}} is the time interval number, representing a unique index for every 10 minute period in a 24-hour day. The Truncate function returns the first 16 bytes of the HMAC value. When two clients come within proximity of each other they exchange and locally store the current R P I i , j {\displaystyle RPI_{i,j}} as the encounter identifier. Once a registered health authority has confirmed the infection of a user, the user's Daily Tracing Key for the past 14 days is uploaded to the central reporting server. Clients then download this report and individually recalculate every Rolling Proximity Identifier used in the report period, matching it against the user's local encounter log. If a matching entry is found, then contact has been established and the app presents a notification to the user warning them of potential infection. === Version 1.1 === Unlike version 1.0 of the protocol, version 1.1 does not use a persistent tracing key, rather every day a new random 16-byte Temporary Exposure Key ( t e k i {\displaystyle tek_{i}} ) is generated. This is analogous to the daily tracing key from version 1.0. Here i {\displaystyle i} denotes the time is discretized in 10 minute intervals starting from Unix Epoch Time. From this two 128-bit keys are calculated, the Rolling Proximity Identifier Key ( R P I K i {\displaystyle RPIK_{i}} ) and the Associated Encrypted Metadata Key ( A E M K i {\displaystyle AEMK_{i}} ). R P I K i {\displaystyle RPIK_{i}} is calculated with the algorithm R P I K i = H K D F ( t e k i , N U L L , 'EN-RPIK' , 16 ) {\displaystyle RPIK_{i}=HKDF(tek_{i},NULL,{\text{'EN-RPIK'}},16)} , and A E M K i {\displaystyle AEMK_{i}} using the algorithm A E M K i = H K D F ( t e k i , N U L L , 'EN-AEMK' , 16 ) {\displaystyle AEMK_{i}=HKDF(tek_{i},NULL,{\text{'EN-AEMK'}},16)} . From these values a temporary Rolling Proximity Identifier ( R P I i , j {\displaystyle RPI_{i,j}} ) is generated every time the BLE MAC address changes, roughly every 15–20 minutes. The following algorithm is used: R P I i , j = A E S 128 ( R P I K i , 'EN-RPI' | | 0 x 000000000000 | | E N I N j ) {\displaystyle RPI_{i,j}=AES128(RPIK_{i},{\text{'EN-RPI'}}||{\mathtt {0x000000000000}}||ENIN_{j})} , where A E S 128 ( Key, Data ) {\displaystyle AES128({\text{Key, Data}})} is an AES cryptography function with a 128-bit key, the data is one 16-byte block, j {\displaystyle j} denotes the Unix Epoch Time at the moment the roll occurs, and E N I N j {\displaystyle ENIN_{j}} is the corresponding 10-minute interval number. Next, additional Associated Encrypted Metadata is encrypted. What the metadata represents is not specified, likely to allow the later expansion of the protocol. The following algorithm is used: Associated Encrypted Metadata i , j = A E S 128 _ C T R ( A E M K i , R P I i , j , Metadata ) {\displaystyle {\text{Associated Encrypted Metadata}}_{i,j}=AES128\_CTR(AEMK_{i},RPI_{i,j},{\text{Metadata}})} , where A E S 128 _ C T R ( Key, IV, Data ) {\displaystyle AES128\_CTR({\text{Key, IV, Data}})} denotes AES encryption with a 128-bit key in CTR mode. The Rolling Proximity Identifier and the Associated Encrypted Metadata are then combined and broadcast using BLE. Clients exchange and log these payloads. Once a registered health authority has confirmed the infection of a user, the user's Temporary Exposure Keys t e k i {\displaystyle tek_{i}} and their respective interval numbers i {\displaystyle i} for the past 14 days are uploaded to the central reporting server. Clients then download this report and individually recalculate every Rolling Proximity Identifier starting from interval number i {\displaystyle i} ,

PCVC Speech Dataset

The PCVC (Persian Consonant Vowel Combination) Speech Dataset is a Modern Persian speech corpus for speech recognition and also speaker recognition. The dataset contains sound samples of Modern Persian combination of vowel and consonant phonemes from different speakers. Every sound sample contains just one consonant and one vowel So it is somehow labeled in phoneme level. This dataset consists of 23 Persian consonants and 6 vowels. The sound samples are all possible combinations of vowels and consonants (138 samples for each speaker). The sample rate of all speech samples is 48000 which means there are 48000 sound samples in every 1 second. Every sound sample starts with consonant then continues with vowel. In each sample, in average, 0.5 second of each sample is speech and the rest is silence. Each sound sample ends with silence. All of sound samples are denoised with "Adaptive noise reduction" algorithm. Compared to Farsdat speech dataset and Persian speech corpus it is more easy to use because it is prepared in .mat data files. Also it is more based on phoneme based separation and all samples are denoised. == Contents == The corpus is downloadable from its Kaggle web page, and contains the following: .mat data files of sound samples in a 23630000 matrix, in which 23 is number of consonants, 6 is the number of vowels and 30000 is the length of sound sample.

Captions (app)

Mirage (formerly known as Captions) is a video-generating, video-editing and AI research company headquartered in New York City. Their first app, Captions, is available on iOS, Android, and Web and offers a suite of tools aimed at streamlining the creation and editing of videos. Their enterprise platform, Mirage Studio, generates AI actors and videos for marketing assets and video campaigns. == History == Mirage was co-founded by Gaurav Misra and Dwight Churchill. During Misra's time leading design engineering at Snap Inc., he followed the rise of a new category of video, the "talking video." In 2021, Misra left Snap to found Mirage with his former colleague Churchill. Later that year, the Captions app launched with early backing from venture capital firms Sequoia Capital and Andreessen Horowitz as well as individual investors. In 2023, the company released Lipdub, an Al dubbing app which translates any video with spoken audio into 28 languages. In October 2023, Captions shared that it maintained over 100,000 daily active users with "about a million" videos being created monthly. In November 2024, Captions acquired AlpacaML, a generative AI company that focused on art and other images. In June 2025, Captions launched Mirage Studio, for marketers and advertising agencies. In September 2025, Captions rebranded their company to Mirage. This change reflects the company's focus on developing their proprietary foundation model and future video products. == Products == The Captions app offers features to automate common production tasks including captioning, editing, dubbing, script creation, and music integration. Mirage Studio allows users to generate AI avatars and create short-form videos from prompts or audio. == Awards == In 2023, the company was recognized as part of Fast Company's "Next Big Things In Tech" series. In 2024, the company won 2 Webby Awards for Best Use of AI & Machine Learning and Creative Production.

Voice activity detection

Voice activity detection (VAD), also known as speech activity detection or speech detection, is the detection of the presence or absence of human speech, used in speech processing. The main uses of VAD are in speaker diarization, speech coding and speech recognition. It can facilitate speech processing, and can also be used to deactivate some processes during non-speech section of an audio session: it can avoid unnecessary coding/transmission of silence packets in Voice over Internet Protocol (VoIP) applications, saving on computation and on network bandwidth. VAD is an important enabling technology for a variety of speech-based applications. Therefore, various VAD algorithms have been developed that provide varying features and compromises between latency, sensitivity, accuracy and computational cost. Some VAD algorithms also provide further analysis, for example whether the speech is voiced, unvoiced or sustained. Voice activity detection is usually independent of language. It was first investigated for use on time-assignment speech interpolation (TASI) systems. == Algorithm overview == The typical design of a VAD algorithm is as follows: There may first be a noise reduction stage, e.g. via spectral subtraction. Then some features or quantities are calculated from a section of the input signal. A classification rule is applied to classify the section as speech or non-speech – often this classification rule finds when a value exceeds a certain threshold. There may be some feedback in this sequence, in which the VAD decision is used to improve the noise estimate in the noise reduction stage, or to adaptively vary the threshold(s). These feedback operations improve the VAD performance in non-stationary noise (i.e. when the noise varies a lot). A representative set of recently published VAD methods formulates the decision rule on a frame by frame basis using instantaneous measures of the divergence distance between speech and noise. The different measures which are used in VAD methods include spectral slope, correlation coefficients, log likelihood ratio, cepstral, weighted cepstral, and modified distance measures. Independently from the choice of VAD algorithm, a compromise must be made between having voice detected as noise, or noise detected as voice (between false positive and false negative). A VAD operating in a mobile phone must be able to detect speech in the presence of a range of very diverse types of acoustic background noise. In these difficult detection conditions it is often preferable that a VAD should fail-safe, indicating speech detected when the decision is in doubt, to lower the chance of losing speech segments. The biggest difficulty in the detection of speech in this environment is the very low signal-to-noise ratios (SNRs) that are encountered. It may be impossible to distinguish between speech and noise using simple level detection techniques when parts of the speech utterance are buried below the noise. == Applications == VAD is an integral part of different speech communication systems such as audio conferencing, echo cancellation, speech recognition, speech encoding, speaker recognition and hands-free telephony. In the field of multimedia applications, VAD allows simultaneous voice and data applications. Similarly, in Universal Mobile Telecommunications Systems (UMTS), it controls and reduces the average bit rate and enhances overall coding quality of speech. In cellular radio systems (for instance GSM and CDMA systems) based on Discontinuous Transmission (DTX) mode, VAD is essential for enhancing system capacity by reducing co-channel interference and power consumption in portable digital devices. In speech processing applications, voice activity detection plays an important role since non-speech frames are often discarded. For a wide range of applications such as digital mobile radio, Digital Simultaneous Voice and Data (DSVD) or speech storage, it is desirable to provide a discontinuous transmission of speech-coding parameters. Advantages can include lower average power consumption in mobile handsets, higher average bit rate for simultaneous services like data transmission, or a higher capacity on storage chips. However, the improvement depends mainly on the percentage of pauses during speech and the reliability of the VAD used to detect these intervals. On the one hand, it is advantageous to have a low percentage of speech activity. On the other hand, clipping, that is the loss of milliseconds of active speech, should be minimized to preserve quality. This is the crucial problem for a VAD algorithm under heavy noise conditions. === Use in telemarketing === One controversial application of VAD is in conjunction with predictive dialers used by telemarketing firms. In order to maximize agent productivity, telemarketing firms set up predictive dialers to call more numbers than they have agents available, knowing most calls will end up in either "Ring – No Answer" or answering machines. When a person answers, they typically speak briefly ("Hello", "Good evening", etc.) and then there is a brief period of silence. Answering machine messages are usually 3–15 seconds of continuous speech. By setting VAD parameters correctly, dialers can determine whether a person or a machine answered the call and, if it's a person, transfer the call to an available agent. If it detects an answering machine message, the dialer hangs up. Often, even when the system correctly detects a person answering the call, no agent may be available, resulting in a "silent call". Call screening with a multi-second message like "please say who you are, and I may pick up the phone" will frustrate such automated calls. == Performance evaluation == To evaluate a VAD, its output using test recordings is compared with those of an "ideal" VAD – created by hand-annotating the presence or absence of voice in the recordings. The performance of a VAD is commonly evaluated on the basis of the following four parameters: FEC (Front End Clipping): clipping introduced in passing from noise to speech activity; MSC (Mid Speech Clipping): clipping due to speech misclassified as noise; OVER: noise interpreted as speech due to the VAD flag remaining active in passing from speech activity to noise; NDS (Noise Detected as Speech): noise interpreted as speech within a silence period. Although the method described above provides useful objective information concerning the performance of a VAD, it is only an approximate measure of the subjective effect. For example, the effects of speech signal clipping can at times be hidden by the presence of background noise, depending on the model chosen for the comfort noise synthesis, so some of the clipping measured with objective tests is in reality not audible. It is therefore important to carry out subjective tests on VADs, the main aim of which is to ensure that the clipping perceived is acceptable. In VoIP applications, front-end clipping can be reduced by rewinding to shortly before the detection and sending very slightly delayed data. This kind of test requires a certain number of listeners to judge recordings containing the processing results of the VADs being tested, giving marks to several speech sequences on the following features: Quality; Comprehension difficulty; Audibility of clipping. These marks are then used to calculate average results for each of the features listed above, thus providing a global estimate of the behavior of the VAD being tested. To conclude, whereas objective methods are very useful in an initial stage to evaluate the quality of a VAD, subjective methods are more significant. As they require the participation of several people for a few days, increasing cost, they are generally only used when a proposal is about to be standardized. == Implementations == One early standard VAD is that developed by British Telecom for use in the Pan-European digital cellular mobile telephone service in 1991. It uses inverse filtering trained on non-speech segments to filter out background noise, so that it can then more reliably use a simple power-threshold to decide if a voice is present. The G.729 standard calculates the following features for its VAD: line spectral frequencies, full-band energy, low-band energy (<1 kHz), and zero-crossing rate. It applies a simple classification using a fixed decision boundary in the space defined by these features, and then applies smoothing and adaptive correction to improve the estimate. The GSM standard includes two VAD options developed by ETSI. Option 1 computes the SNR in nine bands and applies a threshold to these values. Option 2 calculates different parameters: channel power, voice metrics, and noise power. It then thresholds the voice metrics using a threshold that varies according to the estimated SNR. The Speex audio compression library uses a procedure named Improved Minima Controlled Recursive Averaging, which uses a smoothed representation of spectral pow

Inductive bias

The inductive bias (also known as learning bias) of a learning algorithm is the set of assumptions that the learner uses to predict outputs of given inputs that it has not encountered. Inductive bias is anything which makes the algorithm learn one pattern instead of another pattern (e.g., step-functions in decision trees instead of continuous functions in linear regression models). Learning involves searching a space of solutions for a solution that provides a good explanation of the data. However, in many cases, there may be multiple equally appropriate solutions. An inductive bias allows a learning algorithm to prioritize one solution (or interpretation) over another, independently of the observed data. In machine learning, the aim is to construct algorithms that are able to learn to predict a certain target output. To achieve this, the learning algorithm is presented some training examples that demonstrate the intended relation of input and output values. Then the learner is supposed to approximate the correct output, even for examples that have not been shown during training. Without any additional assumptions, this problem cannot be solved since unseen situations might have an arbitrary output value. The kind of necessary assumptions about the nature of the target function are subsumed in the phrase inductive bias. A classical example of an inductive bias is Occam's razor, assuming that the simplest consistent hypothesis about the target function is actually the best. Here, consistent means that the hypothesis of the learner yields correct outputs for all of the examples that have been given to the algorithm. Approaches to a more formal definition of inductive bias are based on mathematical logic. Here, the inductive bias is a logical formula that, together with the training data, logically entails the hypothesis generated by the learner. However, this strict formalism fails in many practical cases in which the inductive bias can only be given as a rough description (e.g., in the case of artificial neural networks), or not at all. == Types == The following is a list of common inductive biases in machine learning algorithms. Maximum conditional independence: if the hypothesis can be cast in a Bayesian framework, try to maximize conditional independence. This is the bias used in the Naive Bayes classifier. Minimum cross-validation error: when trying to choose among hypotheses, select the hypothesis with the lowest cross-validation error. Although cross-validation may seem to be free of bias, the "no free lunch" theorems show that cross-validation must be biased, for example assuming that there is no information encoded in the ordering of the data. Maximum margin: when drawing a boundary between two classes, attempt to maximize the width of the boundary. This is the bias used in support vector machines. The assumption is that distinct classes tend to be separated by wide boundaries. Minimum description length: when forming a hypothesis, attempt to minimize the length of the description of the hypothesis. Minimum features: unless there is good evidence that a feature is useful, it should be deleted. This is the assumption behind feature selection algorithms. Nearest neighbors: assume that most of the cases in a small neighborhood in feature space belong to the same class. Given a case for which the class is unknown, guess that it belongs to the same class as the majority in its immediate neighborhood. This is the bias used in the k-nearest neighbors algorithm. The assumption is that cases that are near each other tend to belong to the same class. == Shift of bias == Although most learning algorithms have a static bias, some algorithms are designed to shift their bias as they acquire more data. This does not avoid bias, since the bias shifting process itself must have a bias.

Kernel (image processing)

In image processing, a kernel, convolution matrix, or mask is a small matrix used for blurring, sharpening, embossing, edge detection, and more. This is accomplished by doing a convolution between the kernel and an image. Or more simply, when each pixel in the output image is a function of the nearby pixels (including itself) in the input image, the kernel is that function. == Details == The general expression of a convolution is g x , y = ω ∗ f x , y = ∑ i = − a a ∑ j = − b b ω i , j f x − i , y − j , {\displaystyle g_{x,y}=\omega f_{x,y}=\sum _{i=-a}^{a}{\sum _{j=-b}^{b}{\omega _{i,j}f_{x-i,y-j}}},} where g ( x , y ) {\displaystyle g(x,y)} is the filtered image, f ( x , y ) {\displaystyle f(x,y)} is the original image, ω {\displaystyle \omega } is the filter kernel. Every element of the filter kernel is considered by − a ≤ i ≤ a {\displaystyle -a\leq i\leq a} and − b ≤ j ≤ b {\displaystyle -b\leq j\leq b} . Depending on the element values, a kernel can cause a wide range of effects: The above are just a few examples of effects achievable by convolving kernels and images. === Origin === The origin is the position of the kernel which is above (conceptually) the current output pixel. This could be outside of the actual kernel, though usually it corresponds to one of the kernel elements. For a symmetric kernel, the origin is usually the center element. == Convolution == Convolution is the process of adding each element of the image to its local neighbors, weighted by the kernel. This is related to a form of mathematical convolution. The matrix operation being performed—convolution—is not traditional matrix multiplication, despite being similarly denoted by . For example, if we have two three-by-three matrices, the first a kernel, and the second an image piece, convolution is the process of flipping both the rows and columns of the kernel and multiplying locally similar entries and summing. The element at coordinates [2, 2] (that is, the central element) of the resulting image would be a weighted combination of all the entries of the image matrix, with weights given by the kernel: ( [ a b c d e f g h i ] ∗ [ 1 2 3 4 5 6 7 8 9 ] ) [ 2 , 2 ] = {\displaystyle \left({\begin{bmatrix}a&b&c\\d&e&f\\g&h&i\end{bmatrix}}{\begin{bmatrix}1&2&3\\4&5&6\\7&8&9\end{bmatrix}}\right)[2,2]=} ( i ⋅ 1 ) + ( h ⋅ 2 ) + ( g ⋅ 3 ) + ( f ⋅ 4 ) + ( e ⋅ 5 ) + ( d ⋅ 6 ) + ( c ⋅ 7 ) + ( b ⋅ 8 ) + ( a ⋅ 9 ) . {\displaystyle (i\cdot 1)+(h\cdot 2)+(g\cdot 3)+(f\cdot 4)+(e\cdot 5)+(d\cdot 6)+(c\cdot 7)+(b\cdot 8)+(a\cdot 9).} The other entries would be similarly weighted, where we position the center of the kernel on each of the boundary points of the image, and compute a weighted sum. The values of a given pixel in the output image are calculated by multiplying each kernel value by the corresponding input image pixel values. This can be described algorithmically with the following pseudo-code: for each image row in input image: for each pixel in image row: set accumulator to zero for each kernel row in kernel: for each element in kernel row: if element position corresponding to pixel position then multiply element value corresponding to pixel value add result to accumulator endif set output image pixel to accumulator corresponding input image pixels are found relative to the kernel's origin. If the kernel is symmetric then place the center (origin) of the kernel on the current pixel. The kernel will overlap the neighboring pixels around the origin. Each kernel element should be multiplied with the pixel value it overlaps with and all of the obtained values should be summed. This resultant sum will be the new value for the current pixel currently overlapped with the center of the kernel. If the kernel is not symmetric, it has to be flipped both around its horizontal and vertical axis before calculating the convolution as above. The general form for matrix convolution is [ x 11 x 12 ⋯ x 1 n x 21 x 22 ⋯ x 2 n ⋮ ⋮ ⋱ ⋮ x m 1 x m 2 ⋯ x m n ] ∗ [ y 11 y 12 ⋯ y 1 n y 21 y 22 ⋯ y 2 n ⋮ ⋮ ⋱ ⋮ y m 1 y m 2 ⋯ y m n ] = ∑ i = 0 m − 1 ∑ j = 0 n − 1 x ( m − i ) ( n − j ) y ( 1 + i ) ( 1 + j ) {\displaystyle {\begin{bmatrix}x_{11}&x_{12}&\cdots &x_{1n}\\x_{21}&x_{22}&\cdots &x_{2n}\\\vdots &\vdots &\ddots &\vdots \\x_{m1}&x_{m2}&\cdots &x_{mn}\\\end{bmatrix}}{\begin{bmatrix}y_{11}&y_{12}&\cdots &y_{1n}\\y_{21}&y_{22}&\cdots &y_{2n}\\\vdots &\vdots &\ddots &\vdots \\y_{m1}&y_{m2}&\cdots &y_{mn}\\\end{bmatrix}}=\sum _{i=0}^{m-1}\sum _{j=0}^{n-1}x_{(m-i)(n-j)}y_{(1+i)(1+j)}} === Edge handling === Kernel convolution usually requires values from pixels outside of the image boundaries. There are a variety of methods for handling image edges. Extend The nearest border pixels are conceptually extended as far as necessary to provide values for the convolution. Corner pixels are extended in 90° wedges. Other edge pixels are extended in lines. Wrap The image is conceptually wrapped (or tiled) and values are taken from the opposite edge or corner. Mirror The image is conceptually mirrored at the edges. For example, attempting to read a pixel 3 units outside an edge reads one 3 units inside the edge instead. Crop / Avoid overlap Any pixel in the output image which would require values from beyond the edge is skipped. This method can result in the output image being slightly smaller, with the edges having been cropped. Move kernel so that values from outside of image is never required. Machine learning mainly uses this approach. Example: Kernel size 10x10, image size 32x32, result image is 23x23. Kernel Crop Any pixel in the kernel that extends past the input image isn't used and the normalizing is adjusted to compensate. Constant Use constant value for pixels outside of image. Usually black or sometimes gray is used. Generally this depends on application. === Normalization === Normalization is defined as the division of each element in the kernel by the sum of all kernel elements, so that the sum of the elements of a normalized kernel is unity. This will ensure the average pixel in the modified image is as bright as the average pixel in the original image. === Optimization === Fast convolution algorithms include: separable convolution ==== Separable convolution ==== 2D convolution with an M × N kernel requires M × N multiplications for each sample (pixel). If the kernel is separable, then the computation can be reduced to M + N multiplications. Using separable convolutions can significantly decrease the computation by doing 1D convolution twice instead of one 2D convolution. === Implementation === Here a concrete convolution implementation done with the GLSL shading language :