AI Content Paraphrasing Tool


  • RFPolicy

    RFPolicy

    The RFPolicy outlines a method for contacting vendors about security vulnerabilities found in their products. It was originally written in 2000 by the hacker and security consultant Rain Forest Puppy, and was perhaps the second disclosure policy, following Simple Nomad's. The policy gives the vendor five working days to respond to the reporter of the bug. If the vendor fails to respond within those five working days, or shuts down communication with the reporter, the policy recommends that the reporter disclose the issue to the general community. The reporter should help the vendor reproduce the bug and work out a fix, and should delay notifying the general community if the vendor provides feasible reasons for doing so. When issuing an alert or fix, the vendor should give the reporter proper credit for reporting the bug.

  • Learning classifier system

    Learning classifier system

    Learning classifier systems (LCS) are a paradigm of rule-based machine learning methods that combine a discovery component (typically a genetic algorithm from evolutionary computation) with a learning component (performing either supervised learning, reinforcement learning, or unsupervised learning). Learning classifier systems seek to identify a set of context-dependent rules that collectively store and apply knowledge in a piecewise manner in order to make predictions (e.g. behavior modeling, classification, data mining, regression, function approximation, or game strategy). This approach allows complex solution spaces to be broken up into smaller, simpler parts for reinforcement learning within artificial intelligence research. The founding concepts behind learning classifier systems came from attempts to model complex adaptive systems, using rule-based agents to form an artificial cognitive system (i.e. artificial intelligence).

    == Methodology ==

    The architecture and components of a given learning classifier system can be quite variable. It is useful to think of an LCS as a machine consisting of several interacting components. Components may be added or removed, or existing components modified or exchanged, to suit the demands of a given problem domain (like algorithmic building blocks) or to make the algorithm flexible enough to function in many different problem domains. As a result, the LCS paradigm can be flexibly applied to many problem domains that call for machine learning. The major divisions among LCS implementations are as follows: (1) Michigan-style architecture vs. Pittsburgh-style architecture, (2) reinforcement learning vs. supervised learning, (3) incremental learning vs. batch learning, (4) online learning vs. offline learning, (5) strength-based fitness vs. accuracy-based fitness, and (6) complete action mapping vs. best action mapping. These divisions are not necessarily mutually exclusive.
    For example, XCS, the best known and best studied LCS algorithm, is Michigan-style, was designed for reinforcement learning but can also perform supervised learning, applies incremental learning that can be either online or offline, applies accuracy-based fitness, and seeks to generate a complete action mapping.

    === Elements of a generic LCS algorithm ===

    Keeping in mind that LCS is a paradigm for genetic-based machine learning rather than a specific method, the following outlines key elements of a generic, modern (i.e. post-XCS) LCS algorithm. For simplicity, let us focus on Michigan-style architecture with supervised learning.

    ==== Environment ====

    The environment is the source of data upon which an LCS learns. It can be an offline, finite training dataset (characteristic of a data mining, classification, or regression problem), or an online sequential stream of live training instances. Each training instance is assumed to include some number of features (also referred to as attributes, or independent variables), and a single endpoint of interest (also referred to as the class, action, phenotype, prediction, or dependent variable). Part of LCS learning can involve feature selection, so not all of the features in the training data need to be informative. The set of feature values of an instance is commonly referred to as the state. For simplicity, let's assume an example problem domain with Boolean/binary features and a Boolean/binary class. For Michigan-style systems, one instance from the environment is trained on in each learning cycle (i.e. incremental learning). Pittsburgh-style systems perform batch learning, where rule sets are evaluated in each iteration over much or all of the training data.

    ==== Rule/classifier/population ====

    A rule is a context-dependent relationship between state values and some prediction.
    Rules typically take the form of an {IF:THEN} expression (e.g. {IF 'condition' THEN 'action'}, or as a more specific example, {IF 'red' AND 'octagon' THEN 'stop-sign'}). A critical concept in LCS and rule-based machine learning alike is that an individual rule is not in itself a model, since the rule is only applicable when its condition is satisfied. Think of a rule as a "local model" of the solution space.

    Rules can be represented in many different ways to handle different data types (e.g. binary, discrete-valued, ordinal, continuous-valued). Given binary data, LCS traditionally applies a ternary rule representation (i.e. rules can include either a 0, 1, or '#' for each feature in the data). The 'don't care' symbol (i.e. '#') serves as a wild card within a rule's condition, allowing rules, and the system as a whole, to generalize relationships between features and the target endpoint to be predicted. Consider the following rule: (#1###0 ~ 1) (i.e. condition ~ action). This rule can be interpreted as: IF the second feature = 1 AND the sixth feature = 0 THEN the class prediction = 1. We would say that the second and sixth features were specified in this rule, while the others were generalized. This rule, and the corresponding prediction, are only applicable to an instance when the condition of the rule is satisfied by the instance. This is more commonly referred to as matching.

    In Michigan-style LCS, each rule has its own fitness, as well as a number of other rule parameters associated with it that can describe the number of copies of that rule that exist (i.e. the numerosity), the age of the rule, its accuracy, or the accuracy of its reward predictions, and other descriptive or experiential statistics. A rule along with its parameters is often referred to as a classifier. In Michigan-style systems, classifiers are contained within a population [P] that has a user-defined maximum number of classifiers. Unlike most stochastic search algorithms (e.g. evolutionary algorithms), LCS populations start out empty (i.e. there is no need to randomly initialize a rule population). Classifiers are instead initially introduced to the population with a covering mechanism. In any LCS, the trained model is a set of rules/classifiers, rather than any single rule/classifier. In Michigan-style LCS, the entire trained (and optionally, compacted) classifier population forms the prediction model.

    ==== Matching ====

    One of the most critical and often time-consuming elements of an LCS is the matching process. The first step in an LCS learning cycle takes a single training instance from the environment and passes it to [P], where matching takes place. In step two, every rule in [P] is compared to the training instance to see which rules match (i.e. are contextually relevant to the current instance). In step three, any matching rules are moved to a match set [M]. A rule matches a training instance if all feature values specified in the rule condition are equivalent to the corresponding feature values in the training instance. For example, assuming the training instance is (001001 ~ 0), these rules would match: (###0## ~ 0), (00###1 ~ 0), (#01001 ~ 1), but these rules would not: (1##### ~ 0), (000##1 ~ 0), (#0#1#0 ~ 1). Notice that in matching, the endpoint/action specified by the rule is not taken into consideration. As a result, the match set may contain classifiers that propose conflicting actions. In the fourth step, since we are performing supervised learning, [M] is divided into a correct set [C] and an incorrect set [I]. A matching rule goes into the correct set if it proposes the correct action (based on the known action of the training instance); otherwise it goes into [I]. In reinforcement learning LCS, an action set [A] would be formed here instead, since the correct action is not known.
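The matching step and the split into [C] and [I] can be sketched in a few lines of Python. This is a minimal illustration, not a reference LCS implementation: the names `matches` and `split_match_set`, and the representation of a rule as a `(condition, action)` pair of strings, are assumptions made here for the example.

```python
from typing import List, Tuple

# A rule is modelled here as (condition, action), e.g. ("#1###0", "1").
# This tuple representation is an assumption made for illustration.
Rule = Tuple[str, str]


def matches(condition: str, state: str) -> bool:
    """Return True if every specified (non-'#') position equals the state."""
    return len(condition) == len(state) and all(
        c == "#" or c == s for c, s in zip(condition, state)
    )


def split_match_set(
    population: List[Rule], state: str, correct_action: str
) -> Tuple[List[Rule], List[Rule], List[Rule]]:
    """Form the match set [M], then divide it into [C] and [I]."""
    match_set = [rule for rule in population if matches(rule[0], state)]
    correct = [rule for rule in match_set if rule[1] == correct_action]
    incorrect = [rule for rule in match_set if rule[1] != correct_action]
    return match_set, correct, incorrect


# The six example rules from the text, checked against instance (001001 ~ 0).
population = [
    ("###0##", "0"), ("00###1", "0"), ("#01001", "1"),
    ("1#####", "0"), ("000##1", "0"), ("#0#1#0", "1"),
]
M, C, I = split_match_set(population, "001001", "0")
print("[M]:", M)
print("[C]:", C)
print("[I]:", I)
```

Note that the action plays no part in `matches`; it is only consulted when [M] is divided, which is why [M] can contain rules proposing conflicting actions.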

    ==== Covering ====

    At this point in the learning cycle, if no classifiers made it into either [M] or [C] (as would be the case when the population starts off empty), the covering mechanism is applied (fifth step). Covering is a form of online smart population initialization. Covering randomly generates a rule that matches the current training instance (and in the case of supervised learning, that rule is also generated with the correct action). Assuming the training instance is (001001 ~ 0), covering might generate any of the following rules: (#0#0## ~ 0), (001001 ~ 0), (#010## ~ 0). Covering not only ensures that in each learning cycle there is at least one correct, matching rule in [C], but also that any rule initialized into the population will match at least one training instance. This prevents LCS from exploring the search space of rules that do not match any training instances.

    ==== Parameter updates/credit assignment/learning ====

    In the sixth step, the rule parameters of any rule in [M] are updated to reflect the new experience gained from the current training instance. Depending on the LCS algorithm, a number of updates can take place at this step. For supervised learning, we can simply update the accuracy/error of a

  • Robust principal component analysis

    Robust principal component analysis

    Robust Principal Component Analysis (RPCA) is a modification of the widely used statistical procedure of principal component analysis (PCA) which works well with respect to grossly corrupted observations. A number of different approaches exist for Robust PCA, including an idealized version of Robust PCA, which aims to recover a low-rank matrix $L_0$ from highly corrupted measurements $M = L_0 + S_0$. This decomposition into low-rank and sparse matrices can be achieved by techniques such as the Principal Component Pursuit method (PCP), Stable PCP, Quantized PCP, Block-based PCP, and Local PCP. Optimization methods are then used, such as the Augmented Lagrange Multiplier method (ALM), Alternating Direction Method (ADM), Fast Alternating Minimization (FAM), Iteratively Reweighted Least Squares (IRLS) or alternating projections (AP).

    == Algorithms ==

    === Non-convex method ===

    The 2014 guaranteed algorithm for the robust PCA problem (with the input matrix being $M = L + S$) is an alternating minimization type algorithm. The computational complexity is $O\left(mnr^{2}\log\frac{1}{\epsilon}\right)$, where the input is the superposition of a low-rank matrix (of rank $r$) and a sparse matrix of dimension $m \times n$, and $\epsilon$ is the desired accuracy of the recovered solution, i.e., $\|\widehat{L} - L\|_{F} \leq \epsilon$, where $L$ is the true low-rank component and $\widehat{L}$ is the estimated or recovered low-rank component.

    Intuitively, this algorithm performs projections of the residual onto the set of low-rank matrices (via the SVD operation) and sparse matrices (via entry-wise hard thresholding) in an alternating manner. That is, it performs a low-rank projection of the difference between the input matrix and the sparse matrix obtained at a given iteration, followed by a sparse projection of the difference between the input matrix and the low-rank matrix obtained in the previous step, and iterates the two steps until convergence.

    This alternating projections algorithm was later improved by an accelerated version, called AccAltProj. The acceleration is achieved by applying a tangent space projection before projecting the residue onto the set of low-rank matrices. This trick improves the computational complexity to $O\left(mnr\log\frac{1}{\epsilon}\right)$ with a much smaller constant in front, while maintaining the theoretically guaranteed linear convergence.

    Another fast version of the accelerated alternating projections algorithm is IRCUR. It uses the structure of CUR decomposition in the alternating projections framework to dramatically reduce the computational complexity of RPCA to $O\left(\max\{m,n\}\,r^{2}\log(m)\log(n)\log\frac{1}{\epsilon}\right)$.

    === Convex relaxation ===

    This method consists of relaxing the rank constraint $\operatorname{rank}(L)$ in the optimization problem to the nuclear norm $\|L\|_{*}$ and the sparsity constraint $\|S\|_{0}$ to the $\ell_{1}$-norm $\|S\|_{1}$. The resulting program can be solved using methods such as the method of Augmented Lagrange Multipliers.

    === Deep-learning augmented method ===

    Some recent works propose RPCA algorithms with learnable/trainable parameters.
    Such a learnable/trainable algorithm can be unfolded as a deep neural network whose parameters can be learned via machine learning techniques from a given dataset or problem distribution. The learned algorithm is then expected to perform better on the corresponding problem distribution.

    == Applications ==

    RPCA has many important real-life applications, particularly when the data under study can naturally be modeled as a low-rank plus a sparse contribution. The following examples are inspired by contemporary challenges in computer science; depending on the application, either the low-rank component or the sparse component could be the object of interest.

    === Video surveillance ===

    Given a sequence of surveillance video frames, it is often required to identify the activities that stand out from the background. If we stack the video frames as columns of a matrix $M$, then the low-rank component $L_0$ naturally corresponds to the stationary background and the sparse component $S_0$ captures the moving objects in the foreground.

    === Face recognition ===

    Images of a convex, Lambertian surface under varying illuminations span a low-dimensional subspace. This is one of the reasons for the effectiveness of low-dimensional models for imagery data. In particular, it is easy to approximate images of a human face by a low-dimensional subspace. Correctly retrieving this subspace is crucial in many applications such as face recognition and alignment. It turns out that RPCA can be applied successfully to this problem to exactly recover the face.
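As a worked form of the convex relaxation described in the Algorithms section, the relaxed problem is commonly written as the program below, known as Principal Component Pursuit. The trade-off weight $\lambda > 0$ does not appear in the text above and is an assumption of this sketch; it balances the nuclear-norm term against the $\ell_1$ term.

```latex
\begin{aligned}
\min_{L,\,S}\quad & \|L\|_{*} + \lambda\,\|S\|_{1} \\
\text{subject to}\quad & L + S = M
\end{aligned}
```

Here $\|L\|_{*}$ is the sum of the singular values of $L$ and $\|S\|_{1}$ is the sum of the absolute values of the entries of $S$; they stand in for $\operatorname{rank}(L)$ and $\|S\|_{0}$ respectively, which makes the problem convex.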

  • Ground truth

    Ground truth

    Ground truth is information that is known to be real or true, provided by direct observation and measurement (i.e. empirical evidence) as opposed to information provided by inference. The term ground truth appeared in remote sensing literature as early as 1972, when NASA described it as essential "data about ... materials on the earth's surface" used to calibrate measurements. It was later adopted by the statistical modeling and machine learning communities.

    == Etymology ==

    The Oxford English Dictionary (s.v. ground truth) records the use of the word Groundtruth in the sense of 'fundamental truth' from Henry Ellison's poem "The Siberian Exile's Tale", published in 1833.

    == Usage ==

    The term "ground truth" can be used as a noun, adjective, and verb.

    * Noun: "ground truth" (no hyphen). Example: "The ground truth is essential for training accurate models."
    * Adjective: "ground-truth" (hyphenated compound adjective). Example: "We need to use ground-truth data to validate the model."
    * Verb: "to ground-truth" or "to groundtruth" (compound verb). Example: "We need to ground-truth the results to ensure their accuracy."

    == Statistics and machine learning ==

    In statistics and machine learning, ground truth is the ideal expected result, used in statistical models to prove or disprove research hypotheses. "Ground truthing" is the process of gathering good data for this test. Ground truth is typically included in labeled data.

    In machine learning, "ground truth" is not necessarily objectively correct or true. For example, in training AI models or relevance rankers, it may be a set of judgments made by people or inferred from user behavior, which may depend on context. Similarly, in Bayesian spam filtering, a supervised learning system is typically trained on examples labeled as spam and non-spam. Although these labels may be subjective or inaccurate, they are considered ground truth.

    True ground truth in machine learning is objective data.
    For example, suppose we are testing a stereo vision system to see how well it can estimate 3D positions. A calibrated laser rangefinder may provide accurate distances as ground truth.

    == Remote sensing ==

    In remote sensing, "ground truth" refers to information collected at the imaged location. Ground truth allows image data to be related to real features and materials on the ground. The collection of ground truth data enables calibration of remote-sensing data, and aids in the interpretation and analysis of what is being sensed. Examples include cartography, meteorology, analysis of aerial photographs, satellite imagery and other techniques in which data are gathered at a distance.

    More specifically, ground truth may refer to a process in which "pixels" on a satellite image are compared to what is imaged (at the time of capture) in order to verify the contents of the "pixels" in the image (noting that the concept of "pixel" is imaging-system-dependent). In the case of a classified image, supervised classification can help to determine the accuracy of the classification by the remote sensing system, which can minimize error in the classification.

    Ground truthing is usually done on site, correlating what is known with surface observations and measurements of various properties of the features of the ground resolution cells under study in the remotely sensed digital image. The process also involves taking geographic coordinates of the ground resolution cell with GPS technology and comparing those with the coordinates of the "pixel" being studied provided by the remote sensing software, in order to understand and analyze the location errors and how they may affect a particular study.

    Ground truth is important in the initial supervised classification of an image. When the identity and location of land cover types are known through a combination of field work, maps, and personal experience, these areas are known as training sites.
    The spectral characteristics of these areas are used to train the remote sensing software using decision rules for classifying the rest of the image. These decision rules, such as Maximum Likelihood Classification, Parallelepiped Classification, and Minimum Distance Classification, offer different techniques to classify an image. Additional ground truth sites allow the remote sensor to establish an error matrix that validates the accuracy of the classification method used. Different classification methods may have different percentages of error for a given classification project. It is important that the remote sensor chooses a classification method that works best with the number of classifications used while providing the least amount of error.

    Ground truth also helps with atmospheric correction. Since images from satellites have to pass through the atmosphere, they can get distorted because of absorption in the atmosphere. So ground truth can help fully identify objects in satellite photos.

    === Errors of commission ===

    An example of an error of commission is when a pixel reports the presence of a feature (such as a tree) that, in reality, is absent (no tree is actually present). Ground truthing ensures that the error matrices have a higher accuracy percentage than would be the case if no pixels were ground-truthed. The commission error is the complement of the user's accuracy, i.e. Commission Error = 1 - user's accuracy.

    === Errors of omission ===

    An example of an error of omission is when pixels of a certain type, for example maple trees, are not classified as maple trees. The process of ground-truthing helps to ensure that the pixel is classified correctly and the error matrices are more accurate. The omission error is the complement of the producer's accuracy, i.e. Omission Error = 1 - producer's accuracy.

    == Geographical information systems ==

    In GIS the spatial data is modeled as a field (as in remote sensing raster images) or as an object (as in vector map representations).
    They are modeled from the real world (also named geographical reality), typically by a cartographic process.

    Geographic information and positioning systems such as GIS, GPS, and GNSS have become so widespread that the term "ground truth" has taken on special meaning in that context. If the location coordinates returned by a location method such as GPS are an estimate of a location, then the "ground truth" is the actual location on Earth. A smartphone might return a set of estimated location coordinates such as 43.87870, −103.45901. The ground truth being estimated by those coordinates is the tip of George Washington's nose on Mount Rushmore. The accuracy of the estimate is the maximum distance between the location coordinates and the ground truth. We could say in this case that the estimate accuracy is 10 meters, meaning that the point on Earth represented by the location coordinates is thought to be within 10 meters of George's nose, the ground truth. In slang, the coordinates indicate where we think George Washington's nose is located, and the ground truth is where it really is. In practice a smartphone or hand-held GPS unit is routinely able to estimate the ground truth within 6–10 meters. Specialized instruments can reduce GPS measurement error to under a centimeter.

    == Military usage ==

    US military slang uses "ground truth" to refer to the facts comprising a tactical situation, as opposed to intelligence reports, mission plans, and other descriptions reflecting the conative or policy-based projections of the military-industrial complex. The term appears in the title of the Iraq War documentary film The Ground Truth (2006), and also in military publications, for example Stars and Stripes saying: "Stripes decided to figure out what the ground truth was in Iraq."
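Returning to the errors of commission and omission defined in the remote sensing section, the two formulas can be worked through on a small error matrix. The sketch below is illustrative only. It assumes the common convention that rows are the classes assigned by the classifier and columns are the ground-truth classes, so that user's accuracy is the diagonal entry divided by its row total and producer's accuracy is the diagonal entry divided by its column total. The function name `commission_omission` and the sample counts are made up for the example.

```python
from typing import Dict, List


def commission_omission(
    matrix: List[List[int]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Per-class commission and omission errors from an error matrix.

    Assumed convention: matrix[i][j] counts pixels classified as labels[i]
    whose ground-truth class is labels[j].
    """
    n = len(labels)
    result = {}
    for k, label in enumerate(labels):
        row_total = sum(matrix[k])                       # all pixels classified as k
        col_total = sum(matrix[i][k] for i in range(n))  # all pixels truly of class k
        users = matrix[k][k] / row_total if row_total else float("nan")
        producers = matrix[k][k] / col_total if col_total else float("nan")
        result[label] = {
            "commission": 1.0 - users,     # Commission Error = 1 - user's accuracy
            "omission": 1.0 - producers,   # Omission Error = 1 - producer's accuracy
        }
    return result


# Hypothetical counts: 50 pixels classified as "tree", 50 as "other".
labels = ["tree", "other"]
matrix = [
    [40, 10],  # classified as tree: 40 really tree, 10 really other
    [5, 45],   # classified as other: 5 really tree, 45 really other
]
errors = commission_omission(matrix, labels)
for label in labels:
    print(label, errors[label])
```

With these made-up counts, 10 of the 50 pixels classified as "tree" are not trees (commission error 10/50), and 5 of the 45 true tree pixels were missed (omission error 5/45).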

  • Region Based Convolutional Neural Networks

    Region Based Convolutional Neural Networks

    Region-based Convolutional Neural Networks (R-CNN) are a family of machine learning models for computer vision, and specifically object detection and localization. The original goal of R-CNN was to take an input image and produce a set of bounding boxes as output, where each bounding box contains an object and also the category (e.g. car or pedestrian) of the object. In general, R-CNN architectures perform selective search over feature maps output by a CNN. R-CNN has been extended to perform other computer vision tasks, such as tracking objects from a drone-mounted camera, locating text in an image, and enabling object detection in Google Lens. Mask R-CNN is also one of seven tasks in the MLPerf Training Benchmark, which is a competition to speed up the training of neural networks.

    == History ==

    The following covers some of the versions of R-CNN that have been developed.

    * November 2013: R-CNN.
    * April 2015: Fast R-CNN.
    * June 2015: Faster R-CNN.
    * March 2017: Mask R-CNN.
    * December 2017: Cascade R-CNN is trained with increasing Intersection over Union (IoU, also known as the Jaccard index) thresholds, making each stage more selective against nearby false positives.
    * June 2019: Mesh R-CNN adds the ability to generate a 3D mesh from a 2D image.

    == Architecture ==

    === Selective search ===

    Given an image (or an image-like feature map), selective search (also called Hierarchical Grouping) first segments the image by the algorithm in (Felzenszwalb and Huttenlocher, 2004), then performs the following:

        Input: (colour) image
        Output: Set of object location hypotheses L

        Segment image into initial regions R = {r1, ..., rn} using Felzenszwalb and Huttenlocher (2004)
        Initialise similarity set S = ∅
        foreach Neighbouring region pair (ri, rj) do
            Calculate similarity s(ri, rj)
            S = S ∪ s(ri, rj)
        while S ≠ ∅ do
            Get highest similarity s(ri, rj) = max(S)
            Merge corresponding regions rt = ri ∪ rj
            Remove similarities regarding ri: S = S \ s(ri, r∗)
            Remove similarities regarding rj: S = S \ s(r∗, rj)
            Calculate similarity set St between rt and its neighbours
            S = S ∪ St
            R = R ∪ rt
        Extract object location boxes L from all regions in R

    === R-CNN ===

    With R-CNN, prediction follows a two-step process. A preprocessing selective search step generates a large set of candidate objects (typically as many as 2000), known as regions of interest (ROI). These are forwarded to a CNN, which predicts an object class score and bounding box estimate, independently for each ROI.

    Importantly, the ROIs are heavily filtered to remove excess candidates. This is achieved using two mechanisms. Filtering begins by removing ROIs assigned to the background category. This is a specialized category, which is scored by the CNN alongside other categories. An unfortunate reality is that the remaining ROIs typically suffer from heavy duplication. Namely, multiple ROIs that cover the same objects in the image are all assigned non-background categories. This is resolved by a heuristic non-maximum suppression (NMS) step.

    === Fast R-CNN ===

    While the original R-CNN independently computed the neural network features on each of as many as two thousand regions of interest, Fast R-CNN runs the neural network once on the whole image.
    At the end of the network is a ROIPooling module, which slices out each ROI from the network's output tensor, reshapes it, and classifies it. As in the original R-CNN, Fast R-CNN uses selective search to generate its region proposals.

    === Faster R-CNN ===

    While Fast R-CNN used selective search to generate ROIs, Faster R-CNN integrates the ROI generation into the neural network itself.

    === Mask R-CNN ===

    While previous versions of R-CNN focused on object detection, Mask R-CNN adds instance segmentation. Mask R-CNN also replaced ROIPooling with a new method called ROIAlign, which can represent fractions of a pixel.
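The non-maximum suppression step mentioned in the R-CNN section, together with the Intersection over Union measure from the History list, can be sketched as follows. This is a minimal greedy variant in plain Python, not the code of any particular R-CNN release. The box format `(x1, y1, x2, y2)`, the function names, and the default threshold of 0.5 are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union (Jaccard index) of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping candidates for one object, plus a separate object.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the surviving boxes
```

In a detector this would normally be run per category, after the background ROIs have been removed.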

  • Count sketch

    Count sketch

    Count sketch is a type of dimensionality reduction that is particularly efficient in statistics, machine learning and algorithms. It was invented by Moses Charikar, Kevin Chen and Martin Farach-Colton in an effort to speed up the AMS Sketch by Alon, Matias and Szegedy for approximating the frequency moments of streams (these calculations require counting of the number of occurrences for the distinct elements of the stream). The sketch is nearly identical to the Feature hashing algorithm by John Moody, but differs in its use of hash functions with low dependence, which makes it more practical. In order to still have a high probability of success, the median trick is used to aggregate multiple count sketches, rather than the mean. These properties allow use for explicit kernel methods, bilinear pooling in neural networks and is a cornerstone in many numerical linear algebra algorithms. == Intuitive explanation == The inventors of this data structure offer the following iterative explanation of its operation: at the simplest level, the output of a single hash function s mapping stream elements q into {+1, -1} is feeding a single up/down counter C. After a single pass over the data, the frequency n ( q ) {\displaystyle n(q)} of a stream element q can be approximated, although extremely poorly, by the expected value E [ C ⋅ s ( q ) ] {\displaystyle {\mathbf {E}}[C\cdot s(q)]} ; a straightforward way to improve the variance of the previous estimate is to use an array of different hash functions s i {\displaystyle s_{i}} , each connected to its own counter C i {\displaystyle C_{i}} . 
For each i, the E [ C i ⋅ s i ( q ) ] = n ( q ) {\displaystyle {\mathbf {E}}[C_{i}\cdot s_{i}(q)]=n(q)} still holds, so averaging across the i range will tighten the approximation; the previous construct still has a major deficiency: if a lower-frequency-but-still-important output element a exhibits a hash collision with a high-frequency element even for one of the s i {\displaystyle s_{i}} hashes, n ( a ) {\displaystyle n(a)} estimate can be significantly affected. Avoiding this requires reducing the frequency of collision counter updates between any two distinct elements. This is achieved by replacing each C i {\displaystyle C_{i}} in the previous construct with an array of m counters (making the counter set into a two-dimensional matrix C i , j {\displaystyle C_{i,j}} ), with index j of a particular counter to be incremented/decremented selected via another set of hash functions h i {\displaystyle h_{i}} that map element q into the range {1..m}. Since E [ C i , h i ( q ) ⋅ s i ( q ) ] = n ( q ) {\displaystyle {\mathbf {E}}[C_{i,h_{i}(q)}\cdot s_{i}(q)]=n(q)} , averaging across all values of i will work. == Mathematical definition == 1. For constants w {\displaystyle w} and t {\displaystyle t} (to be defined later) independently choose d = 2 t + 1 {\displaystyle d=2t+1} random hash functions h 1 , … , h d {\displaystyle h_{1},\dots ,h_{d}} and s 1 , … , s d {\displaystyle s_{1},\dots ,s_{d}} such that h i : [ n ] → [ w ] {\displaystyle h_{i}:[n]\to [w]} and s i : [ n ] → { ± 1 } {\displaystyle s_{i}:[n]\to \{\pm 1\}} . It is necessary that the hash families from which h i {\displaystyle h_{i}} and s i {\displaystyle s_{i}} are chosen be pairwise independent. 2. For each item q i {\displaystyle q_{i}} in the stream, add s j ( q i ) {\displaystyle s_{j}(q_{i})} to the h j ( q i ) {\displaystyle h_{j}(q_{i})} th bucket of the j {\displaystyle j} th hash. 
At the end of this process, one has w d {\displaystyle wd} sums ( C i j ) {\displaystyle (C_{ij})} where C i , j = ∑ h i ( k ) = j s i ( k ) . {\displaystyle C_{i,j}=\sum _{h_{i}(k)=j}s_{i}(k).} To estimate the count of q {\displaystyle q} s one computes the following value: r q = median i = 1 d s i ( q ) ⋅ C i , h i ( q ) . {\displaystyle r_{q}={\text{median}}_{i=1}^{d}\,s_{i}(q)\cdot C_{i,h_{i}(q)}.} The values s i ( q ) ⋅ C i , h i ( q ) {\displaystyle s_{i}(q)\cdot C_{i,h_{i}(q)}} are unbiased estimates of how many times q {\displaystyle q} has appeared in the stream. The estimate r q {\displaystyle r_{q}} has variance O ( m i n { m 1 2 / w 2 , m 2 2 / w } ) {\displaystyle O(\mathrm {min} \{m_{1}^{2}/w^{2},m_{2}^{2}/w\})} , where m 1 {\displaystyle m_{1}} is the length of the stream and m 2 2 {\displaystyle m_{2}^{2}} is ∑ q ( ∑ i [ q i = q ] ) 2 {\displaystyle \sum _{q}(\sum _{i}[q_{i}=q])^{2}} . Furthermore, r q {\displaystyle r_{q}} is guaranteed to never be more than 2 m 2 / w {\displaystyle 2m_{2}/{\sqrt {w}}} off from the true value, with probability 1 − e − O ( t ) {\displaystyle 1-e^{-O(t)}} . === Vector formulation === Alternatively Count-Sketch can be seen as a linear mapping with a non-linear reconstruction function. Let M ( i ∈ [ d ] ) ∈ { − 1 , 0 , 1 } w × n {\displaystyle M^{(i\in [d])}\in \{-1,0,1\}^{w\times n}} , be a collection of d = 2 t + 1 {\displaystyle d=2t+1} matrices, defined by M h i ( j ) , j ( i ) = s i ( j ) {\displaystyle M_{h_{i}(j),j}^{(i)}=s_{i}(j)} for j ∈ [ w ] {\displaystyle j\in [w]} and 0 everywhere else. Then a vector v ∈ R n {\displaystyle v\in \mathbb {R} ^{n}} is sketched by C ( i ) = M ( i ) v ∈ R w {\displaystyle C^{(i)}=M^{(i)}v\in \mathbb {R} ^{w}} . To reconstruct v {\displaystyle v} we take v j ∗ = median i C j ( i ) s i ( j ) {\displaystyle v_{j}^{}={\text{median}}_{i}C_{j}^{(i)}s_{i}(j)} . 
This gives the same guarantees as stated above, if we take $m_1 = \|v\|_1$ and $m_2 = \|v\|_2$. == Relation to Tensor sketch == The count sketch projection of the outer product of two vectors is equivalent to the convolution of two component count sketches. The count sketch computes a vector convolution $C^{(1)}x \ast C^{(2)}x^{T}$, where $C^{(1)}$ and $C^{(2)}$ are independent count sketch matrices. Pham and Pagh show that this equals $C(x \otimes x^{T})$, a count sketch $C$ of the outer product of vectors, where $\otimes$ denotes Kronecker product. The fast Fourier transform can be used to do fast convolution of count sketches. By using the face-splitting product such structures can be computed much faster than normal matrices.
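The mathematical definition above (d = 2t + 1 rows of w signed counters, with a median over the per-row estimates) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `CountSketch`, the restriction to integer items, and the simple `(a*x + b) mod p` hash family are assumptions made for the example.

```python
import random
from statistics import median


class CountSketch:
    """Illustrative Count sketch over integer items (names are hypothetical)."""

    _P = (1 << 61) - 1  # prime modulus for the (a*x + b) mod p hash family

    def __init__(self, w, t, seed=0):
        rng = random.Random(seed)
        self.w = w
        self.d = 2 * t + 1
        # One (a, b) pair per row for the bucket hash h_i and for the sign hash s_i.
        self._h = [(rng.randrange(1, self._P), rng.randrange(self._P))
                   for _ in range(self.d)]
        self._s = [(rng.randrange(1, self._P), rng.randrange(self._P))
                   for _ in range(self.d)]
        self.C = [[0] * w for _ in range(self.d)]  # the d x w counter matrix

    def _bucket(self, i, q):
        a, b = self._h[i]
        return ((a * q + b) % self._P) % self.w

    def _sign(self, i, q):
        a, b = self._s[i]
        return 1 if ((a * q + b) % self._P) % 2 == 0 else -1

    def add(self, q, count=1):
        # Step 2 of the definition: add s_i(q) to bucket h_i(q) in every row.
        for i in range(self.d):
            self.C[i][self._bucket(i, q)] += self._sign(i, q) * count

    def estimate(self, q):
        # r_q = median over rows of s_i(q) * C[i][h_i(q)].
        return median(self._sign(i, q) * self.C[i][self._bucket(i, q)]
                      for i in range(self.d))


sketch = CountSketch(w=64, t=2)
for item in [1] * 50 + [2] * 3:
    sketch.add(item)
print(sketch.estimate(1), sketch.estimate(2))
```

Taking the median rather than the mean is what limits the damage from a single row in which a rare item collides with a frequent one.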

    Read more →
  • Pruning (artificial neural network)

    Pruning (artificial neural network)

    In deep learning, pruning is the practice of removing parameters from an existing artificial neural network. The goal of this process is to reduce the size (parameter count) of the neural network (and therefore the computational resources required to run it) whilst maintaining accuracy. This can be compared to the biological process of synaptic pruning which takes place in mammalian brains during development. == Node (neuron) pruning == A basic algorithm for pruning is as follows: Evaluate the importance of each neuron. Rank the neurons according to their importance (assuming there is a clearly defined measure for "importance"). Remove the least important neuron. Check a termination condition (to be determined by the user) to see whether to continue pruning. == Edge (weight) pruning == Most work on neural network pruning does not remove full neurons or layers (structured pruning). Instead, it focuses on removing the most insignificant weights (unstructured pruning), namely, setting their values to zero. This can either be done globally by comparing weights from all layers in the network or locally by comparing weights in each layer separately. Different metrics can be used to measure the importance of each weight. Weight magnitude as well as combinations of weight and gradient information are commonly used metrics. Early work suggested also to change the values of non-pruned weights. == When to prune the neural network? == Pruning can be applied at three different stages: before training, during training, or after training. When pruning is performed during or after training, additional fine-tuning epochs are typically required. Each approach involves different trade-offs between accuracy and computational cost.
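The difference between global and local magnitude pruning described above can be illustrated with a small sketch. It assumes each layer is just a flat Python list of weights and uses absolute value as the importance metric; the function names `prune_smallest` and `magnitude_prune` are hypothetical and do not come from any particular library.

```python
def prune_smallest(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction set to 0."""
    k = int(sparsity * len(weights))  # number of weights to remove
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def magnitude_prune(layers, sparsity, scope="global"):
    """Unstructured magnitude pruning over a list of flat weight lists."""
    if scope == "local":
        # Compare weights within each layer separately.
        return [prune_smallest(layer, sparsity) for layer in layers]
    # Global: compare weights from all layers at once, then split back.
    flat = [w for layer in layers for w in layer]
    pruned = prune_smallest(flat, sparsity)
    out, start = [], 0
    for layer in layers:
        out.append(pruned[start:start + len(layer)])
        start += len(layer)
    return out


layers = [[0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0, 4.0]]
print(magnitude_prune(layers, 0.5, scope="global"))
print(magnitude_prune(layers, 0.5, scope="local"))
```

In this toy input the global variant removes the whole first layer because all of its weights are smaller than any weight in the second layer, while the local variant removes half of each layer.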

    Read more →
  • Information gain ratio

    Information gain ratio

    In decision tree learning, information gain ratio is the ratio of information gain to the intrinsic information. It was proposed by Ross Quinlan to reduce a bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute. Information gain is also known as mutual information. == Information gain calculation == Information gain is the reduction in entropy produced from partitioning a set with attributes $a$ and finding the optimal candidate that produces the highest value: $\text{IG}(T,a) = \mathrm{H}(T) - \mathrm{H}(T|a)$, where $T$ is a random variable and $\mathrm{H}(T|a)$ is the entropy of $T$ given the value of attribute $a$. The information gain is equal to the total entropy for an attribute if for each of the attribute values a unique classification can be made for the result attribute. In this case the relative entropies subtracted from the total entropy are 0. == Split information calculation == The split information value for a test is defined as follows: $\text{SplitInformation}(X) = -\sum_{i=1}^{n} \frac{\mathrm{N}(x_i)}{\mathrm{N}(x)} \log_2 \frac{\mathrm{N}(x_i)}{\mathrm{N}(x)}$, where $X$ is a discrete random variable with possible values $x_1, x_2, \dots, x_n$, $\mathrm{N}(x_i)$ is the number of times that $x_i$ occurs, and $\mathrm{N}(x)$ is the total count of events, with $x$ the set of events. The split information value is a positive number that describes the potential worth of splitting a branch from a node.
This in turn is the intrinsic value that the random variable possesses and will be used to remove the bias in the information gain ratio calculation. == Information gain ratio calculation == The information gain ratio is the ratio between the information gain and the split information value: $\text{IGR}(T,a) = \text{IG}(T,a) / \text{SplitInformation}(T)$, or, written out in full, $\text{IGR}(T,a) = \frac{-\sum_{i=1}^{n} \mathrm{P}(T)\log \mathrm{P}(T) - \left(-\sum_{i=1}^{n} \mathrm{P}(T|a)\log \mathrm{P}(T|a)\right)}{-\sum_{i=1}^{n} \frac{\mathrm{N}(t_i)}{\mathrm{N}(t)} \log_2 \frac{\mathrm{N}(t_i)}{\mathrm{N}(t)}}$. == Example == Using weather data published by Fordham University, the table below was created: Using the table above, one can find the entropy, information gain, split information, and information gain ratio for each variable (outlook, temperature, humidity, and wind). These calculations are shown in the tables below: Using the above tables, one can deduce that Outlook has the highest information gain ratio. Next, one must find the statistics for the sub-groups of the Outlook variable (sunny, overcast, and rainy); for this example one will only build the sunny branch (as shown in the table below): One can find the following statistics for the other variables (temperature, humidity, and wind) to see which have the greatest effect on the sunny element of the outlook variable: Humidity was found to have the highest information gain ratio. One will repeat the same steps as before and find the statistics for the events of the Humidity variable (high and normal): Since the play values are either all "No" or "Yes", the information gain ratio value will be equal to 1.
Also, now that one has reached the end of the variable chain, with Wind being the last variable left, one can build an entire root-to-leaf branch of the decision tree. After reaching this leaf node, one would follow the same procedure for the rest of the elements that have yet to be split in the decision tree. This set of data was relatively small; with a larger set, the advantages of using the information gain ratio as the splitting factor of a decision tree would be more apparent. == Advantages == Information gain ratio biases the decision tree against considering attributes with a large number of distinct values. For example, suppose that we are building a decision tree for some data describing a business's customers. Information gain ratio is used to decide which of the attributes are the most relevant. These will be tested near the root of the tree. One of the input attributes might be the customer's telephone number. This attribute has a high information gain, because it uniquely identifies each customer. Due to its large number of distinct values, it will not be chosen to be tested near the root. == Disadvantages == Although information gain ratio solves the key problem of information gain, it creates another problem. If one is considering a number of attributes that have a high number of distinct values, these will never be ranked above one that has a lower number of distinct values. == Difference from information gain == Information gain's shortcoming is that it does not provide a numerical difference between attributes with many distinct values and those with fewer. Example: Suppose that we are building a decision tree for some data describing a business's customers. Information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number.
This attribute has a high information gain, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before. Information gain ratio's strength is that it has a bias towards the attributes with the lower number of distinct values. Below is a table describing the differences of information gain and information gain ratio when put in certain scenarios.
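The formulas for information gain, split information, and their ratio can be checked on a toy data set. The sketch below is only an illustration: the four-row customer data, the attribute names `phone` and `plan`, and the function names are invented for the example and are not the Fordham weather data referred to in the text.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(attribute, labels):
    """IG(T, a) = H(T) - H(T | a) for one attribute column."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def split_information(attribute):
    """Entropy of the partition sizes induced by the attribute."""
    return entropy(attribute)


def gain_ratio(attribute, labels):
    si = split_information(attribute)
    return information_gain(attribute, labels) / si if si > 0 else 0.0


labels = ["yes", "yes", "no", "no"]   # hypothetical target
phone = [1, 2, 3, 4]                  # unique per customer
plan = ["a", "a", "b", "b"]           # two-valued attribute

print(information_gain(phone, labels), information_gain(plan, labels))  # 1.0 1.0
print(gain_ratio(phone, labels), gain_ratio(plan, labels))              # 0.5 1.0
```

Both attributes have the same information gain here, but the unique-per-customer attribute is penalized by its larger split information, which is the behavior the Advantages section describes.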

    Read more →
  • 2018 Google data breach

    2018 Google data breach

    The 2018 Google data breach was a major data privacy scandal in which the Google+ API exposed the private data of over five hundred thousand users. Google+ managers first noticed harvesting of personal data in March 2018, during a review following the Facebook–Cambridge Analytica data scandal. The bug, despite having been fixed immediately, exposed the private data of approximately 500,000 Google+ users to the public. Google did not reveal the leak to the network's users. In November 2018, another data breach occurred following an update to the Google+ API. Although Google found no evidence of misuse, approximately 52.5 million personal profiles were potentially exposed. Google announced that Google+ would shut down in August 2019, citing low use and technological challenges; the date was later moved forward to April 2019. == Overview of Google+ == Google+ was launched in June 2011 as an invite-only social network, but was opened for public access later in the year. It was managed by Vic Gundotra. Similar to Facebook, Google+ included the key features Circles, Hangouts and Sparks. Circles let users personalize their social groups by sorting friends into different categories. Once allowed into a Circle, users could regulate information in their individual spaces. Hangouts included video chatting and instant messaging between users. Sparks allowed Google to track users' past searches to find news and content related to their interests. Google+ was linked to other Google services, such as YouTube, Google Drive and Gmail, giving it access to roughly 2 billion user accounts. However, fewer than 400 million consumers actively used Google+, with 90% of those users using it for less than five seconds. == The breaches == In March 2018, Google developers found a data breach within the Google+ People API in which external apps acquired access to Profile fields that were not marked as public.
According to The Wall Street Journal, Google did not disclose the breach when it was first discovered in March in order to avoid regulatory scrutiny and reputational damage. The breach included 500,000 Google+ accounts and allowed 438 external apps unauthorized access to users' private names, email addresses, occupations, genders and ages. This information was available between 2015 and 2018. Google found no evidence of any user's personal information being misused, nor that any third-party app developers were aware of the leak. In November 2018, a software update created another data breach within the Google+ API. The bug affected 52.5 million users; as in the March breach, unauthorized apps were able to access Google+ profiles, including users' names, email addresses, occupations and ages. Apps could not access financial information, national identification numbers, or passwords. Blog posts, messages and phone numbers also remained inaccessible if marked as private. Unlike the previous breach, access was only available for six days before Google+ learned of the breach. Once more, Google+ found no evidence of data being misused by third-party developers. == Responses == In October 2018, The Wall Street Journal published an article outlining the initial breach and Google's decision not to disclose it to users. At the time, there was no federal law that required Google to inform its consumers of data breaches. Google+ originally did not disclose the breach out of fear of being compared to Facebook's recent data leak and of a subsequent loss of consumer confidence. In response to the Wall Street Journal article, Google announced that Google+ would be shut down in August 2019. After the second data leak, the date was moved to April 2019. In response to the data breach, enterprise consumers were notified of the bug's impact and given instructions on how to save, download and delete their data prior to the Google+ shutdown.
Google's Privacy and Data Protection Office found no misuse of user data. Prior to the Google+ shutdown, Google set a 10-month period in which users could download and migrate their data. After the 10-month period, user content was deleted. From 4 February 2019, consumers were no longer able to create new Google+ profiles. Google shut down Google+ APIs on 7 March 2019 to ensure that developers did not continue to rely on the APIs prior to the Google+ shutdown. Google is the principal subsidiary of its parent company, Alphabet Inc. After the data breach, Alphabet Inc. share prices fell by 1% to $1,157.06 on 9 October 2018 after an earlier drop to $1,135.40 that morning, the lowest price since 5 July 2018. After the publication of The Wall Street Journal article, share prices dropped by as much as 2.1% in the two days to 10 October 2018. Share prices steadily increased from this point and met the 8 October 2018 share price on 5 February 2019. Google planned to rebuild Google+ as a corporate enterprise network. Google Play will now assess which apps can ask for permission to access the user's SMS data. Only the default app for telephone distribution is able to make requests. Prior to the data breaches, apps were able to request access to all of a consumer's data simultaneously. Now, each app must request permission for each aspect of a consumer's profile.

    Read more →
  • Automated Pain Recognition

    Automated Pain Recognition

    Automated Pain Recognition (APR) is a method for objectively measuring pain and at the same time represents an interdisciplinary research area that comprises elements of medicine, psychology, psychobiology, and computer science. The focus is on computer-aided objective recognition of pain, implemented on the basis of machine learning. Automated pain recognition allows for the valid, reliable detection and monitoring of pain in people who are unable to communicate verbally. The underlying machine learning processes are trained and validated in advance by means of unimodal or multimodal body signals. Signals used to detect pain may include facial expressions or gestures and may also be of a (psycho-)physiological or paralinguistic nature. To date, the focus has been on identifying pain intensity, but visionary efforts are also being made to recognize the quality, site, and temporal course of pain. However, the clinical implementation of this approach is a controversial topic in the field of pain research. Critics of automated pain recognition argue that pain diagnosis can only be performed subjectively by humans. == Background == Pain diagnosis under conditions where verbal reporting is restricted (such as in verbally and/or cognitively impaired people or in patients who are sedated or mechanically ventilated) is based on behavioral observations by trained professionals. However, all known observation procedures (e.g., Zurich Observation Pain Assessment (ZOPA); Pain Assessment in Advanced Dementia Scale (PAINAD)) require a great deal of specialist expertise. These procedures can be made more difficult by perception- and interpretation-related misjudgments on the part of the observer. Given the differences in design, methodology, evaluation sample, and conceptualization of the phenomenon of pain, it is difficult to compare the quality criteria of the various tools.
Even if trained personnel could theoretically record pain intensity several times a day using observation instruments, it would not be possible to measure it every minute or second. In this respect, the goal of automated pain recognition is to use valid, robust pain response patterns that can be recorded multimodally for a temporally dynamic, high-resolution, automated pain intensity recognition system. == Procedure == For automated pain recognition, pain-relevant parameters are usually recorded using non-invasive sensor technology, which captures data on the (physical) responses of the person in pain. This can be achieved with camera technology that captures facial expressions, gestures, or posture, while audio sensors record paralinguistic features. (Psycho-)physiological information such as muscle tone and heart rate can be collected via biopotential sensors (electrodes). Pain recognition requires the extraction of meaningful characteristics or patterns from the data collected. This is achieved using machine learning techniques that are able to provide an assessment of the pain after training (learning), e.g., "no pain," "mild pain," or "severe pain." == Parameters == Although the phenomenon of pain comprises different components (sensory discriminative, affective (emotional), cognitive, vegetative, and (psycho-)motor), automated pain recognition currently relies on the measurable parameters of pain responses. These can be divided roughly into the two main categories of "physiological responses" and "behavioral responses". === Physiological responses === In humans, pain almost always initiates autonomic nervous processes that are reflected measurably in various physiological signals. 
==== Physiological signals ==== Measurements can include electrodermal activity (EDA, also skin conductance), electromyography (EMG), electrocardiogram (ECG), blood volume pulse (BVP), electroencephalogram (EEG), respiration, and body temperature, which are regulatory mechanisms of the sympathetic and parasympathetic systems. Physiological signals are mainly recorded using special non-invasive surface electrodes (for EDA, EMG, ECG, and EEG), a blood volume pulse sensor (BVP), a respiratory belt (respiration), and a thermal sensor (body temperature). Endocrinological and immunological parameters can also be recorded, but this requires measures that are somewhat invasive (e.g., blood sampling). === Behavioral responses === Behavioral responses to pain fulfil two functions: protection of the body (e.g., through protective reflexes) and external communication of the pain (e.g., as a cry for help). The responses are particularly evident in facial expressions, gestures, and paralinguistic features. ==== Facial expressions ==== Behavioral signals captured comprise facial expression patterns (expressive behavior), which are measured with the aid of video signals. Facial expression recognition is based on the everyday clinical observation that pain often manifests itself in the patient's facial expressions but that this is not necessarily always the case, since facial expressions can be inhibited through self-control. Despite the possibility that facial expressions may be influenced consciously, facial expression behavior represents an essential source of information for pain diagnosis and is thus also a source of information for automatic pain recognition. One advantage of video-based facial expression recognition is the contact-free measurement of the face, provided that it can be captured on video, which is not possible in every position (e.g., lying face down) or may be limited by bandages covering the face. 
Facial expression analysis relies on rapid, spontaneous, and temporary changes in neuromuscular activity that lead to visually detectable changes in the face. ==== Gestures ==== Gestures are also captured predominantly using non-contact camera technology. Motor pain responses vary and are strongly dependent on the type and cause of the pain. They range from abrupt protective reflexes (e.g., spontaneous retraction of extremities or doubling up) to agitation (pathological restlessness) and avoidance behavior (hesitant, cautious movements). ==== Paralinguistic features of language ==== Among other things, pain leads to nonverbal linguistic behavior that manifests itself in sounds such as sighing, gasping, moaning, whining, etc. Paralinguistic features are usually recorded using highly sensitive microphones. == Algorithms == After the recording, pre-processing (e.g., filtering), and extraction of relevant features, an optional information fusion can be performed. During this process, modalities from different signal sources are merged to generate new or more precise knowledge. The pain is classified using machine learning processes. The method chosen has a significant influence on the recognition rate and depends greatly on the quality and granularity of the underlying data. Similar to the field of affective computing, the following classifiers are currently being used: Support Vector Machine (SVM): The goal of an SVM is to find a clearly defined optimal hyperplane with the greatest minimal distance to two (or more) classes to be separated. The hyperplane acts as a decision function for classifying an unknown pattern. Random Forest (RF): RF is based on the composition of random, uncorrelated decision trees. An unknown pattern is judged individually by each tree and assigned to a class. The final classification of the patterns by the RF is then based on a majority decision. 
k-Nearest Neighbors (k-NN): The k-NN algorithm classifies an unknown object using the class label that is most common among the k neighbors closest to it. Its neighbors are determined using a selected similarity measure (e.g., Euclidean distance, Jaccard coefficient, etc.). Artificial neural networks (ANNs): ANNs are inspired by biological neural networks and model their organizational principles and processes in a very simplified manner. Class patterns are learned by adjusting the weights of the individual neuronal connections. == Databases == In order to classify pain in a valid manner, it is necessary to create representative, reliable, and valid pain databases that are available to the machine learner for training. An ideal database would be sufficiently large and would consist of natural (not experimental), high-quality pain responses. However, natural responses are difficult to record and can only be obtained to a limited extent; in most cases they are characterized by suboptimal quality. The databases currently available therefore contain experimental or quasi-experimental pain responses, and each database is based on a different pain model. The following list shows a selection of the most relevant pain databases (last updated: April 2020): UNBC-McMaster Shoulder Pain, BioVid Heat Pain, EmoPain, SenseEmotion, and X-ITE Pain.
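As a small illustration of one of the classifiers listed in the Algorithms section, the k-NN rule (majority label among the k closest training examples, here with Euclidean distance) can be sketched as follows. The two-dimensional feature vectors and their labels are made up for the example and do not come from any of the pain databases named above; the function name `knn_classify` is likewise hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8


def knn_classify(train, query, k=3):
    """Return the majority label among the k training points nearest to `query`.

    `train` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented, already-extracted features (for instance two normalized signal features).
train = [
    ((0.10, 0.20), "no pain"),
    ((0.20, 0.10), "no pain"),
    ((0.90, 0.80), "severe pain"),
    ((0.80, 0.90), "severe pain"),
    ((0.85, 0.85), "severe pain"),
]

print(knn_classify(train, (0.15, 0.15)))
print(knn_classify(train, (0.90, 0.90)))
```

In a real system the feature vectors would be the result of the recording, pre-processing, and feature extraction steps described earlier, and the choice of k and of the similarity measure would need to be validated on the data.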

    Read more →
  • Latent class model

    Latent class model

    In statistics, a latent class model (LCM) is a model for clustering multivariate discrete data. It assumes that the data arise from a mixture of discrete distributions, within each of which the variables are independent. It is called a latent class model because the class to which each data point belongs is unobserved (or latent). Latent class analysis (LCA) is a subset of structural equation modeling used to find groups or subtypes of cases in multivariate categorical data. These groups or subtypes of cases are called "latent classes". When faced with the following situation, a researcher might opt to use LCA to better understand the data: Symptoms a, b, c, and d have been recorded in a variety of patients diagnosed with diseases X, Y, and Z. Disease X is associated with symptoms a, b, and c; disease Y is linked to symptoms b, c, and d; and disease Z is connected to symptoms a, c, and d. In this context, the LCA would attempt to detect the presence of latent classes (i.e., the disease entities), thus creating patterns of association in the symptoms. As in factor analysis, LCA can also be used to classify cases according to their maximum likelihood class membership probability. The key criterion for resolving the LCA is identifying latent classes in which the observed symptom associations are effectively rendered null. This is because within each class, the diseases responsible for the symptoms create a structure of dependencies. As a result, the symptoms become conditionally independent, meaning that, given the class a case belongs to, the symptoms are no longer related to one another. == Model == Within each latent class, the observed variables are statistically independent—an essential aspect of latent class modeling. Usually, the observed variables are statistically dependent. By introducing the latent variable, independence is restored in the sense that within classes, variables are independent (local independence). 
Therefore, the association between the observed variables is explained by the classes of the latent variable (McCutcheon, 1987). In one form, the LCM is written as $p_{i_1, i_2, \ldots, i_N} \approx \sum_{t}^{T} p_t \prod_{n}^{N} p^{n}_{i_n,t}$, where $T$ is the number of latent classes and $p_t$ are the so-called recruitment or unconditional probabilities that should sum to one. $p^{n}_{i_n,t}$ are the marginal or conditional probabilities. For a two-way latent class model, the form is $p_{ij} \approx \sum_{t}^{T} p_t\, p_{it}\, p_{jt}$. This two-way model is related to probabilistic latent semantic analysis and non-negative matrix factorization. The probability model used in LCA is closely related to the Naive Bayes classifier. The main difference is that in LCA, the class membership of an individual is a latent variable, whereas in Naive Bayes classifiers, the class membership is an observed label. == Related methods == There are a number of methods with distinct names and uses that share a common relationship. Cluster analysis is, like LCA, used to discover taxon-like groups of cases in data. Multivariate mixture estimation (MME) is applicable to continuous data and assumes that such data arise from a mixture of distributions, such as a set of heights arising from a mixture of men and women. If a multivariate mixture estimation is constrained so that measures must be uncorrelated within each distribution, it is termed latent profile analysis. Modified to handle discrete data, this constrained analysis is known as LCA. Discrete latent trait models further constrain the classes to form from segments of a single dimension, allocating members to classes based on that dimension. An example would be assigning cases to social classes based on ability or merit.
In a practical instance, the variables could be multiple choice items of a political questionnaire. In this case, the data consists of an N-way contingency table with answers to the items for a number of respondents. In this example, the latent variable refers to political opinion, and the latent classes to political groups. Given group membership, the conditional probabilities specify the chance that certain answers are chosen. == Application == LCA may be used in many fields, such as collaborative filtering, behavior genetics, and the evaluation of diagnostic tests.
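The two-way form of the model, in which each cell probability is a class-weighted sum of products of within-class conditional probabilities, can be evaluated directly. The sketch below assumes the parameters are already known (estimation, for example by expectation-maximization, is not shown); the numbers and the function name `two_way_lcm` are invented for illustration.

```python
def two_way_lcm(p_t, p_it, p_jt):
    """Joint table p_ij = sum_t p_t * p_it * p_jt for a two-way latent class model.

    p_t[t]     : probability of latent class t
    p_it[t][i] : probability of row category i given class t
    p_jt[t][j] : probability of column category j given class t
    """
    n_classes = len(p_t)
    n_rows, n_cols = len(p_it[0]), len(p_jt[0])
    return [[sum(p_t[t] * p_it[t][i] * p_jt[t][j] for t in range(n_classes))
             for j in range(n_cols)]
            for i in range(n_rows)]


# Hypothetical parameters: two latent classes, two binary observed variables.
p_t = [0.5, 0.5]
p_it = [[0.9, 0.1], [0.2, 0.8]]
p_jt = [[0.8, 0.2], [0.3, 0.7]]

table = two_way_lcm(p_t, p_it, p_jt)
for row in table:
    print([round(p, 4) for p in row])
```

With these numbers the cell for the first categories of both variables is about 0.39, while the product of the corresponding marginals is 0.55 × 0.55 ≈ 0.30. The observed variables are therefore dependent overall even though they are independent within each class, which is the local independence property described in the Model section.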

    Read more →
  • Neural cryptography

    Neural cryptography

    Neural cryptography is a branch of cryptography dedicated to analyzing the application of stochastic algorithms, especially artificial neural network algorithms, for use in encryption and cryptanalysis. == Definition == Artificial neural networks are well known for their ability to selectively explore the solution space of a given problem. This feature finds a natural niche of application in the field of cryptanalysis. At the same time, neural networks offer a new approach to attacking ciphering algorithms, based on the principle that any function could be reproduced by a neural network, which is a powerful proven computational tool that can be used to find the inverse function of any cryptographic algorithm. The ideas of mutual learning, self learning, and stochastic behavior of neural networks and similar algorithms can be used for different aspects of cryptography, like public-key cryptography, solving the key distribution problem using neural network mutual synchronization, hashing or generation of pseudo-random numbers. Another idea is the ability of a neural network to separate space in non-linear pieces using "bias". It gives different probabilities of activating the neural network or not. This is very useful in the case of cryptanalysis. Two names are used to designate the same domain of research: Neuro-Cryptography and Neural Cryptography. The first known work on this topic can be traced back to 1995 in an IT Master Thesis. == Applications == In 1995, Sebastien Dourlens applied neural networks to cryptanalyze DES by allowing the networks to learn how to invert the S-tables of the DES. The bias in DES studied through Differential Cryptanalysis by Adi Shamir is highlighted. The experiment shows that about 50% of the key bits can be found, allowing the complete key to be found in a short time. Hardware implementations with multiple micro-controllers have been proposed due to the easy implementation of multilayer neural networks in hardware.
One example of a public-key protocol is given by Khalil Shihab. He describes the decryption scheme and the public key creation that are based on a backpropagation neural network. The encryption scheme and the private key creation process are based on Boolean algebra. This technique has the advantage of small time and memory complexities. A disadvantage is the property of backpropagation algorithms: because of huge training sets, the learning phase of a neural network is very long. Therefore, the use of this protocol is only theoretical so far. == Neural key exchange protocol == The most used protocol for key exchange between two parties A and B in practice is the Diffie–Hellman key exchange protocol. Neural key exchange, which is based on the synchronization of two tree parity machines, should be a secure replacement for this method. Synchronizing these two machines is similar to synchronizing two chaotic oscillators in chaos communications. === Tree parity machine === The tree parity machine is a special type of multi-layer feedforward neural network. It consists of one output neuron, K hidden neurons and K×N input neurons. Inputs to the network take three values: $x_{ij} \in \{-1, 0, +1\}$. The weights between input and hidden neurons take the values $w_{ij} \in \{-L, \dots, 0, \dots, +L\}$. The output value of each hidden neuron is calculated as the sign of the sum of the products of the input values and these weights: $\sigma_i = \operatorname{sgn}\left(\sum_{j=1}^{N} w_{ij} x_{ij}\right)$. Signum is a simple function that returns −1, 0 or 1: $\operatorname{sgn}(x) = \begin{cases} -1 & \text{if } x < 0, \\ 0 & \text{if } x = 0, \\ 1 & \text{if } x > 0. \end{cases}$
If the scalar product is 0, the output of the hidden neuron is mapped to −1 in order to ensure a binary output value. The output of the neural network is then computed as the product of all values produced by the hidden neurons: $\tau = \prod_{i=1}^{K} \sigma_i$. The output of the tree parity machine is binary. === Protocol === Each party (A and B) uses its own tree parity machine. Synchronization of the tree parity machines is achieved in these steps: 1. Initialize random weight values. 2. Execute these steps until full synchronization is achieved: 2.1. Generate a random input vector X. 2.2. Compute the values of the hidden neurons. 2.3. Compute the value of the output neuron. 2.4. Compare the output values of both tree parity machines: if the outputs are the same, one of the suitable learning rules is applied to the weights; if the outputs are different, go to 2.1. After full synchronization is achieved (the weights $w_{ij}$ of both tree parity machines are the same), A and B can use their weights as keys. This method is known as bidirectional learning.
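The protocol steps can be made concrete with a short simulation in which both parties live in one process. This is only a sketch under stated assumptions: it uses the Hebbian learning rule as the update, clips weights to the range from −L to L, draws inputs from {−1, 0, +1} as stated in the text, and picks small illustrative parameters; the function names are hypothetical, and nothing here should be read as a secure key exchange.

```python
import random


def tpm_output(weights, x):
    """Return (sigma, tau) for a tree parity machine with K rows of N weights."""
    sigma = []
    for w_row, x_row in zip(weights, x):
        s = sum(w * xi for w, xi in zip(w_row, x_row))
        sigma.append(1 if s > 0 else -1)  # a zero sum is mapped to -1
    tau = 1
    for s in sigma:
        tau *= s
    return sigma, tau


def hebbian_update(weights, x, sigma, tau, L):
    """Hebbian rule: only hidden units whose output equals tau are updated."""
    for i, (w_row, x_row) in enumerate(zip(weights, x)):
        if sigma[i] == tau:
            for j in range(len(w_row)):
                # Clipping plays the role of g(.), keeping weights in [-L, L].
                w_row[j] = max(-L, min(L, w_row[j] + sigma[i] * x_row[j]))


def synchronize(K=3, N=10, L=3, seed=0, max_steps=100_000):
    """Run the protocol; return (shared_weights, steps) or (None, max_steps)."""
    rng = random.Random(seed)

    def random_weights():
        return [[rng.randint(-L, L) for _ in range(N)] for _ in range(K)]

    a, b = random_weights(), random_weights()
    for step in range(max_steps):
        if a == b:
            return a, step
        x = [[rng.choice((-1, 0, 1)) for _ in range(N)] for _ in range(K)]
        sigma_a, tau_a = tpm_output(a, x)
        sigma_b, tau_b = tpm_output(b, x)
        if tau_a == tau_b:  # only the outputs tau are exchanged publicly
            hebbian_update(a, x, sigma_a, tau_a, L)
            hebbian_update(b, x, sigma_b, tau_b, L)
    return None, max_steps


key, steps = synchronize()
print("synchronized" if key is not None else "not synchronized", "after", steps, "steps")
```

Once the two weight matrices are equal they stay equal, because both machines then compute the same outputs and apply the same update to every shared input.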
One of the following learning rules can be used for the synchronization:

Hebbian learning rule:

$w_i^{+} = g\left(w_i + \sigma_i x_i \, \Theta(\sigma_i, \tau) \, \Theta(\tau^A, \tau^B)\right)$

Anti-Hebbian learning rule:

$w_i^{+} = g\left(w_i - \sigma_i x_i \, \Theta(\sigma_i, \tau) \, \Theta(\tau^A, \tau^B)\right)$

Random walk:

$w_i^{+} = g\left(w_i + x_i \, \Theta(\sigma_i, \tau) \, \Theta(\tau^A, \tau^B)\right)$

Here $\Theta(a, b) = 0$ if $a \neq b$ and $\Theta(a, b) = 1$ otherwise, so the weights of a hidden neuron change only when its output agrees with the machine's output and both machines produce the same output. The function $g(x)$ keeps each $w_i$ in the range $\{-L, -L+1, \dots, 0, \dots, L-1, L\}$.

=== Attacks and security of this protocol ===

In every attack it is assumed that the attacker E can eavesdrop on messages between the parties A and B but cannot change them.

==== Brute force ====

To mount a brute force attack, an attacker has to test all possible keys (all possible values of the weights $w_{ij}$). With K hidden neurons, K×N input neurons and weight bound L, there are $(2L+1)^{KN}$ possibilities. For example, the configuration K = 3, L = 3 and N = 100 gives $7^{300} \approx 3 \times 10^{253}$ key possibilities, making the attack infeasible with today's computing power.

==== Learning with own tree parity machine ====

One of the basic attacks can be mounted by an attacker who owns the same tree parity machine as the parties A and B and wants to synchronize it with theirs. In each step three situations are possible:

• Output(A) ≠ Output(B): neither party updates its weights.
• Output(A) = Output(B) = Output(E): all three parties update the weights in their tree parity machines.
• Output(A) = Output(B) ≠ Output(E): parties A and B update their tree parity machines, but the attacker cannot. Because of this situation, the attacker's learning is slower than the synchronization of parties A and B.

It has been proven that the synchronization of the two parties is faster than the learning of an attacker. Security can be improved by increasing the synaptic depth L of the neural network. This gives the protocol enough security, and an attacker can find the key only with small probability.

==== Other attacks ====

For conventional cryptographic systems, the security of the protocol can be improved by increasing the key length. In the case of neural cryptography, it is improved by increasing the synaptic depth L of the neural networks. Changing this parameter increases the cost of a successful attack exponentially, while the effort for the users grows polynomially. Therefore, breaking the security of neural key exchange belongs to the complexity class NP.

Alexander Klimov, Anton Mityaguine, and Adi Shamir say that the original neural synchronization scheme can be broken by at least three different attacks: geometric, probabilistic analysis, and genetic algorithms. Even though this particular implementation is insecure, the ideas behind chaotic synchronization could potentially lead to a secure implementation.

=== Permutation parity machine ===

The permutation parity machine is a binary variant of the tree parity machine. It consists of one input layer, one hidden layer and one output layer. The number of neurons in the output layer depends on the number of hidden units K.
Each hidden neuron has N binary input neurons:

$x_{ij} \in \{0, 1\}$

The weights between input and hidden neurons are also binary:

$w_{ij} \in \{0, 1\}$

Output value of each hidden neuron is calculated as a sum of all exclusive disjunctions (exclusive or) of input neurons and these weights:

$\sigma_i = \theta_N\left(\sum_{j=1}^{N} w_{ij} \oplus x_{ij}\right)$

(⊕ means XOR). Th

    Read more →
  • Public computer

    Public computer

    A public computer (or public access computer) is any of various computers available in public areas. Places where public computers may be available include libraries, schools, and dedicated facilities run by government. Public computers share similar hardware and software components with personal computers; however, the role and function of a public access computer is entirely different. A public access computer is used by many different untrusted individuals throughout the course of the day. The computer must be locked down and secured against both intentional and unintentional abuse. Users typically do not have authority to install software or change settings. A personal computer, in contrast, is typically used by a single responsible user, who can customize the machine's behavior to their preferences. Public access computers are often provided with tools such as a PC reservation system to regulate access. The world's first public access computer center was the Marin Computer Center in California, co-founded by David and Annie Fox in 1977.

    == Kiosks ==

    A kiosk is a special type of public computer that uses software and hardware modifications to provide services related only to the place where the kiosk is located. For example, a movie ticket kiosk can be found at a movie theater. These kiosks usually run a secure browser with no access to the desktop. Many of these kiosks run Linux; however, ATMs, kiosks designed for depositing money, often run Windows XP.

    == Public computers in the United States ==

    === Library computers ===

    In the United States and Canada, almost all public libraries have computers available for the use of patrons, though some libraries impose a time limit on users to ensure that others get a turn and to keep the library less busy. Users are often allowed to print documents that they have created using these computers, though sometimes for a small fee.
==== Privacy ====

Privacy is an important part of the public library institution, since libraries entitle the public to intellectual freedom. Use of any computer or network may create records of users' activities that can jeopardize their privacy. A patron can jeopardize their own privacy by failing to clear the cache, cookies, or documents from the public computer. The American Library Association (ALA) has guidelines to help members of the public protect their privacy on a computer. These give patrons an idea of the right way to use public library computers. In their provision of services to library users, librarians have an ethical responsibility, expressed in the ALA Code of Ethics, to preserve users' right to privacy. A librarian is also responsible for giving users an understanding of private patron use and access. Libraries must ensure that users have the following protections when browsing on public computers:

• the computer automatically clears a user's history;
• privacy screens are displayed so that users do not see another patron's screen;
• software is updated for effective safety measures;
• restoration software clears documents that users may have left on the computers and combats possible malware;
• security practices are followed; and
• users are made aware of any possible monitoring of their browsing activities.

Users can also view the Library Privacy Checklist for Public Access Computers and Networks to better understand what libraries strive for when protecting privacy.

=== School computers ===

The U.S. government has given money to many school boards to purchase computers for educational applications. Schools may have multiple computer labs, which contain these computers for students to use. There is usually Internet access on these machines, but some schools put up a blocking service that limits the websites students can access to educational resources, such as Google.
In addition to controlling the content students are viewing, these blocks can also help to keep the computers safe by preventing students from downloading malware and other threats. However, the effectiveness of such content filtering systems is questionable, since they can easily be circumvented by using proxy websites or virtual private networks; for some weak security systems, merely knowing the IP address of the intended website is enough to bypass the filter. School computers often have advanced operating system security to prevent tech-savvy students from inflicting damage (e.g. the Windows Registry Editor and Task Manager are disabled on Microsoft Windows machines). Schools with very advanced tech services may also install a locked-down BIOS/firmware or make kernel-level changes to the operating system, precluding the possibility of unauthorized activity.

    Read more →
  • Neocognitron

    Neocognitron

    The neocognitron is a hierarchical, multilayered artificial neural network proposed by Kunihiko Fukushima in 1979. It has been used for Japanese handwritten character recognition and other pattern recognition tasks, and served as the inspiration for convolutional neural networks. Earlier, in 1969, Fukushima had published a similar architecture, but with hand-designed kernels inspired by convolutions in mammalian vision. In 1975 he improved it to the Cognitron, and in 1979 he improved it to the neocognitron, which learns all convolutional kernels by unsupervised learning (in his terminology, "self-organized by 'learning without a teacher'"). The neocognitron was inspired by the model proposed by Hubel & Wiesel in 1959. They found two types of cells in the primary visual cortex, called simple cells and complex cells, and also proposed a cascading model of these two types of cells for use in pattern recognition tasks. The neocognitron is a natural extension of these cascading models. The neocognitron consists of multiple types of cells, the most important of which are called S-cells and C-cells. Local features are extracted by S-cells, and deformations of these features, such as local shifts, are tolerated by C-cells. Local features in the input are integrated gradually and classified in the higher layers. The idea of local feature integration is found in several other models, such as the convolutional neural network model, the SIFT method, and the HoG method. There are various kinds of neocognitron. For example, some types of neocognitron can detect multiple patterns in the same input by using backward signals to achieve selective attention.
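    The division of labour between S-cells and C-cells can be illustrated with a loose one-dimensional analogy. The sketch below is not Fukushima's cell equations: the thresholded dot product standing in for an S-cell, the max pooling standing in for a C-cell, and the names `s_layer` and `c_layer` are all assumptions chosen to show how pooling makes the response tolerant of a small shift.

```python
def s_layer(signal, kernel, threshold):
    """S-cell analogue: fire (1) wherever the local window matches the kernel."""
    n = len(kernel)
    out = []
    for start in range(len(signal) - n + 1):
        window = signal[start:start + n]
        response = sum(v * k for v, k in zip(window, kernel))
        out.append(1 if response >= threshold else 0)
    return out


def c_layer(responses, width):
    """C-cell analogue: pool neighbouring S responses over blocks of `width`."""
    return [max(responses[i:i + width]) for i in range(0, len(responses), width)]


kernel = [1, 1]                 # hypothetical feature: two adjacent active inputs
original = [0, 1, 1, 0, 0, 0]
shifted = [0, 0, 1, 1, 0, 0]    # same feature, moved one position to the right

print(c_layer(s_layer(original, kernel, 2), 3))  # [1, 0]
print(c_layer(s_layer(shifted, kernel, 2), 3))   # [1, 0]
```

    The S responses of the two inputs differ, but after pooling they are identical; a larger shift would move the feature into a different pooling block and change the output, which is why only local shifts are tolerated.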

    Read more →
  • Relief (feature selection)

    Relief (feature selection)

    Relief is an algorithm developed by Kenji Kira and Larry Rendell in 1992 that takes a filter-method approach to feature selection and is notably sensitive to feature interactions. It was originally designed for application to binary classification problems with discrete or numerical features. Relief calculates a score for each feature, which can then be used to rank and select the top-scoring features for feature selection. Alternatively, these scores may be applied as feature weights to guide downstream modeling. Relief feature scoring is based on the identification of feature value differences between nearest neighbor instance pairs. If a feature value difference is observed in a neighboring instance pair with the same class (a 'hit'), the feature score decreases. Alternatively, if a feature value difference is observed in a neighboring instance pair with different class values (a 'miss'), the feature score increases. The original Relief algorithm has since inspired a family of Relief-based feature selection algorithms (RBAs), including the ReliefF algorithm. Beyond the original Relief algorithm, RBAs have been adapted to (1) perform more reliably in noisy problems, (2) generalize to multi-class problems, (3) generalize to numerical outcome (i.e. regression) problems, and (4) be robust to incomplete (i.e. missing) data. To date, the development of RBA variants and extensions has focused on four areas: (1) improving performance of the 'core' Relief algorithm, i.e. examining strategies for neighbor selection and instance weighting, (2) improving scalability of the 'core' Relief algorithm to larger feature spaces through iterative approaches, (3) methods for flexibly adapting Relief to different data types, and (4) improving Relief run efficiency.
The strengths of these algorithms are that they are not dependent on heuristics, they run in low-order polynomial time, and they are noise-tolerant and robust to feature interactions, as well as being applicable to binary or continuous data; however, they do not discriminate between redundant features, and low numbers of training instances can fool them.

== Relief Algorithm ==

Take a data set with n instances of p features, belonging to two known classes. Within the data set, each feature should be scaled to the interval [0, 1] (binary data should remain as 0 and 1). The algorithm is repeated m times. Start with a p-long weight vector (W) of zeros.

At each iteration, take the feature vector (X) belonging to one random instance, and the feature vectors of the instance closest to X (by Euclidean distance) from each class. The closest same-class instance is called the 'near-hit', and the closest different-class instance is called the 'near-miss'. Update the weight vector such that

$W_i = W_i - (x_i - \mathrm{nearHit}_i)^2 + (x_i - \mathrm{nearMiss}_i)^2,$

where $i$ indexes the components and runs from 1 to p.

Thus the weight of any given feature decreases if it differs from that feature in nearby instances of the same class more than in nearby instances of the other class, and increases in the reverse case.

After m iterations, divide each element of the weight vector by m. This becomes the relevance vector. Features are selected if their relevance is greater than a threshold τ.

Kira and Rendell's experiments showed a clear contrast between relevant and irrelevant features, allowing τ to be determined by inspection. However, it can also be shown by Chebyshev's inequality that, for a given confidence level (α), a τ of $1/\sqrt{\alpha m}$ is good enough to make the probability of a Type I error less than α, although it is stated that τ can be much smaller than that.
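The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Kira and Rendell's code: the function name `relief`, the list-based data layout, and the optional `rng` parameter are hypothetical, features are assumed to be already scaled to [0, 1], and every class is assumed to contain at least two instances so that a near-hit always exists.

```python
import random


def relief(X, y, m, rng=None):
    """Return the relevance vector (W divided by m) for a two-class data set.

    X is a list of n feature vectors (each of length p, scaled to [0, 1]),
    y is the list of n class labels, m is the number of iterations.
    """
    if rng is None:
        rng = random.Random(0)
    p = len(X[0])
    W = [0.0] * p

    def sq_dist(a, b):
        # Squared Euclidean distance; it gives the same nearest neighbour
        # as the Euclidean distance itself.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for _ in range(m):
        idx = rng.randrange(len(X))
        x, label = X[idx], y[idx]
        hits = [X[j] for j in range(len(X)) if j != idx and y[j] == label]
        misses = [X[j] for j in range(len(X)) if y[j] != label]
        near_hit = min(hits, key=lambda z: sq_dist(x, z))
        near_miss = min(misses, key=lambda z: sq_dist(x, z))
        for i in range(p):
            W[i] += -(x[i] - near_hit[i]) ** 2 + (x[i] - near_miss[i]) ** 2

    return [w / m for w in W]


# Feature 0 determines the class; feature 1 is unrelated to it.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 1, 1]
print(relief(X, y, m=10))  # [1.0, -1.0]
```

In this toy data set every sampled instance contributes +1 to the first feature and −1 to the second, so the relevance vector separates the two features regardless of which instances are drawn; any threshold τ between −1 and 1 would then select only the first feature.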
Relief was also described as generalizable to multinomial classification by decomposition into a number of binary problems.

== ReliefF Algorithm ==

Kononenko et al. propose a number of updates to Relief. Firstly, they find the near-hit and near-miss instances using the Manhattan (L1) norm rather than the Euclidean (L2) norm, although the rationale is not specified. Furthermore, they found taking the absolute differences between $x_i$ and $\mathrm{nearHit}_i$, and between $x_i$ and $\mathrm{nearMiss}_i$, to be sufficient when updating the weight vector (rather than the squares of those differences).

=== Reliable probability estimation ===

Rather than repeating the algorithm m times, implement it exhaustively (i.e. n times, once for each instance) for relatively small n (up to one thousand). Furthermore, rather than finding the single nearest hit and single nearest miss, which may cause redundant and noisy attributes to affect the selection of the nearest neighbors, ReliefF searches for the k nearest hits and misses and averages their contributions to the weight of each feature. k can be tuned for any individual problem.

=== Incomplete data ===

In ReliefF, the contribution of missing values to the feature weight is determined using the conditional probability that two values should be the same or different, approximated with relative frequencies from the data set. This can be calculated if one or both features are missing.

=== Multi-class problems ===

Rather than use Kira and Rendell's proposed decomposition of a multinomial classification into a number of binomial problems, ReliefF searches for k near misses from each different class and averages their contributions for updating W, weighted with the prior probability of each class.

== Other Relief-based Algorithm Extensions/Derivatives ==

The following RBAs are arranged chronologically from oldest to most recent.
They include methods for improving (1) the core Relief algorithm concept, (2) iterative approaches for scalability, (3) adaptations to different data types, (4) strategies for computational efficiency, or (5) some combination of these goals. For more on RBAs see these book chapters or this most recent review paper.

=== RRELIEFF ===

Robnik-Šikonja and Kononenko propose further updates to ReliefF, making it appropriate for regression.

=== Relieved-F ===

Introduced a deterministic neighbor selection approach and a new approach for incomplete data handling.

=== Iterative Relief ===

Implemented a method to address bias against non-monotonic features. Introduced the first iterative Relief approach. For the first time, neighbors were uniquely determined by a radius threshold and instances were weighted by their distance from the target instance.

=== I-RELIEF ===

Introduced sigmoidal weighting based on distance from the target instance. All instance pairs (not just a defined subset of neighbors) contributed to score updates. Proposed an on-line learning variant of Relief. Extended the iterative Relief concept. Introduced local-learning updates between iterations for improved convergence.

=== TuRF (a.k.a. Tuned ReliefF) ===

Specifically sought to address noise in large feature spaces through the recursive elimination of features and the iterative application of ReliefF.

=== Evaporative Cooling ReliefF ===

Similarly sought to address noise in large feature spaces. Utilized an iterative 'evaporative' removal of the lowest-quality features using ReliefF scores in association with mutual information.

=== EReliefF (a.k.a. Extended ReliefF) ===

Addressed issues related to incomplete and multi-class data.

=== VLSReliefF (a.k.a. Very Large Scale ReliefF) ===

Dramatically improves the efficiency of detecting 2-way feature interactions in very large feature spaces by scoring random feature subsets rather than the entire feature space.
=== ReliefMSS ===

Introduced calculation of feature weights relative to the average feature 'diff' between instance pairs.

=== SURF ===

SURF identifies nearest neighbors (both hits and misses) based on a distance threshold from the target instance, defined by the average distance between all pairs of instances in the training data. Results suggest improved power to detect 2-way epistatic interactions over ReliefF.

=== SURF* (a.k.a. SURFStar) ===

SURF* extends the SURF algorithm to utilize not only 'near' neighbors in scoring updates but also 'far' instances, employing inverted scoring updates for 'far' instance pairs. Results suggest improved power to detect 2-way epistatic interactions over SURF, but an inability to detect simple main effects (i.e. univariate associations).

=== SWRF* ===

SWRF* extends the SURF* algorithm, adopting sigmoid weighting to take distance from the threshold into account. Also introduced a modular framework for further developing RBAs called MoRF.

=== MultiSURF* (a.k.a. MultiSURFStar) ===

MultiSURF* extends the SURF* algorithm, adapting the near/far neighborhood boundaries based on the average and standard deviation of distances from the target instance to all others. MultiSURF* uses the standard deviation to define a dead-band zone where 'middle-distance' instances do not contribute to scoring. Evidence suggests MultiSURF* performs best in detecting pure 2-way feature interactions.

=== Reli

    Read more →