AI Face Kissing Free Online

AI Face Kissing Free Online — independent reviews, comparisons, pricing and step-by-step guides on Aizhi.

  • Foodsi

    Foodsi

    Foodsi is a Polish mobile application that connects customers with restaurants, convenience stores, bakeries and cafes that have a surplus of food, allowing its users to buy the surplus at a reduced price. The service launched in 2019 in Warsaw and has expanded to other major cities in Poland. In 2023, a new feature was introduced in the app, allowing users to buy packages not only with self-pickup but also with delivery. The products range has also been expanded to include unsold magazines, cosmetics or plants. == History == The company was created in 2019 in Poland by Mateusz Kowalczyk and Jakub Fryszczyn. During studies in their home country and abroad, when they made a living working in restaurants and bakeries, they recognized the problem and the scale of food waste. They launched the application by themselves, having previously raised PLN 100,000 on their own for the purpose. Initially, Foodsi was an Android-only app, but over time, an IOS version was developed. In 2022, the startup raised PLN 6 million in a seed round from VC companies including CofounderZone and Status Starter, as well as private investors such as founders of Pyszne.pl. As of December 2023, it claimed more than 5000 businesses, serving over 1,5 million users, have saved nearly 3 million bags of food. == Purpose == Foodsi aims to significantly reduce food waste, which contributes to the Sustainable Development Goals. The application bridges the gap between the customers who are looking for shopping deals and the companies that want to reduce surplus products but are unable to sell them at a normal price. This allows the customers to buy unsold products for as little as 30% of the normal price. The company claims that every 4 out of 5 packages are sold on average. As of 2019 Foodsi employed more than 30 people. By 2024 it was more than 50. For now, Foodsi operates in major Polish cities such as Warsaw, Kraków, Trójmiasto, Wrocław, Poznań etc. However, in the upcoming years, Foodsi plans to expand to other countries. == Use == To start selling surplus, a company must leave Foodsi its contact information to register in the system. Registration in the app is completely free of charge. Then, companies offer available packages anticipating what won’t be sold and post them in the app along with the price so that users can buy them and pick them up. Companies can put their packages in the app at any time during the day. Users can pick up packages from bakeries, grocery stores, restaurants, but also florists and beauty stores. Foodsi charges a small commission on each package from the cooperating companies. If a user wants to start ordering packages from Foodsi, he or she needs to install the app on their mobile phone (Android or IOS) and register an account. The app displays a list of restaurants and other venues available in a specific region set by the user's location. Customers can see the price, address, distance and time range for package pickup. Packages are usually in the form of so-called 'surprise-packages', meaning that customers do not know specifically what kind of food/product will be inside. Some restaurants offer a choice of different package sizes. Prices are up to 70% lower than those of the original products. Customers have to show up at the restaurant to pick up the package using their phone at a time specified in the app. == Awards == Auler All-Stars 2025 - 3rd place Deloitte Technology Fast 50 - 2025 Central Europe Executive Club - Innowacja Roku: Żywność i Rolnictwo - Wyróżnienie (2025) Stena Circular Economy Award - Lider Gospodarki Obiegu Zamkniętego (2025) - wyróżnienie w kategorii start-up wdrażający GOZ na rynku polskim 255th place in the international poll FoodTech 500 2025 Finalist for the EY Entrepreneur Of The Year™ 2025 Wpływowi 2024 - Laureat w kategorii “Zrównoważony rozwój” Supplier of the Year 2024 - XXII Food & Business Forum Supplier of the Year 2024 - VII Sweets & Coffee Forum Innovative Leader 2024 - Leader in Food / Food-Tech Category - Executive Summit “Orzeł Innowacji - Start-up z potencjałem Polska-Świat” (Rzeczpospolita, 2024) 102nd place in the international poll FoodTech 500 2024 Auler 2023 Startup of the Year 2023 according to money.pl Start(up) w zrównoważoną przyszłość Kongresu Kompas ESG 2023 Marka Godna Zaufania according to My Company Polska 2023 184th place in the international poll FoodTech 500 2023 In 2023, Foodsi co-founder Mateusz Kowalczyk was recognized by Forbes magazine and included in its "30 before 30" list.

    Read more →
  • Variable-order Bayesian network

    Variable-order Bayesian network

    Variable-order Bayesian network (VOBN) models provide an important extension of both the Bayesian network models and the variable-order Markov models. VOBN models are used in machine learning in general and have shown great potential in bioinformatics applications. These models extend the widely used position weight matrix (PWM) models, Markov models, and Bayesian network (BN) models. In contrast to the BN models, where each random variable depends on a fixed subset of random variables, in VOBN models these subsets may vary based on the specific realization of observed variables. The observed realizations are often called the context and, hence, VOBN models are also known as context-specific Bayesian networks. The flexibility in the definition of conditioning subsets of variables turns out to be a real advantage in classification and analysis applications, as the statistical dependencies between random variables in a sequence of variables (not necessarily adjacent) may be taken into account efficiently, and in a position-specific and context-specific manner.

    Read more →
  • Mean squared prediction error

    Mean squared prediction error

    In statistics the mean squared prediction error (MSPE), also known as mean squared error of the predictions, of a smoothing, curve fitting, or regression procedure is the expected value of the squared prediction errors (PE), the square difference between the fitted values implied by the predictive function g ^ {\displaystyle {\widehat {g}}} and the values of the (unobservable) true value g. It is an inverse measure of the explanatory power of g ^ , {\displaystyle {\widehat {g}},} and can be used in the process of cross-validation of an estimated model. Knowledge of g would be required in order to calculate the MSPE exactly; in practice, MSPE is estimated. == Formulation == If the smoothing or fitting procedure has projection matrix (i.e., hat matrix) L, which maps the observed values vector y {\displaystyle y} to predicted values vector y ^ = L y , {\displaystyle {\hat {y}}=Ly,} then PE and MSPE are formulated as: P E i = g ( x i ) − g ^ ( x i ) , {\displaystyle \operatorname {PE_{i}} =g(x_{i})-{\widehat {g}}(x_{i}),} MSPE = E ⁡ [ PE i 2 ] = ∑ i = 1 n PE i 2 ⁡ / n . {\displaystyle \operatorname {MSPE} =\operatorname {E} \left[\operatorname {PE} _{i}^{2}\right]=\sum _{i=1}^{n}\operatorname {PE} _{i}^{2}/n.} The MSPE can be decomposed into two terms: the squared bias (mean error) of the fitted values and the variance of the fitted values: MSPE = ME 2 + VAR , {\displaystyle \operatorname {MSPE} =\operatorname {ME} ^{2}+\operatorname {VAR} ,} ME = E ⁡ [ g ^ ( x i ) − g ( x i ) ] {\displaystyle \operatorname {ME} =\operatorname {E} \left[{\widehat {g}}(x_{i})-g(x_{i})\right]} VAR = E ⁡ [ ( g ^ ( x i ) − E ⁡ [ g ( x i ) ] ) 2 ] . {\displaystyle \operatorname {VAR} =\operatorname {E} \left[\left({\widehat {g}}(x_{i})-\operatorname {E} \left[{g}(x_{i})\right]\right)^{2}\right].} The quantity SSPE=nMSPE is called sum squared prediction error. The root mean squared prediction error is the square root of MSPE: RMSPE=√MSPE. == Computation of MSPE over out-of-sample data == The mean squared prediction error can be computed exactly in two contexts. First, with a data sample of length n, the data analyst may run the regression over only q of the data points (with q < n), holding back the other n – q data points with the specific purpose of using them to compute the estimated model’s MSPE out of sample (i.e., not using data that were used in the model estimation process). Since the regression process is tailored to the q in-sample points, normally the in-sample MSPE will be smaller than the out-of-sample one computed over the n – q held-back points. If the increase in the MSPE out of sample compared to in sample is relatively slight, that results in the model being viewed favorably. And if two models are to be compared, the one with the lower MSPE over the n – q out-of-sample data points is viewed more favorably, regardless of the models’ relative in-sample performances. The out-of-sample MSPE in this context is exact for the out-of-sample data points that it was computed over, but is merely an estimate of the model’s MSPE for the mostly unobserved population from which the data were drawn. Second, as time goes on more data may become available to the data analyst, and then the MSPE can be computed over these new data. == Estimation of MSPE over the population == When the model has been estimated over all available data with none held back, the MSPE of the model over the entire population of mostly unobserved data can be estimated as follows. For the model y i = g ( x i ) + σ ε i {\displaystyle y_{i}=g(x_{i})+\sigma \varepsilon _{i}} where ε i ∼ N ( 0 , 1 ) {\displaystyle \varepsilon _{i}\sim {\mathcal {N}}(0,1)} , one may write n ⋅ MSPE ⁡ ( L ) = g T ( I − L ) T ( I − L ) g + σ 2 tr ⁡ [ L T L ] . {\displaystyle n\cdot \operatorname {MSPE} (L)=g^{\text{T}}(I-L)^{\text{T}}(I-L)g+\sigma ^{2}\operatorname {tr} \left[L^{\text{T}}L\right].} Using in-sample data values, the first term on the right side is equivalent to ∑ i = 1 n ( E ⁡ [ g ( x i ) − g ^ ( x i ) ] ) 2 = E ⁡ [ ∑ i = 1 n ( y i − g ^ ( x i ) ) 2 ] − σ 2 tr ⁡ [ ( I − L ) T ( I − L ) ] . {\displaystyle \sum _{i=1}^{n}\left(\operatorname {E} \left[g(x_{i})-{\widehat {g}}(x_{i})\right]\right)^{2}=\operatorname {E} \left[\sum _{i=1}^{n}\left(y_{i}-{\widehat {g}}(x_{i})\right)^{2}\right]-\sigma ^{2}\operatorname {tr} \left[\left(I-L\right)^{T}\left(I-L\right)\right].} Thus, n ⋅ MSPE ⁡ ( L ) = E ⁡ [ ∑ i = 1 n ( y i − g ^ ( x i ) ) 2 ] − σ 2 ( n − tr ⁡ [ L ] ) . {\displaystyle n\cdot \operatorname {MSPE} (L)=\operatorname {E} \left[\sum _{i=1}^{n}\left(y_{i}-{\widehat {g}}(x_{i})\right)^{2}\right]-\sigma ^{2}\left(n-\operatorname {tr} \left[L\right]\right).} If σ 2 {\displaystyle \sigma ^{2}} is known or well-estimated by σ ^ 2 {\displaystyle {\widehat {\sigma }}^{2}} , it becomes possible to estimate MSPE by n ⋅ M S P E ^ ⁡ ( L ) = ∑ i = 1 n ( y i − g ^ ( x i ) ) 2 − σ ^ 2 ( n − tr ⁡ [ L ] ) . {\displaystyle n\cdot \operatorname {\widehat {MSPE}} (L)=\sum _{i=1}^{n}\left(y_{i}-{\widehat {g}}(x_{i})\right)^{2}-{\widehat {\sigma }}^{2}\left(n-\operatorname {tr} \left[L\right]\right).} Colin Mallows advocated this method in the construction of his model selection statistic Cp, which is a normalized version of the estimated MSPE: C p = ∑ i = 1 n ( y i − g ^ ( x i ) ) 2 σ ^ 2 − n + 2 p . {\displaystyle C_{p}={\frac {\sum _{i=1}^{n}\left(y_{i}-{\widehat {g}}(x_{i})\right)^{2}}{{\widehat {\sigma }}^{2}}}-n+2p.} where p the number of estimated parameters p and σ ^ 2 {\displaystyle {\widehat {\sigma }}^{2}} is computed from the version of the model that includes all possible regressors. That concludes this proof.

    Read more →
  • PVLV

    PVLV

    The primary value learned value (PVLV) model is a possible explanation for the reward-predictive firing properties of dopamine (DA) neurons. It simulates behavioral and neural data on Pavlovian conditioning and the midbrain dopaminergic neurons that fire in proportion to unexpected rewards. It is an alternative to the temporal-differences (TD) algorithm. It is used as part of Leabra.

    Read more →
  • Wave Financial

    Wave Financial

    Wave is a Canadian company that provides financial services and software for small businesses. Wave is headquartered in the East Bayfront neighbourhood in Toronto, Canada. The company's first product was free online accounting software designed for businesses with 1–9 employees, followed by invoicing, personal finance and receipt-scanning software (OCR). In 2012, Wave began branching into financial services, initially with Payments by Wave (credit card processing) and Payroll by Wave, followed in February 2017 by Lending by Wave, which has since been discontinued. == History == CEO Kirk Simpson and CPO James Lochrie launched Wave Accounting Inc. in July 2009, Wave Accounting launched to the public on November 16, 2010. In June 2011, Series A funding led by OMERS Ventures was closed. In September 2011, FedDev Ontario invested one million dollars in funding. In October 2011, a $5-million investment led by U.S. venture capital firm Charles River Ventures was announced. In May 2012, Wave Accounting closed its series B financing round led by The Social+Capital Partnership, with follow-on participation from Charles River Ventures and OMERS Ventures. Wave acquired a company called Small Payroll in November 2011, which was later launched as a payroll product called Wave Payroll. In February 2012, Wave officially launched Wave Payroll to the public in Canada, followed by the American release in November of the same year. In August, 2012, the company announced the acquisition of Vuru.co, an online stock-tracking service. Terms of the deal were not disclosed. In December 2012, the company rebranded itself as Wave to emphasize its broadened spectrum of services. On March 14, 2019, the company acquired Every, a Toronto-based fintech company that provides business accounts and debit cards to small businesses. On June 11, 2019, the company announced it was being acquired by tax preparation company, H&R Block, for $537 million. On June 15, 2022, Wave announced that Kirk Simpson would be leaving and being replaced as CEO by Zahir Khoja. In May 2025, US customers of Wave were transitioned to a new Payroll processing system supported by CheckHQ. The new integration improved support for US employers by handling employer tax withholding and payments in all 50 US States. == Products == The company's initial product, Accounting by Wave, is a double entry accounting tool. Services include direct bank data imports, invoicing and expense tracking, customizable chart of accounts, and journal transactions. Accounting by Wave integrates with expense tracking software Shoeboxed and e-commerce website Etsy. The next product launched was Payroll by Wave, which was launched in 2012 after the acquisition of SmallPayroll.ca. Payroll by Wave is only available in the US and Canada. Invoicing by Wave is an offshoot of the company's earlier accounting tools. Additional products launched on or shortly after the company's rebrand in December 2012 include: a credit card processing tool, Payments by Wave, built initially on integration with Stripe credit card processing. However, Wave does not report merchant fees correctly for countries where Stripe charges a tax such as GST. In these cases, the merchant fees are reported without tax and do not match your Stripe account. a receipt scanning tool, Receipts by Wave. In 2017, Wave signed an agreement to provide its platform on RBC's online business banking site. The RBC-Wave service will be co-branded. == Taxes supported == The company's software supports tax-exclusive pricing, such as U.S. sales tax, where taxes are added on top of prices quoted. This has two effects: When scanning receipts users must manually add the tax, and input the amount. When making an invoice, users must put in a price before tax, and the system will add the tax on top. This makes Wave unable to handle taxes in countries like Australia where prices must be quoted inclusive of all taxes, such as GST. There is no way to set an invoice total and have Wave calculate the tax portion as a percentage. == Pricing and business model == As of June 10, 2024, Wave offers two tiers for its software: a free Starter plan with limitations on some features, and a paid Pro plan. In addition to its paid plan, revenue from the company comes from other paid financial services the company offers: Payments by Wave: Card processing which includes debit, credit and prepaid cards as well as ACH (bank payments) in the United States. Fees are a percentage of the transaction. Payroll by Wave: Monthly subscription fee plus usage fees. Wave previously included advertising on its pages as a source of revenue. Advertising was removed in January 2017. In 2017, Wave raised $24m (USD) in funding led by NAB Ventures. In 2019, H&R Block announced the acquisition of Wave in a cash deal worth $405 million USD.

    Read more →
  • Ho–Kashyap algorithm

    Ho–Kashyap algorithm

    The Ho–Kashyap algorithm is an iterative method in machine learning for finding a linear decision boundary that separates two linearly separable classes. It was developed by Yu-Chi Ho and Rangasami L. Kashyap in 1965, and usually presented as a problem in linear programming. == Setup == Given a training set consisting of samples from two classes, the Ho–Kashyap algorithm seeks to find a weight vector w {\displaystyle \mathbf {w} } and a margin vector b {\displaystyle \mathbf {b} } such that: Y w = b {\displaystyle \mathbf {Yw} =\mathbf {b} } where Y {\displaystyle \mathbf {Y} } is the augmented data matrix with samples from both classes (with appropriate sign conventions, e.g., samples from class 2 are negated), w {\displaystyle \mathbf {w} } is the weight vector to be determined, and b {\displaystyle \mathbf {b} } is a positive margin vector. The algorithm minimizes the criterion function: J ( w , b ) = | | Y w − b | | 2 {\displaystyle J(\mathbf {w} ,\mathbf {b} )=||\mathbf {Yw} -\mathbf {b} ||^{2}} subject to the constraint that b > 0 {\displaystyle \mathbf {b} >\mathbf {0} } (element-wise). Given a problem of linearly separating two classes, we consider a dataset of elements { ( x i , y i ) } i ∈ 1 : N {\displaystyle \{(\mathbf {x_{i}} ,y_{i})\}_{i\in 1:N}} where y i ∈ { − 1 , + 1 } {\displaystyle y_{i}\in \{-1,+1\}} . Linearly separating them by a perceptron is equivalent to finding weight and bias w , b {\displaystyle \mathbf {w} ,b} for a perceptron, such that: [ y 1 x 1 1 ⋮ ⋮ y N x N 1 ] [ w b ] > 0 {\displaystyle {\begin{bmatrix}y_{1}\mathbf {x} _{1}&1\\\vdots &\vdots \\y_{N}\mathbf {x} _{N}&1\\\end{bmatrix}}{\begin{bmatrix}\mathbf {w} \\b\end{bmatrix}}>0} == Algorithm == The idea of the Ho–Kashyap algorithm is as follows: Given any b {\displaystyle \mathbf {b} } , the corresponding w {\displaystyle \mathbf {w} } is known: It is simply w = Y + b {\displaystyle \mathbf {w} =\mathbf {Y} ^{+}\mathbf {b} } , where Y + {\displaystyle \mathbf {Y} ^{+}} denotes the Moore–Penrose pseudoinverse of Y {\displaystyle \mathbf {Y} } . Therefore, it only remains to find b {\displaystyle \mathbf {b} } by gradient descent. However, the gradient descent may sometimes decrease some of the coordinates of b {\displaystyle \mathbf {b} } , which may cause some coordinates of b {\displaystyle \mathbf {b} } to become negative, which is undesirable. Therefore, whenever some coordinates of b {\displaystyle \mathbf {b} } would have decreased, those coordinates are unchanged instead. As for the coordinates of b {\displaystyle \mathbf {b} } that would increase, those would increase without issue. Formally, the algorithm is as follows: Initialization: Set b ( 0 ) {\displaystyle \mathbf {b} (0)} to an arbitrary positive vector, typically b ( 0 ) = 1 {\displaystyle \mathbf {b} (0)=\mathbf {1} } (a vector of ones). Set the iteration counter k = 0 {\displaystyle k=0} . Set w ( 0 ) = Y + b ( 0 ) {\displaystyle \mathbf {w} (0)=\mathbf {Y} ^{+}\mathbf {b} (0)} Loop until convergence, or until iteration counter exceeds some k m a x {\displaystyle k_{max}} . Error calculation: Compute the error vector: e ( k ) = Y w ( k ) − b ( k ) {\displaystyle \mathbf {e} (k)=\mathbf {Yw} (k)-\mathbf {b} (k)} . Margin update: Update the margin vector: b ( k + 1 ) = b ( k ) + 2 η k ( e ( k ) + | e ( k ) | ) {\displaystyle \mathbf {b} (k+1)=\mathbf {b} (k)+2\eta _{k}(\mathbf {e} (k)+|\mathbf {e} (k)|)} where η k {\displaystyle \eta _{k}} is a positive learning rate parameter, and | e ( k ) | {\displaystyle |\mathbf {e} (k)|} denotes the element-wise absolute value. Weight calculation: Compute the weight vector using the pseudoinverse: w ( k + 1 ) = Y + b ( k + 1 ) {\displaystyle \mathbf {w} (k+1)=\mathbf {Y} ^{+}\mathbf {b} (k+1)} . Convergence check: If | | e ( k ) | | ≤ θ {\displaystyle ||\mathbf {e} (k)||\leq \theta } for some predetermined threshold θ {\displaystyle \theta } (close to zero), then return b ( k + 1 ) , w ( k + 1 ) {\displaystyle \mathbf {b} (k+1),\mathbf {w} (k+1)} . if e ( k ) ≤ 0 {\displaystyle \mathbf {e} (k)\leq \mathbf {0} } (all components non-positive), return "Samples not separable.". Return "Algorithm failed to converge in time.". == Properties == If the training data is linearly separable, the algorithm converges to a solution (where e ( k ) = 0 {\displaystyle \mathbf {e} (k)=\mathbf {0} } ) in a finite number of iterations. If the data is not linearly separable, the algorithm may or may not ever reach the point where e ( k ) = 0 {\displaystyle \mathbf {e} (k)=\mathbf {0} } . However, if it does happen that e ( k ) ≤ 0 {\displaystyle \mathbf {e} (k)\leq \mathbf {0} } at some iteration, this proves non-separability. The convergence rate depends on the choice of the learning rate parameter ρ {\displaystyle \rho } and the degree of linear separability of the data. == Relationship to other algorithms == Perceptron algorithm: Both seek linear separators. The perceptron updates weights incrementally based on individual misclassified samples, while Ho–Kashyap is a batch method that processes all samples to compute the pseudoinverse and updates based on an overall error vector. Linear discriminant analysis (LDA): LDA assumes underlying Gaussian distributions with equal covariances for the classes and derives the decision boundary from these statistical assumptions. Ho–Kashyap makes no explicit distributional assumptions and instead tries to solve a system of linear inequalities directly. Support vector machines (SVM): For linearly separable data, SVMs aim to find the maximum-margin hyperplane. The Ho–Kashyap algorithm finds a separating hyperplane but not necessarily the one with the maximum margin. If the data is not separable, soft-margin SVMs allow for some misclassifications by optimizing a trade-off between margin size and misclassification penalty, while Ho–Kashyap provides a least-squares solution. == Variants == Modified Ho–Kashyap algorithm changes weight calculation step w ( k + 1 ) = Y + b ( k + 1 ) {\displaystyle \mathbf {w} (k+1)=\mathbf {Y} ^{+}\mathbf {b} (k+1)} to w ( k + 1 ) = w ( k ) + η k Y + | e ( k ) | {\displaystyle \mathbf {w} (k+1)=\mathbf {w} (k)+\eta _{k}\mathbf {Y} ^{+}|\mathbf {e} (k)|} . Kernel Ho–Kashyap algorithm: Applies kernel methods (the "kernel trick") to the Ho–Kashyap framework to enable non-linear classification by implicitly mapping data to a higher-dimensional feature space.

    Read more →
  • FastICA

    FastICA

    FastICA is an efficient and popular algorithm for independent component analysis invented by Aapo Hyvärinen at Helsinki University of Technology. Like most ICA algorithms, FastICA seeks an orthogonal rotation of prewhitened data, through a fixed-point iteration scheme, that maximizes a measure of non-Gaussianity of the rotated components. Non-gaussianity serves as a proxy for statistical independence, which is a very strong condition and requires infinite data to verify. FastICA can also be alternatively derived as an approximative Newton iteration. == Algorithm == === Prewhitening the data === Let the X := ( x i j ) ∈ R N × M {\displaystyle \mathbf {X} :=(x_{ij})\in \mathbb {R} ^{N\times M}} denote the input data matrix, M {\displaystyle M} the number of columns corresponding with the number of samples of mixed signals and N {\displaystyle N} the number of rows corresponding with the number of independent source signals. The input data matrix X {\displaystyle \mathbf {X} } must be prewhitened, or centered and whitened, before applying the FastICA algorithm to it. Centering the data entails demeaning each component of the input data X {\displaystyle \mathbf {X} } , that is, for each i = 1 , … , N {\displaystyle i=1,\ldots ,N} and j = 1 , … , M {\displaystyle j=1,\ldots ,M} . After centering, each row of X {\displaystyle \mathbf {X} } has an expected value of 0 {\displaystyle 0} . Whitening the data requires a linear transformation L : R N × M → R N × M {\displaystyle \mathbf {L} :\mathbb {R} ^{N\times M}\to \mathbb {R} ^{N\times M}} of the centered data so that the components of L ( X ) {\displaystyle \mathbf {L} (\mathbf {X} )} are uncorrelated and have variance one. More precisely, if X {\displaystyle \mathbf {X} } is a centered data matrix, the covariance of L x := L ( X ) {\displaystyle \mathbf {L} _{\mathbf {x} }:=\mathbf {L} (\mathbf {X} )} is the ( N × N ) {\displaystyle (N\times N)} -dimensional identity matrix, that is, A common method for whitening is by performing an eigenvalue decomposition on the covariance matrix of the centered data X {\displaystyle \mathbf {X} } , E { X X T } = E D E T {\displaystyle E\left\{\mathbf {X} \mathbf {X} ^{T}\right\}=\mathbf {E} \mathbf {D} \mathbf {E} ^{T}} , where E {\displaystyle \mathbf {E} } is the matrix of eigenvectors and D {\displaystyle \mathbf {D} } is the diagonal matrix of eigenvalues. The whitened data matrix is defined thus by === Single component extraction === The iterative algorithm finds the direction for the weight vector w ∈ R N {\displaystyle \mathbf {w} \in \mathbb {R} ^{N}} that maximizes a measure of non-Gaussianity of the projection w T X {\displaystyle \mathbf {w} ^{T}\mathbf {X} } , with X ∈ R N × M {\displaystyle \mathbf {X} \in \mathbb {R} ^{N\times M}} denoting a prewhitened data matrix as described above. Note that w {\displaystyle \mathbf {w} } is a column vector. To measure non-Gaussianity, FastICA relies on a nonquadratic nonlinear function f ( u ) {\displaystyle f(u)} , its first derivative g ( u ) {\displaystyle g(u)} , and its second derivative g ′ ( u ) {\displaystyle g^{\prime }(u)} . Hyvärinen states that the functions are useful for general purposes, while may be highly robust. The steps for extracting the weight vector w {\displaystyle \mathbf {w} } for single component in FastICA are the following: Randomize the initial weight vector w {\displaystyle \mathbf {w} } Let w + ← E { X g ( w T X ) T } − E { g ′ ( w T X ) } w {\displaystyle \mathbf {w} ^{+}\leftarrow E\left\{\mathbf {X} g(\mathbf {w} ^{T}\mathbf {X} )^{T}\right\}-E\left\{g'(\mathbf {w} ^{T}\mathbf {X} )\right\}\mathbf {w} } , where E { . . . } {\displaystyle E\left\{...\right\}} means averaging over all column-vectors of matrix X {\displaystyle \mathbf {X} } Let w ← w + / ‖ w + ‖ {\displaystyle \mathbf {w} \leftarrow \mathbf {w} ^{+}/\|\mathbf {w} ^{+}\|} If not converged, go back to 2 === Multiple component extraction === The single unit iterative algorithm estimates only one weight vector which extracts a single component. Estimating additional components that are mutually "independent" requires repeating the algorithm to obtain linearly independent projection vectors - note that the notion of independence here refers to maximizing non-Gaussianity in the estimated components. Hyvärinen provides several ways of extracting multiple components with the simplest being the following. Here, 1 M {\displaystyle \mathbf {1_{M}} } is a column vector of 1's of dimension M {\displaystyle M} . Algorithm FastICA Input: C {\displaystyle C} Number of desired components Input: X ∈ R N × M {\displaystyle \mathbf {X} \in \mathbb {R} ^{N\times M}} Prewhitened matrix, where each column represents an N {\displaystyle N} -dimensional sample, where C <= N {\displaystyle C<=N} Output: W ∈ R N × C {\displaystyle \mathbf {W} \in \mathbb {R} ^{N\times C}} Un-mixing matrix where each column projects X {\displaystyle \mathbf {X} } onto independent component. Output: S ∈ R C × M {\displaystyle \mathbf {S} \in \mathbb {R} ^{C\times M}} Independent components matrix, with M {\displaystyle M} columns representing a sample with C {\displaystyle C} dimensions. for p in 1 to C: w p ← {\displaystyle \mathbf {w_{p}} \leftarrow } Random vector of length N while w p {\displaystyle \mathbf {w_{p}} } changes w p ← 1 M X g ( w p T X ) T − 1 M g ′ ( w p T X ) 1 M w p {\displaystyle \mathbf {w_{p}} \leftarrow {\frac {1}{M}}\mathbf {X} g(\mathbf {w_{p}} ^{T}\mathbf {X} )^{T}-{\frac {1}{M}}g'(\mathbf {w_{p}} ^{T}\mathbf {X} )\mathbf {1_{M}} \mathbf {w_{p}} } w p ← w p − ∑ j = 1 p − 1 ( w p T w j ) w j {\displaystyle \mathbf {w_{p}} \leftarrow \mathbf {w_{p}} -\sum _{j=1}^{p-1}(\mathbf {w_{p}} ^{T}\mathbf {w_{j}} )\mathbf {w_{j}} } w p ← w p ‖ w p ‖ {\displaystyle \mathbf {w_{p}} \leftarrow {\frac {\mathbf {w_{p}} }{\|\mathbf {w_{p}} \|}}} output W ← [ w 1 , … , w C ] {\displaystyle \mathbf {W} \leftarrow {\begin{bmatrix}\mathbf {w_{1}} ,\dots ,\mathbf {w_{C}} \end{bmatrix}}} output S ← W T X {\displaystyle \mathbf {S} \leftarrow \mathbf {W^{T}} \mathbf {X} }

    Read more →
  • Apache Giraph

    Apache Giraph

    Apache Giraph is an Apache project to perform graph processing on big data. Giraph utilizes Apache Hadoop's MapReduce implementation to process graphs. Facebook used Giraph with some performance improvements to analyze one trillion edges using 200 machines in 4 minutes. Giraph is based on a paper published by Google about its own graph processing system called Pregel. It can be compared to other Big Graph processing libraries such as Cassovary. As of September 2023, it is no longer actively developed.

    Read more →
  • Write or Die

    Write or Die

    Write or Die is an online web application designed to combat writer's block by letting users of the application punish themselves if they slow down or stop typing in the application's window. How severe the punishments are depends on the mode the user chooses, which ranges from "Gentle" to "Kamikaze". It was reviewed by publications PCWorld, the Los Angeles Times and The Guardian, and it was most notably used by writers Helen Oyeyemi and David Nicholls. The creator, Jeff Printy, explained that he wrote the application because he wants "to be published and make a living as a writer."

    Read more →
  • T-distributed stochastic neighbor embedding

    T-distributed stochastic neighbor embedding

    t-distributed stochastic neighbor embedding (t-SNE) is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map. It is based on Stochastic Neighbor Embedding originally developed by Geoffrey Hinton and Sam Roweis, where Laurens van der Maaten and Hinton proposed the t-distributed variant. It is a nonlinear dimensionality reduction technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate. A Riemannian variant is UMAP. t-SNE has been used for visualization in a wide range of applications, including genomics, computer security research, natural language processing, music analysis, cancer research, bioinformatics, geological domain interpretation, and biomedical signal processing. For a data set with n {\displaystyle n} elements, t-SNE runs in O ( n 2 ) {\displaystyle O(n^{2})} time and requires O ( n 2 ) {\displaystyle O(n^{2})} space. == Details == Given a set of N {\displaystyle N} high-dimensional objects x 1 , … , x N {\displaystyle \mathbf {x} _{1},\dots ,\mathbf {x} _{N}} , t-SNE first computes probabilities p i j {\displaystyle p_{ij}} that are proportional to the similarity of objects x i {\displaystyle \mathbf {x} _{i}} and x j {\displaystyle \mathbf {x} _{j}} , as follows. For i ≠ j {\displaystyle i\neq j} , define p j ∣ i = exp ⁡ ( − ‖ x i − x j ‖ 2 / 2 σ i 2 ) ∑ k ≠ i exp ⁡ ( − ‖ x i − x k ‖ 2 / 2 σ i 2 ) {\displaystyle p_{j\mid i}={\frac {\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{j}\rVert ^{2}/2\sigma _{i}^{2})}{\sum _{k\neq i}\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{k}\rVert ^{2}/2\sigma _{i}^{2})}}} and set p i ∣ i = 0 {\displaystyle p_{i\mid i}=0} . Note the above denominator ensures ∑ j p j ∣ i = 1 {\displaystyle \sum _{j}p_{j\mid i}=1} for all i {\displaystyle i} . As van der Maaten and Hinton explained: "The similarity of datapoint x j {\displaystyle x_{j}} to datapoint x i {\displaystyle x_{i}} is the conditional probability, p j | i {\displaystyle p_{j|i}} , that x i {\displaystyle x_{i}} would pick x j {\displaystyle x_{j}} as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x i {\displaystyle x_{i}} ." Now define p i j = p j ∣ i + p i ∣ j 2 N {\displaystyle p_{ij}={\frac {p_{j\mid i}+p_{i\mid j}}{2N}}} This is motivated because p i {\displaystyle p_{i}} and p j {\displaystyle p_{j}} from the N samples are estimated as 1/N, so the conditional probability can be written as p i ∣ j = N p i j {\displaystyle p_{i\mid j}=Np_{ij}} and p j ∣ i = N p j i {\displaystyle p_{j\mid i}=Np_{ji}} . Since p i j = p j i {\displaystyle p_{ij}=p_{ji}} , you can obtain previous formula. Also note that p i i = 0 {\displaystyle p_{ii}=0} and ∑ i , j p i j = 1 {\displaystyle \sum _{i,j}p_{ij}=1} . The bandwidth of the Gaussian kernels σ i {\displaystyle \sigma _{i}} is set in such a way that the entropy of the conditional distribution equals a predefined entropy using the bisection method. As a result, the bandwidth is adapted to the density of the data: smaller values of σ i {\displaystyle \sigma _{i}} are used in denser parts of the data space. The entropy increases with the perplexity of this distribution P i {\displaystyle P_{i}} ; this relation is seen as P e r p ( P i ) = 2 H ( P i ) {\displaystyle Perp(P_{i})=2^{H(P_{i})}} where H ( P i ) {\displaystyle H(P_{i})} is the Shannon entropy H ( P i ) = − ∑ j p j | i log 2 ⁡ p j | i . {\displaystyle H(P_{i})=-\sum _{j}p_{j|i}\log _{2}p_{j|i}.} The perplexity is a hand-chosen parameter of t-SNE, and as the authors state, "perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.". Since the Gaussian kernel uses the Euclidean distance ‖ x i − x j ‖ {\displaystyle \lVert x_{i}-x_{j}\rVert } , it is affected by the curse of dimensionality, and in high dimensional data when distances lose the ability to discriminate, the p i j {\displaystyle p_{ij}} become too similar (asymptotically, they would converge to a constant). It has been proposed to adjust the distances with a power transform, based on the intrinsic dimension of each point, to alleviate this. t-SNE aims to learn a d {\displaystyle d} -dimensional map y 1 , … , y N {\displaystyle \mathbf {y} _{1},\dots ,\mathbf {y} _{N}} (with y i ∈ R d {\displaystyle \mathbf {y} _{i}\in \mathbb {R} ^{d}} and d {\displaystyle d} typically chosen as 2 or 3) that reflects the similarities p i j {\displaystyle p_{ij}} as well as possible. To this end, it measures similarities q i j {\displaystyle q_{ij}} between two points in the map y i {\displaystyle \mathbf {y} _{i}} and y j {\displaystyle \mathbf {y} _{j}} , using a very similar approach. Specifically, for i ≠ j {\displaystyle i\neq j} , define q i j {\displaystyle q_{ij}} as q i j = ( 1 + ‖ y i − y j ‖ 2 ) − 1 ∑ k ∑ l ≠ k ( 1 + ‖ y k − y l ‖ 2 ) − 1 {\displaystyle q_{ij}={\frac {(1+\lVert \mathbf {y} _{i}-\mathbf {y} _{j}\rVert ^{2})^{-1}}{\sum _{k}\sum _{l\neq k}(1+\lVert \mathbf {y} _{k}-\mathbf {y} _{l}\rVert ^{2})^{-1}}}} and set q i i = 0 {\displaystyle q_{ii}=0} . Herein a heavy-tailed Student t-distribution (with one-degree of freedom, which is the same as a Cauchy distribution) is used to measure similarities between low-dimensional points in order to allow dissimilar objects to be modeled far apart in the map. The locations of the points y i {\displaystyle \mathbf {y} _{i}} in the map are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution P {\displaystyle P} from the distribution Q {\displaystyle Q} , that is: K L ( P ∥ Q ) = ∑ i ≠ j p i j log ⁡ p i j q i j {\displaystyle \mathrm {KL} \left(P\parallel Q\right)=\sum _{i\neq j}p_{ij}\log {\frac {p_{ij}}{q_{ij}}}} The minimization of the Kullback–Leibler divergence with respect to the points y i {\displaystyle \mathbf {y} _{i}} is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs. == Output == While t-SNE plots often seem to display clusters, the visual clusters can be strongly influenced by the chosen parameterization (especially the perplexity) and so a good understanding of the parameters for t-SNE is needed. Such "clusters" can be shown to even appear in structured data with no clear clustering, and so may be false findings. Similarly, the size of clusters produced by t-SNE is not informative, and neither is the distance between clusters. Thus, interactive exploration may be needed to choose parameters and validate results. It has been shown that t-SNE can often recover well-separated clusters, and with special parameter choices, approximates a simple form of spectral clustering. == Software == A C++ implementation of Barnes-Hut is available on the github account of one of the original authors. The R package Rtsne implements t-SNE in R. ELKI contains tSNE, also with Barnes-Hut approximation scikit-learn, a popular machine learning library in Python implements t-SNE with both exact solutions and the Barnes-Hut approximation. Tensorboard, the visualization kit associated with TensorFlow, also implements t-SNE (online version) The Julia package TSne implements t-SNE

    Read more →
  • Bioz

    Bioz

    Bioz is a search engine for life science experimentation. == History == Bioz was founded by Karin Lachmi and Daniel Levitt. Lachmi is a scientist who completed her postdoc in molecular and cellular biology at the Stanford University School of Medicine. During her lab work she found little available data regarding preferable lab tools, reagents and related products for experimentation. There are 50,000 vendors selling 300 million scientific products. She decided to start the company in order to provide researchers with adequate information for that purpose. Co-founder Daniel Levitt is an entrepreneur who sold his company WebAppoint to Microsoft in the year 2000. He also co-founded the company StemRad. At Bioz, Lachmi serves as the Chief Scientific Officer and Levitt serves as the chief executive officer. Bioz claims to have over a million researcher-users from 196 countries. Among the investors are Esther Dyson and the Stanford-StartX Fund. The company's advisory board includes Nobel Laureates in Chemistry Michael Levitt, Roger Kornberg, and Ada Yonath. == Technology == The company uses artificial intelligence, machine learning and natural language processing in order to extract experimentation data from scientific articles, such as the products that researchers used, the companies that supply the products, the protocol conditions that researchers selected, and the types of experiments and techniques. The algorithm ranks products based on how frequently they were used by researchers in their experiments, how recently a product was used, and the impact factor of the journal. The algorithm's output is a Bioz stars score for each product that was mentioned in an article. Bioz is a data-driven platform for product recommendations, which is contrary to platforms such as TripAdvisor and OpenTable that are based on user-generated reviews and ratings. The recommendations and scoring system that the company has developed are meant to assist researchers with the process of developing future medications and finding cures for diseases. They are guided towards products and techniques that were previously used by other researchers when planning and performing experiments. The company's revenue is based on selling SaaS subscriptions to researchers in biopharma companies. They also charge product suppliers for content syndication.

    Read more →
  • Out-of-bag error

    Out-of-bag error

    Out-of-bag (OOB) error, also called out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models utilizing bootstrap aggregating (bagging). Bagging uses subsampling with replacement to create training samples for the model to learn from. OOB error is the mean prediction error on each training sample xi, using only the trees that did not have xi in their bootstrap sample. Bootstrap aggregating allows one to define an out-of-bag estimate of the prediction performance improvement by evaluating predictions on those observations that were not used in the building of the next base learner. == Out-of-bag dataset == When bootstrap aggregating is performed, two independent sets are created. One set, the bootstrap sample, is the data chosen to be "in-the-bag" by sampling with replacement. The out-of-bag set is all data not chosen in the sampling process. When this process is repeated, such as when building a random forest, many bootstrap samples and OOB sets are created. The OOB sets can be aggregated into one dataset, but each sample is only considered out-of-bag for the trees that do not include it in their bootstrap sample. The picture below shows that for each bag sampled, the data is separated into two groups. This example shows how bagging could be used in the context of diagnosing disease. A set of patients are the original dataset, but each model is trained only by the patients in its bag. The patients in each out-of-bag set can be used to test their respective models. The test would consider whether the model can accurately determine if the patient has the disease. == Calculating out-of-bag error == Since each out-of-bag set is not used to train the model, it is a good test for the performance of the model. The specific calculation of OOB error depends on the implementation of the model, but a general calculation is as follows. Find all models (or trees, in the case of a random forest) that are not trained by the OOB instance. Take the majority vote of these models' result for the OOB instance, compared to the true value of the OOB instance. Compile the OOB error for all instances in the OOB dataset. The bagging process can be customized to fit the needs of a model. To ensure an accurate model, the bootstrap training sample size should be close to that of the original set. Also, the number of iterations (trees) of the model (forest) should be considered to find the true OOB error. The OOB error will stabilize over many iterations so starting with a high number of iterations is a good idea. Shown in the example to the right, the OOB error can be found using the method above once the forest is set up. == Comparison to cross-validation == Out-of-bag error and cross-validation (CV) are different methods of measuring the error estimate of a machine learning model. Over many iterations, the two methods should produce a very similar error estimate. That is, once the OOB error stabilizes, it will converge to the cross-validation (specifically leave-one-out cross-validation) error. The advantage of the OOB method is that it requires less computation and allows one to test the model as it is being trained. == Accuracy and Consistency == Out-of-bag error is used frequently for error estimation within random forests but with the conclusion of a study done by Silke Janitza and Roman Hornung, out-of-bag error has shown to overestimate in settings that include an equal number of observations from all response classes (balanced samples), small sample sizes, a large number of predictor variables, small correlation between predictors, and weak effects.

    Read more →
  • Adversarial machine learning

    Adversarial machine learning

    Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. Machine learning techniques are mostly designed to work on specific problem sets, under the assumption that the training and test data are generated from the same statistical distribution (IID). However, this assumption is often violated in practical high-stake applications, where users may intentionally supply fabricated data that violates the statistical assumption. Most common attacks in adversarial machine learning include evasion attacks, data poisoning attacks, Byzantine attacks and model extraction. == History == At the MIT Spam Conference in January 2004, John Graham-Cumming showed that a machine-learning spam filter could be used to defeat another machine-learning spam filter by automatically learning which words to add to a spam email to get the email classified as not spam. In 2004, Nilesh Dalvi and others noted that linear classifiers used in spam filters could be defeated by simple "evasion attacks" as spammers inserted "good words" into their spam emails. (Around 2007, some spammers added random noise to fuzz words within "image spam" in order to defeat OCR-based filters.) In 2006, Marco Barreno and others published "Can Machine Learning Be Secure?", outlining a broad taxonomy of attacks. As late as 2013 many researchers continued to hope that non-linear classifiers (such as support vector machines and neural networks) might be robust to adversaries, until Battista Biggio and others demonstrated the first gradient-based attacks on such machine-learning models (2012–2013). In 2012, deep neural networks began to dominate computer vision problems; starting in 2014, Christian Szegedy and others demonstrated that deep neural networks could be fooled by adversaries, again using a gradient-based attack to craft adversarial perturbations. Further work would show that adversarial attacks are harder to produce in uncontrolled environments, due to the different environmental constraints that cancel out the effect of noise. For example, any small rotation or slight illumination on an adversarial image can destroy the adversariality. In addition, researchers such as Google Brain's Nick Frosst point out that it is much easier to make self-driving cars miss stop signs by physically removing the sign itself, rather than creating adversarial examples. Frosst also believes that the adversarial machine learning community incorrectly assumes models trained on a certain data distribution will also perform well on a completely different data distribution. He suggests that a new approach to machine learning should be explored, and is currently working on a unique neural network that has characteristics more similar to human perception than state-of-the-art approaches. While adversarial machine learning continues to be heavily rooted in academia, large tech companies such as Google, Microsoft, and IBM have begun curating documentation and open source code bases to allow others to concretely assess the robustness of machine learning models and minimize the risk of adversarial attacks. === Examples === Examples include attacks in spam filtering, where spam messages are obfuscated through the misspelling of "bad" words or the insertion of "good" words; attacks in computer security, such as obfuscating malware code within network packets or modifying the characteristics of a network flow to mislead intrusion detection; attacks in biometric recognition where fake biometric traits may be exploited to impersonate a legitimate user; or to compromise users' template galleries that adapt to updated traits over time. Researchers showed that by changing only one-pixel it was possible to fool deep learning algorithms. Others 3-D printed a toy turtle with a texture engineered to make Google's object detection AI classify it as a rifle regardless of the angle from which the turtle was viewed. Creating the turtle required only low-cost commercially available 3-D printing technology. A machine-tweaked image of a dog was shown to look like a cat to both computers and humans. A 2019 study reported that humans can guess how machines will classify adversarial images. Researchers discovered methods for perturbing the appearance of a stop sign such that an autonomous vehicle classified it as a merge or speed limit sign. A data poisoning filter called Nightshade was released in 2023 by researchers at the University of Chicago. It was created for use by visual artists to put on their artwork to corrupt the data set of text-to-image models, which usually scrape their data from the internet without the consent of the image creator. McAfee attacked Tesla's former Mobileye system, fooling it into driving 50 mph over the speed limit, simply by adding a two-inch strip of black tape to a speed limit sign. Adversarial patterns on glasses or clothing designed to deceive facial-recognition systems or license-plate readers, have led to a niche industry of "stealth streetwear". An adversarial attack on a neural network can allow an attacker to inject algorithms into the target system. Researchers can also create adversarial audio inputs to disguise commands to intelligent assistants in benign-seeming audio; a parallel literature explores human perception of such stimuli. Clustering algorithms are used in security applications. Malware and computer virus analysis aims to identify malware families, and to generate specific detection signatures. In the context of malware detection, researchers have proposed methods for adversarial malware generation that automatically craft binaries to evade learning-based detectors while preserving malicious functionality. Optimization-based attacks such as GAMMA use genetic algorithms to inject benign content (for example, padding or new PE sections) into Windows executables, framing evasion as a constrained optimization problem that balances misclassification success with the size of the injected payload and showing transferability to commercial antivirus products. Complementary work uses generative adversarial networks (GANs) to learn feature-space perturbations that cause malware to be classified as benign; Mal-LSGAN, for instance, replaces the standard GAN loss with a least-squares objective and modified activation functions to improve training stability and produce adversarial malware examples that substantially reduce true positive rates across multiple detectors. == Challenges in applying machine learning to security == Researchers have observed that the constraints under which machine-learning techniques function in the security domain are different from those of common benchmark domains. Security data may change over time, include mislabeled samples, or reflect adversarial behavior, which complicates evaluation and reproducibility. === Data collection issues === Security datasets vary across formats, including binaries, network traces, and log files. Studies have reported that the process of converting these sources into features can introduce bias or inconsistencies. In addition, time-based leakage can occur when related malware samples are not properly separated across training and testing splits, which may lead to overly optimistic results. === Labeling and ground truth challenges === Malware labels are often unstable because different antivirus engines may classify the same sample in conflicting ways. Ceschin et al. note that families may be renamed or reorganized over time, causing further discrepancies in ground truth and reducing the reliability of benchmarks. === Concept drift === Because malware creators continuously adapt their techniques, the statistical properties of malicious samples also change. This form of concept drift has been widely documented and may reduce model performance unless systems are updated regularly or incorporate mechanisms for incremental learning. === Feature robustness === Researchers differentiate between features that can be easily manipulated and those that are more resistant to modification. For example, simple static attributes, such as header fields, may be altered by attackers, while structural features, such as control-flow graphs, are generally more stable but computationally expensive to extract. === Class imbalance === In realistic deployment environments, the proportion of malicious samples can be extremely low, ranging from 0.01% to 2% of total data. This unbalanced distribution causes models to develop a bias towards the majority class, achieving high accuracy but failing to identify malicious samples. Prior approaches to this problem have included both data-level solutions and sequence-specific models. Methods like n-gram and Long Short-Term Memory (LSTM) networks can model sequential data, but their performance has been shown to decline significantly when malware samples are realistically proportioned in the training set, demonstrating the limitations in

    Read more →
  • BrownBoost

    BrownBoost

    BrownBoost is a boosting algorithm that may be robust to noisy datasets. BrownBoost is an adaptive version of the boost by majority algorithm. As is the case for all boosting algorithms, BrownBoost is used in conjunction with other machine learning methods. BrownBoost was introduced by Yoav Freund in 2001. == Motivation == AdaBoost performs well on a variety of datasets; however, it can be shown that AdaBoost does not perform well on noisy data sets. This is a result of AdaBoost's focus on examples that are repeatedly misclassified. In contrast, BrownBoost effectively "gives up" on examples that are repeatedly misclassified. The core assumption of BrownBoost is that noisy examples will be repeatedly mislabeled by the weak hypotheses and non-noisy examples will be correctly labeled frequently enough to not be "given up on." Thus only noisy examples will be "given up on," whereas non-noisy examples will contribute to the final classifier. In turn, if the final classifier is learned from the non-noisy examples, the generalization error of the final classifier may be much better than if learned from noisy and non-noisy examples. The user of the algorithm can set the amount of error to be tolerated in the training set. Thus, if the training set is noisy (say 10% of all examples are assumed to be mislabeled), the booster can be told to accept a 10% error rate. Since the noisy examples may be ignored, only the true examples will contribute to the learning process. == Algorithm description == BrownBoost uses a non-convex potential loss function, thus it does not fit into the AdaBoost framework. The non-convex optimization provides a method to avoid overfitting noisy data sets. However, in contrast to boosting algorithms that analytically minimize a convex loss function (e.g. AdaBoost and LogitBoost), BrownBoost solves a system of two equations and two unknowns using standard numerical methods. The only parameter of BrownBoost ( c {\displaystyle c} in the algorithm) is the "time" the algorithm runs. The theory of BrownBoost states that each hypothesis takes a variable amount of time ( t {\displaystyle t} in the algorithm) which is directly related to the weight given to the hypothesis α {\displaystyle \alpha } . The time parameter in BrownBoost is analogous to the number of iterations T {\displaystyle T} in AdaBoost. A larger value of c {\displaystyle c} means that BrownBoost will treat the data as if it were less noisy and therefore will give up on fewer examples. Conversely, a smaller value of c {\displaystyle c} means that BrownBoost will treat the data as more noisy and give up on more examples. During each iteration of the algorithm, a hypothesis is selected with some advantage over random guessing. The weight of this hypothesis α {\displaystyle \alpha } and the "amount of time passed" t {\displaystyle t} during the iteration are simultaneously solved in a system of two non-linear equations ( 1. uncorrelated hypothesis w.r.t example weights and 2. hold the potential constant) with two unknowns (weight of hypothesis α {\displaystyle \alpha } and time passed t {\displaystyle t} ). This can be solved by bisection (as implemented in the JBoost software package) or Newton's method (as described in the original paper by Freund). Once these equations are solved, the margins of each example ( r i ( x j ) {\displaystyle r_{i}(x_{j})} in the algorithm) and the amount of time remaining s {\displaystyle s} are updated appropriately. This process is repeated until there is no time remaining. The initial potential is defined to be 1 m ∑ j = 1 m 1 − erf ( c ) = 1 − erf ( c ) {\displaystyle {\frac {1}{m}}\sum _{j=1}^{m}1-{\mbox{erf}}({\sqrt {c}})=1-{\mbox{erf}}({\sqrt {c}})} . Since a constraint of each iteration is that the potential be held constant, the final potential is 1 m ∑ j = 1 m 1 − erf ( r i ( x j ) / c ) = 1 − erf ( c ) {\displaystyle {\frac {1}{m}}\sum _{j=1}^{m}1-{\mbox{erf}}(r_{i}(x_{j})/{\sqrt {c}})=1-{\mbox{erf}}({\sqrt {c}})} . Thus the final error is likely to be near 1 − erf ( c ) {\displaystyle 1-{\mbox{erf}}({\sqrt {c}})} . However, the final potential function is not the 0–1 loss error function. For the final error to be exactly 1 − erf ( c ) {\displaystyle 1-{\mbox{erf}}({\sqrt {c}})} , the variance of the loss function must decrease linearly w.r.t. time to form the 0–1 loss function at the end of boosting iterations. This is not yet discussed in the literature and is not in the definition of the algorithm below. The final classifier is a linear combination of weak hypotheses and is evaluated in the same manner as most other boosting algorithms. == BrownBoost learning algorithm definition == Input: m {\displaystyle m} training examples ( x 1 , y 1 ) , … , ( x m , y m ) {\displaystyle (x_{1},y_{1}),\ldots ,(x_{m},y_{m})} where x j ∈ X , y j ∈ Y = { − 1 , + 1 } {\displaystyle x_{j}\in X,\,y_{j}\in Y=\{-1,+1\}} The parameter c {\displaystyle c} Initialise: s = c {\displaystyle s=c} . (The value of s {\displaystyle s} is the amount of time remaining in the game) r i ( x j ) = 0 {\displaystyle r_{i}(x_{j})=0} ∀ j {\displaystyle \forall j} . The value of r i ( x j ) {\displaystyle r_{i}(x_{j})} is the margin at iteration i {\displaystyle i} for example x j {\displaystyle x_{j}} . While s > 0 {\displaystyle s>0} : Set the weights of each example: W i ( x j ) = e − ( r i ( x j ) + s ) 2 c {\displaystyle W_{i}(x_{j})=e^{-{\frac {(r_{i}(x_{j})+s)^{2}}{c}}}} , where r i ( x j ) {\displaystyle r_{i}(x_{j})} is the margin of example x j {\displaystyle x_{j}} Find a classifier h i : X → { − 1 , + 1 } {\displaystyle h_{i}:X\to \{-1,+1\}} such that ∑ j W i ( x j ) h i ( x j ) y j > 0 {\displaystyle \sum _{j}W_{i}(x_{j})h_{i}(x_{j})y_{j}>0} Find values α , t {\displaystyle \alpha ,t} that satisfy the equation: ∑ j h i ( x j ) y j e − ( r i ( x j ) + α h i ( x j ) y j + s − t ) 2 c = 0 {\displaystyle \sum _{j}h_{i}(x_{j})y_{j}e^{-{\frac {(r_{i}(x_{j})+\alpha h_{i}(x_{j})y_{j}+s-t)^{2}}{c}}}=0} . (Note this is similar to the condition E W i + 1 [ h i ( x j ) y j ] = 0 {\displaystyle E_{W_{i+1}}[h_{i}(x_{j})y_{j}]=0} set forth by Schapire and Singer. In this setting, we are numerically finding the W i + 1 = exp ⁡ ( ⋯ ⋯ ) {\displaystyle W_{i+1}=\exp \left({\frac {\cdots }{\cdots }}\right)} such that E W i + 1 [ h i ( x j ) y j ] = 0 {\displaystyle E_{W_{i+1}}[h_{i}(x_{j})y_{j}]=0} .) This update is subject to the constraint ∑ ( Φ ( r i ( x j ) + α h ( x j ) y j + s − t ) − Φ ( r i ( x j ) + s ) ) = 0 {\displaystyle \sum \left(\Phi \left(r_{i}(x_{j})+\alpha h(x_{j})y_{j}+s-t\right)-\Phi \left(r_{i}(x_{j})+s\right)\right)=0} , where Φ ( z ) = 1 − erf ( z / c ) {\displaystyle \Phi (z)=1-{\mbox{erf}}(z/{\sqrt {c}})} is the potential loss for a point with margin r i ( x j ) {\displaystyle r_{i}(x_{j})} Update the margins for each example: r i + 1 ( x j ) = r i ( x j ) + α h ( x j ) y j {\displaystyle r_{i+1}(x_{j})=r_{i}(x_{j})+\alpha h(x_{j})y_{j}} Update the time remaining: s = s − t {\displaystyle s=s-t} Output: H ( x ) = sign ( ∑ i α i h i ( x ) ) {\displaystyle H(x)={\textrm {sign}}\left(\sum _{i}\alpha _{i}h_{i}(x)\right)} == Empirical results == In preliminary experimental results with noisy datasets, BrownBoost outperformed AdaBoost's generalization error; however, LogitBoost performed as well as BrownBoost. An implementation of BrownBoost can be found in the open source software JBoost.

    Read more →
  • Probit model

    Probit model

    In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from probability + unit. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification model. A probit model is a popular specification for a binary response model. As such it treats the same set of problems as does logistic regression using similar techniques. When viewed in the generalized linear model framework, the probit model employs a probit link function. It is most often estimated using the maximum likelihood procedure, such an estimation being called a probit regression. == Conceptual framework == Suppose a response variable Y is binary, that is it can have only two possible outcomes which we will denote as 1 and 0. For example, Y may represent presence/absence of a certain condition, success/failure of some device, answer yes/no on a survey, etc. We also have a vector of regressors X, which are assumed to influence the outcome Y. Specifically, we assume that the model takes the form P ( Y = 1 ∣ X ) = Φ ( X T β ) , {\displaystyle P(Y=1\mid X)=\Phi (X^{\operatorname {T} }\beta ),} where P is the probability and Φ {\displaystyle \Phi } is the cumulative distribution function (CDF) of the standard normal distribution. The parameters β are typically estimated by maximum likelihood. It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable Y ∗ = X T β + ε , {\displaystyle Y^{\ast }=X^{T}\beta +\varepsilon ,} where ε ~ N(0, 1). Then Y can be viewed as an indicator for whether this latent variable is positive: Y = { 1 Y ∗ > 0 0 otherwise } = { 1 X T β + ε > 0 0 otherwise } {\displaystyle Y=\left.{\begin{cases}1&Y^{}>0\\0&{\text{otherwise}}\end{cases}}\right\}=\left.{\begin{cases}1&X^{\operatorname {T} }\beta +\varepsilon >0\\0&{\text{otherwise}}\end{cases}}\right\}} The use of the standard normal distribution causes no loss of generality compared with the use of a normal distribution with an arbitrary mean and standard deviation, because adding a fixed amount to the mean can be compensated by subtracting the same amount from the intercept, and multiplying the standard deviation by a fixed amount can be compensated by multiplying the weights by the same amount. To see that the two models are equivalent, note that P ( Y = 1 ∣ X ) = P ( Y ∗ > 0 ) = P ( X T β + ε > 0 ) = P ( ε > − X T β ) = P ( ε < X T β ) by symmetry of the normal distribution = Φ ( X T β ) {\displaystyle {\begin{aligned}P(Y=1\mid X)&=P(Y^{\ast }>0)\\&=P(X^{\operatorname {T} }\beta +\varepsilon >0)\\&=P(\varepsilon >-X^{\operatorname {T} }\beta )\\&=P(\varepsilon 0 {\displaystyle t,\lim _{n\rightarrow \infty }n_{t}/n=c_{t}>0} . Denote p ^ t = r t / n t {\displaystyle {\hat {p}}_{t}=r_{t}/n_{t}} σ ^ t 2 = 1 n t p ^ t ( 1 − p ^ t ) φ 2 ( Φ − 1 ( p ^ t ) ) {\displaystyle {\hat {\sigma }}_{t}^{2}={\frac {1}{n_{t}}}{\frac {{\hat {p}}_{t}(1-{\hat {p}}_{t})}{\varphi ^{2}{\big (}\Phi ^{-1}({\hat {p}}_{t}){\big )}}}} Then Berkson's minimum chi-square estimator is a generalized least squares estimator in a regression of Φ − 1 ( p ^ t ) {\displaystyle \Phi ^{-1}({\hat {p}}_{t})} on x ( t ) {\displaystyle x_{(t)}} with weights σ ^ t − 2 {\displaystyle {\hat {\sigma }}_{t}^{-2}} : β ^ = ( ∑ t = 1 T σ ^ t − 2 x ( t ) x ( t ) T ) − 1 ∑ t = 1 T σ ^ t − 2 x ( t ) Φ − 1 ( p ^ t ) {\displaystyle {\hat {\beta }}={\Bigg (}\sum _{t=1}^{T}{\hat {\sigma }}_{t}^{-2}x_{(t)}x_{(t)}^{\operatorname {T} }{\Bigg )}^{-1}\sum _{t=1}^{T}{\hat {\sigma }}_{t}^{-2}x_{(t)}\Phi ^{-1}({\hat {p}}_{t})} It can be shown that this estimator is consistent (as n→∞ and T fixed), asymptotically normal and efficient. Its advantage is the presence of a closed-form formula for the estimator. However, it is only meaningful to carry out this analysis when individual observations are not available, only their aggregated counts r t {\displaystyle r_{t}} , n t {\disp

    Read more →