AdaBoost

AdaBoost

AdaBoost (short for Adaptive Boosting) is a statistical classification meta-algorithm formulated by Yoav Freund and Robert Schapire in 1995, who won the 2003 Gödel Prize for their work. It can be used in conjunction with many types of learning algorithm to improve performance. The output of multiple weak learners is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for binary classification, although it can be generalized to multiple classes or bounded intervals of real values. AdaBoost is adaptive in the sense that subsequent weak learners (models) are adjusted in favor of instances misclassified by previous models. In some problems, it can be less susceptible to overfitting than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner. Although AdaBoost is typically used to combine weak base learners (such as decision stumps), it has been shown to also effectively combine strong base learners (such as deeper decision trees), producing an even more accurate model. Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a dataset. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier. When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm such that later trees tend to focus on harder-to-classify examples. == Training == AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form F T ( x ) = ∑ t = 1 T f t ( x ) {\displaystyle F_{T}(x)=\sum _{t=1}^{T}f_{t}(x)} where each f t {\displaystyle f_{t}} is a weak learner that takes an object x {\displaystyle x} as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification. Each weak learner produces an output hypothesis h {\displaystyle h} which fixes a prediction h ( x i ) {\displaystyle h(x_{i})} for each sample in the training set. At each iteration t {\displaystyle t} , a weak learner is selected and assigned a coefficient α t {\displaystyle \alpha _{t}} such that the total training error E t {\displaystyle E_{t}} of the resulting t {\displaystyle t} -stage boosted classifier is minimized. E t = ∑ i E [ F t − 1 ( x i ) + α t h ( x i ) ] {\displaystyle E_{t}=\sum _{i}E[F_{t-1}(x_{i})+\alpha _{t}h(x_{i})]} Here F t − 1 ( x ) {\displaystyle F_{t-1}(x)} is the boosted classifier that has been built up to the previous stage of training and f t ( x ) = α t h ( x ) {\displaystyle f_{t}(x)=\alpha _{t}h(x)} is the weak learner that is being considered for addition to the final classifier. === Weighting === At each iteration of the training process, a weight w i , t {\displaystyle w_{i,t}} is assigned to each sample in the training set equal to the current error E ( F t − 1 ( x i ) ) {\displaystyle E(F_{t-1}(x_{i}))} on that sample. These weights can be used in the training of the weak learner. For instance, decision trees can be grown which favor the splitting of sets of samples with large weights. == Derivation == This derivation follows Rojas (2009): Suppose we have a data set { ( x 1 , y 1 ) , … , ( x N , y N ) } {\displaystyle \{(x_{1},y_{1}),\ldots ,(x_{N},y_{N})\}} where each item x i {\displaystyle x_{i}} has an associated class y i ∈ { − 1 , 1 } {\displaystyle y_{i}\in \{-1,1\}} , and a set of weak classifiers { k 1 , … , k L } {\displaystyle \{k_{1},\ldots ,k_{L}\}} each of which outputs a classification k j ( x i ) ∈ { − 1 , 1 } {\displaystyle k_{j}(x_{i})\in \{-1,1\}} for each item. After the ( m − 1 ) {\displaystyle (m-1)} -th iteration our boosted classifier is a linear combination of the weak classifiers of the form: C ( m − 1 ) ( x i ) = α 1 k 1 ( x i ) + ⋯ + α m − 1 k m − 1 ( x i ) , {\displaystyle C_{(m-1)}(x_{i})=\alpha _{1}k_{1}(x_{i})+\cdots +\alpha _{m-1}k_{m-1}(x_{i}),} where the class will be the sign of C ( m − 1 ) ( x i ) {\displaystyle C_{(m-1)}(x_{i})} . At the m {\displaystyle m} -th iteration we want to extend this to a better boosted classifier by adding another weak classifier k m {\displaystyle k_{m}} , with another weight α m {\displaystyle \alpha _{m}} : C m ( x i ) = C ( m − 1 ) ( x i ) + α m k m ( x i ) {\displaystyle C_{m}(x_{i})=C_{(m-1)}(x_{i})+\alpha _{m}k_{m}(x_{i})} So it remains to determine which weak classifier is the best choice for k m {\displaystyle k_{m}} , and what its weight α m {\displaystyle \alpha _{m}} should be. We define the total error E {\displaystyle E} of C m {\displaystyle C_{m}} as the sum of its exponential loss on each data point, given as follows: E = ∑ i = 1 N e − y i C m ( x i ) = ∑ i = 1 N e − y i C ( m − 1 ) ( x i ) e − y i α m k m ( x i ) {\displaystyle E=\sum _{i=1}^{N}e^{-y_{i}C_{m}(x_{i})}=\sum _{i=1}^{N}e^{-y_{i}C_{(m-1)}(x_{i})}e^{-y_{i}\alpha _{m}k_{m}(x_{i})}} Letting w i ( 1 ) = 1 {\displaystyle w_{i}^{(1)}=1} and w i ( m ) = e − y i C m − 1 ( x i ) {\displaystyle w_{i}^{(m)}=e^{-y_{i}C_{m-1}(x_{i})}} for m > 1 {\displaystyle m>1} , we have: E = ∑ i = 1 N w i ( m ) e − y i α m k m ( x i ) {\displaystyle E=\sum _{i=1}^{N}w_{i}^{(m)}e^{-y_{i}\alpha _{m}k_{m}(x_{i})}} We can split this summation between those data points that are correctly classified by k m {\displaystyle k_{m}} (so y i k m ( x i ) = 1 {\displaystyle y_{i}k_{m}(x_{i})=1} ) and those that are misclassified (so y i k m ( x i ) = − 1 {\displaystyle y_{i}k_{m}(x_{i})=-1} ): E = ∑ y i = k m ( x i ) w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) e α m = ∑ i = 1 N w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) ( e α m − e − α m ) {\displaystyle {\begin{aligned}E&=\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha _{m}}\\&=\sum _{i=1}^{N}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}\left(e^{\alpha _{m}}-e^{-\alpha _{m}}\right)\end{aligned}}} Since the only part of the right-hand side of this equation that depends on k m {\displaystyle k_{m}} is ∑ y i ≠ k m ( x i ) w i ( m ) {\textstyle \sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}} , we see that the k m {\displaystyle k_{m}} that minimizes E {\displaystyle E} is the one in the set { k 1 , … , k L } {\displaystyle \{k_{1},\ldots ,k_{L}\}} that minimizes ∑ y i ≠ k m ( x i ) w i ( m ) {\textstyle \sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}} [assuming that α m > 0 {\displaystyle \alpha _{m}>0} ], i.e. the weak classifier with the lowest weighted error (with weights w i ( m ) = e − y i C m − 1 ( x i ) {\displaystyle w_{i}^{(m)}=e^{-y_{i}C_{m-1}(x_{i})}} ). To determine the desired weight α m {\displaystyle \alpha _{m}} that minimizes E {\displaystyle E} with the k m {\displaystyle k_{m}} that we just determined, we differentiate: d E d α m = d ( ∑ y i = k m ( x i ) w i ( m ) e − α m + ∑ y i ≠ k m ( x i ) w i ( m ) e α m ) d α m {\displaystyle {\frac {dE}{d\alpha _{m}}}={\frac {d(\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}e^{-\alpha _{m}}+\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}e^{\alpha _{m}})}{d\alpha _{m}}}} The value of α m {\displaystyle \alpha _{m}} that minimizes the above expression is: α m = 1 2 ln ⁡ ( ∑ y i = k m ( x i ) w i ( m ) ∑ y i ≠ k m ( x i ) w i ( m ) ) {\displaystyle \alpha _{m}={\frac {1}{2}}\ln \left({\frac {\sum _{y_{i}=k_{m}(x_{i})}w_{i}^{(m)}}{\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}}\right)} We calculate the weighted error rate of the weak classifier to be ϵ m = ∑ y i ≠ k m ( x i ) w i ( m ) ∑ i = 1 N w i ( m ) {\displaystyle \epsilon _{m}={\frac {\sum _{y_{i}\neq k_{m}(x_{i})}w_{i}^{(m)}}{\sum _{i=1}^{N}w_{i}^{(m)}}}} , so it follows that: α m = 1 2 ln ⁡ ( 1 − ϵ m ϵ m ) {\displaystyle \alpha _{m}={\frac {1}{2}}\ln \left({\frac {1-\epsilon _{m}}{\epsilon _{m}}}\right)} which is the negative logit function multiplied by 0.5. Due to the convexity of E {\displaystyle E} as a function of α m {\displaystyle \alpha _{m}} , this new expression for α m {\displaystyle \alpha _{m}} gives the global minimum of the loss function. Note: This derivation only applies when k m ( x i ) ∈ { − 1 , 1 } {\displaystyle k_{m}(x_{i})\in \{-1,1\}} , though it can be a good starting guess in other cases, such as when the weak learner is biased ( k m ( x ) ∈ { a , b } , a ≠ − b {\displaystyle k_{m}(x)\in \{a,b\},a\neq -b} ), has multiple leaves ( k m ( x ) ∈ { a , b , … , n } {\displaystyle k_{m}(x)\in \{a,b,\dots ,n\}} ) or is some other function k m ( x ) ∈ R {\displaystyle k_{m}(x)\in \mathbb {R} } . Thus we have derived the AdaBoost algorithm: At each

Butler in a Box

Butler in a Box was an early voice-controlled home automation device developed in 1983 by magician Gus Searcy and programmer Franz Kavan. The device allowed users to control various home electronics, such as lights and phones, using voice commands. It predated modern smart speakers and virtual assistants by several decades. == History == The idea for the Butler in a Box originated in 1983 when Searcy was asked by friends why he couldn't simply command lights to turn on and off if he could pull rabbits out of hats, given his background as a professional magician. Searcy partnered with former IBM programmer Kavan to develop the device, with their first prototype being named "Sidney". The Butler in a Box combined remote control technology with voice recognition to enable control of home devices. However, it faced challenges due to the technological limitations of the era and its high price point of nearly $1,500 (equivalent to around $3,700 in 2021). == Features and functionality == Users could activate the Butler in a Box by speaking a wake word, typically a traditional butler name, and the device would address the user as "boss". It was capable of performing tasks such as: Turning lights on and off, controlling individual zones if lights were connected to remote control modules Making and receiving phone calls Setting timers Pairing with sensors to function as a security alarm system However, the device required extensive voice training for each user, a time-consuming process compared to modern voice recognition. Additionally, settings and trained commands would be lost if power was out for over 3 hours due to the volatile memory technology used at the time. == Reception and legacy == While innovative for its time, the Butler in a Box did not achieve widespread commercial success due to its high price and the technical limitations of the 1980s. Nevertheless, it served as an important early step in the development of home automation and showcased the potential for voice-controlled technology to enhance accessibility and convenience in the home. Decades later, products like Amazon Alexa, Google Home, and Apple's Siri would make voice-controlled smart home devices commonplace and affordable, building on the groundwork laid by early attempts like the Butler in a Box.

Netflix Prize

The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users being identified except by numbers assigned for the contest. The competition was held by Netflix, a video streaming service, and was open to anyone who was neither connected with Netflix (current and former employees, agents, close relatives of Netflix employees, etc.) nor a resident of certain blocked countries (such as Cuba or North Korea). On September 21, 2009, the grand prize of US$1,000,000 was given to the BellKor's Pragmatic Chaos team which bested Netflix's own algorithm for predicting ratings by 10.06%. == Problem and data sets == Netflix provided a training data set of 100,480,507 ratings that 480,189 users gave to 17,770 movies. Each training rating is a quadruplet of the form . The user and movie fields are integer IDs, while grades are from 1 to 5 (integer) stars. The qualifying data set contains over 2,817,131 triplets of the form , with grades known only to the jury. A participating team's algorithm must predict grades on the entire qualifying set, but they are informed of the score for only half of the data: a quiz set of 1,408,342 ratings. The other half is the test set of 1,408,789, and performance on this is used by the jury to determine potential prize winners. Only the judges know which ratings are in the quiz set, and which are in the test set—this arrangement is intended to make it difficult to hill climb on the test set. Submitted predictions are scored against the true grades in the form of root mean squared error (RMSE), and the goal is to reduce this error as much as possible. Note that, while the actual grades are integers in the range 1 to 5, submitted predictions need not be. Netflix also identified a probe subset of 1,408,395 ratings within the training data set. The probe, quiz, and test data sets were chosen to have similar statistical properties. In summary, the data used in the Netflix Prize looks as follows: Training set (99,072,112 ratings not including the probe set; 100,480,507 including the probe set) Probe set (1,408,395 ratings) Qualifying set (2,817,131 ratings) consisting of: Test set (1,408,789 ratings), used to determine winners Quiz set (1,408,342 ratings), used to calculate leaderboard scores For each movie, the title and year of release are provided in a separate dataset. No information at all is provided about users. In order to protect the privacy of the customers, "some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates." The training set is constructed such that the average user rated over 200 movies, and the average movie was rated by over 5000 users. But there is wide variance in the data—some movies in the training set have as few as 3 ratings, while one user rated over 17,000 movies. There was some controversy as to the choice of RMSE as the defining metric. It has been claimed that even as small an improvement as 1% RMSE results in a significant difference in the ranking of the "top-10" most recommended movies for a user. == Prizes == Prizes were based on improvement over Netflix's own algorithm, called Cinematch, or the previous year's score if a team has made improvement beyond a certain threshold. A trivial algorithm that predicts for each movie in the quiz set its average grade from the training data produces an RMSE of 1.0540. Cinematch uses "straightforward statistical linear models with a lot of data conditioning." The performance of Cinematch had plateaued by 2006. Using only the training data, Cinematch scores an RMSE of 0.9514 on the quiz data, roughly a 10% improvement over the trivial algorithm. Cinematch has a similar performance on the test set, 0.9525. In order to win the grand prize of $1,000,000, a participating team had to improve this by another 10%, to achieve 0.8572 on the test set. Such an improvement on the quiz set corresponds to an RMSE of 0.8563. As long as no team won the grand prize, a progress prize of $50,000 was awarded every year for the best result thus far. However, in order to win this prize, an algorithm had to improve the RMSE on the quiz set by at least 1% over the previous progress prize winner (or over Cinematch, the first year). If no submission succeeded, the progress prize was not to be awarded for that year. To win a progress or grand prize a participant had to provide source code and a description of the algorithm to the jury within one week after being contacted by them. Following verification the winner also had to provide a non-exclusive license to Netflix. Netflix would publish only the description, not the source code, of the system. (To keep their algorithm and source code secret, a team could choose not to claim a prize.) The jury also kept their predictions secret from other participants. A team could send as many attempts to predict grades as they wish. Originally submissions were limited to once a week, but the interval was quickly modified to once a day. A team's best submission so far counted as their current submission. Once one of the teams succeeded in improving the RMSE by 10% or more, the jury would issue a last call, giving all teams 30 days to send their submissions. Only then, the team with the best submission was asked for the algorithm description, source code, and non-exclusive license, and, after successful verification; declared a grand prize winner. The contest would last until the grand prize winner was declared. Had no one received the grand prize, it would have lasted for at least five years (until October 2, 2011). After that date, the contest could have been terminated at any time at Netflix's sole discretion. == Progress over the years == The competition began on October 2, 2006. By October 8, a team called WXYZConsulting had already beaten Cinematch's results. By October 15, there were three teams who had beaten Cinematch, one of them by 1.06%, enough to qualify for the annual progress prize. By June 2007 over 20,000 teams had registered for the competition from over 150 countries. 2,000 teams had submitted over 13,000 prediction sets. Over the first year of the competition, a handful of front-runners traded first place. The more prominent ones were: WXYZConsulting, a team of Wei Xu and Yi Zhang. (A front runner during November–December 2006.) ML@UToronto A, a team from the University of Toronto led by Prof. Geoffrey Hinton. (A front runner during parts of October–December 2006.) Gravity, a team of four scientists from the Budapest University of Technology (A front runner during January–May 2007.) BellKor, a group of scientists from AT&T Labs. (A front runner since May 2007.) Dinosaur Planet, a team of three undergraduates from Princeton University. (A front runner on September 3, 2007 for one hour before BellKor snatched back the lead.) The algorithms used by the leading teams were usually an ensemble of singular value decomposition, k-nearest neighbor, neural networks, and so on. On August 12, 2007, many contestants gathered at the KDD Cup and Workshop 2007, held at San Jose, California. During the workshop all four of the top teams on the leaderboard at that time presented their techniques. The team from IBM Research—Yan Liu, Saharon Rosset, Claudia Perlich, and Zhenzhen Kou—won the third place in Task 1 and first place in Task 2. Over the second year of the competition, only three teams reached the leading position: BellKor, a group of scientists from AT&T Labs (front runner during May 2007 – September 2008) BigChaos, a team of Austrian scientists from Commendo Research & Consulting (single team front runner since October 2008) BellKor in BigChaos, a joint team of the two leading single teams (a front runner since September 2008) === 2007 Progress Prize === On September 2, 2007, the competition entered the "last call" period for the 2007 Progress Prize. Over 40,000 teams from 186 countries had entered the contest. They had thirty days to tender submissions for consideration. At the beginning of this period the leading team was BellKor, with an RMSE of 0.8728 (8.26% improvement), followed by Dinosaur Planet (RMSE = 0.8769; 7.83% improvement), and Gravity (RMSE = 0.8785; 7.66% improvement). In the last hour of the last call period, an entry by "KorBell" took first place. This turned out to be an alternate name for Team BellKor. On November 13, 2007, team KorBell (formerly BellKor) was declared the winner of the $50,000 Progress Prize with an RMSE of 0.8712 (8.43% improvement). The team consisted of three researchers from AT&T Labs, Yehuda Koren, Robert Bell, and Chris Volinsky. As required, they published a description of their a

Deepfake pornography

Deepfake pornography is a form of non-consensual AI pornography created by altering existing photographs or videos using deepfake technology to modify the appearance of the participants. The use of deepfake pornography has sparked controversy because it involves the making and sharing of realistic videos featuring non-consenting individuals and is sometimes used for revenge porn. Many countries have criminalized this "new voyeurism" through legislative measures and technological solutions. == History == The term "deepfake" was coined in 2017 on a Reddit forum where users shared altered pornographic videos created using machine learning algorithms. It is a combination of the word "deep learning", which refers to the program used to create the videos, and "fake" meaning the videos are not real. Deepfake pornography was originally created on a small individual scale using a combination of machine learning algorithms, computer vision techniques, and AI software. The process began by gathering a large amount of source material (including both images and videos) of a person's face, and then using a deep learning model to train a Generative Adversarial Network to create a fake video that convincingly swaps the face of the source material onto the body of a pornographic performer. However, the production process has significantly evolved since 2018, with the advent of several public apps that have largely automated the process. While several AI "nudification" apps emerged on mainstream platforms like Google Play and the Apple App Store around 2023, major tech storefronts have since implemented stricter policies and automated detection to ban such software. Consequently, the proliferation of non-consensual deepfake pornography has largely shifted to decentralized websites, specialized online forums, and third-party messaging bot ecosystems. Deepfake pornography is sometimes confused with fake nude photography, but the two are mostly different. Fake nude photography typically uses non-sexual images and merely makes it appear that the people in them are nude. == Notable cases == Deepfake technology has been used to create non-consensual and pornographic images and videos of famous women. One of the earliest examples occurred in 2017 when a deepfake pornographic video of Gal Gadot was created by a Reddit user and quickly spread online. Since then, there have been numerous instances of similar deepfake content targeting other female celebrities, such as Emma Watson, Natalie Portman, and Scarlett Johansson. Johansson spoke publicly on the issue in December 2018, condemning the practice but also refusing legal action because she views the harassment as inevitable. === Rana Ayyub === In 2018, Rana Ayyub, an Indian investigative journalist, was the target of an online hate campaign stemming from her condemnation of the Indian government, specifically her speaking out against the rape of an eight-year-old Kashmiri girl. Ayyub was bombarded with rape and death threats, and had a doctored pornographic video of her circulated online. In a Huffington Post article, Ayyub discussed the long-lasting psychological and social effects this experience has had on her. She explained that she continued to struggle with her mental health and how the images and videos continued to resurface whenever she took a high-profile case. === Atrioc controversy === In 2023, Twitch streamer Atrioc stirred controversy when he accidentally revealed deepfake pornographic material featuring female Twitch streamers while on live. The influencer has since admitted to paying for AI generated porn, and apologized to the women and his fans. === Taylor Swift === In January 2024, AI-generated sexually explicit images of American singer Taylor Swift were posted on X (formerly Twitter), and spread to other platforms such as Facebook, Reddit and Instagram. One tweet with the images was viewed over 45 million times before being removed. A report from 404 Media found that the images appeared to have originated from a Telegram group, whose members used tools such as Microsoft Designer to generate the images, using misspellings and keyword hacks to work around Designer's content filters. After the material was posted, Swift's fans posted concert footage and images to bury the deepfake images, and reported the accounts posting the deepfakes. Searches for Swift's name were temporarily disabled on X, returning an error message instead. Graphika, a disinformation research firm, traced the creation of the images back to a 4chan community. A source close to Swift told the Daily Mail that she would be considering legal action, saying, "Whether or not legal action will be taken is being decided, but there is one thing that is clear: These fake AI-generated images are abusive, offensive, exploitative, and done without Taylor's consent and/or knowledge." The controversy drew condemnation from White House Press Secretary Karine Jean-Pierre, Microsoft CEO Satya Nadella, the Rape, Abuse & Incest National Network, and SAG-AFTRA. Several US politicians called for federal legislation against deepfake pornography. Later in the month, US senators Dick Durbin, Lindsey Graham, Amy Klobuchar and Josh Hawley introduced a bipartisan bill that would allow victims to sue individuals who produced or possessed "digital forgeries" with intent to distribute, or those who received the material knowing it was made non-consensually. === 2024 Telegram deepfake scandal === It emerged in South Korea in August 2024, that many teachers and female students were victims of deepfake images created by users who utilized AI technology. Journalist Ko Narin of The Hankyoreh uncovered the deepfake images through Telegram chats. On Telegram, group chats were created specifically for image-based sexual abuse of women, including middle and high school students, teachers, and even family members. Women with photos on social media platforms like KakaoTalk, Instagram, and Facebook are often targeted as well. Perpetrators use AI bots to generate fake images, which are then sold or widely shared, along with the victims' social media accounts, phone numbers, and KakaoTalk usernames. One Telegram group reportedly drew around 220,000 members, according to a Guardian report. Investigations revealed numerous chat groups on Telegram where users, mainly teenagers, create and share explicit deepfake images of classmates and teachers. The issue came in the wake of a troubling history of digital sex crimes, notably the notorious Nth Room case in 2019. The Korean Teachers Union estimated that more than 200 schools had been affected by these incidents. Activists called for a "national emergency" declaration to address the problem. South Korean police reported over 800 deepfake sex crime cases by the end of September 2024, a stark rise from just 156 cases in 2021, with most victims and offenders being teenagers. On September 21, 6,000 people gathered at Marronnier Park in northeastern Seoul to demand stronger legal action against deepfake crimes targeting women. On September 26, following widespread outrage over the Telegram scandal, South Korean lawmakers passed a bill criminalizing the possession or viewing of sexually explicit deepfake images and videos, imposing penalties that include prison terms and fines. Under the new law, those caught buying, saving, or watching such material could face up to three years in prison or fines up to 30 million won ($22,600). At the time the bill was proposed, creating sexually explicit deepfakes for distribution carried a maximum penalty of five years, but the new legislation would increase this to seven years, regardless of intent. By October 2024, it was estimated that "nudify" deep fake bots on Telegram were up to four million monthly users. === 2025–2026 Grok/X chatbot deepfake scandal === In December 2025, Bloomberg reported that X users found Grok would comply with unconsensual requests to digitally undress individuals, including minors, or show them performing sexually explicit acts. The majority of these prompts were targeted at women and girls. An analysis of 20,000 images generated by Grok between December 25, 2025 and January 1, 2026 showed 2% were of people in bikinis or transparent clothes and appeared to be 18 or younger, including 30 of "young or very young" women or girls. A separate analysis conducted over 24 hours from January 5 to 6 calculated that users had Grok create 6,700 sexually suggestive or nudified images per hour. xAI responded to requests for comment from media organizations with the automated reply, "Legacy Media Lies". The bot's image generation sparked an international backlash and calls for legal or regulatory action from officials in the European Union, United Kingdom, Poland, France, India, Malaysia, and Brazil. === Fernandes–Ulmen case === German TV presenter Collien Fernandes, filed a complaint against her ex-husband, actor Christian Ulmen, for several accusation including, ident

For a Breath I Tarry

"For a Breath I Tarry" is a 1966 post-apocalyptic novelette by American writer Roger Zelazny, which was nominated for the Hugo Award for Best Novelette in 1967. Set in a future long after the self-extinction of humanity, the novelette recounts the tale of Frost, a sentient machine. Although humans have caused their own extinction, the sentient machines that they created continue the work of rebuilding a shattered Earth. Along the way, the story explores the differences between humanity and machines, the former experiencing the world qualitatively, while the latter doing so quantitatively. This difference is illustrated through philosophical conversations between Frost and another machine named Mordel. Frost's goal of becoming human, along with literary allusions, drives the plot and sets the tone of the novelette. These allusions include the first chapter of the Book of Job, in both situation and language, since verses are both quoted directly and paraphrased. In addition, the first three chapters of the Book of Genesis are echoed. Finally, Frost and Mordel enter into a Faustian bargain, though with better results than in the original story. The other major character is the Beta Machine, Frost's peer in the Southern Hemisphere. (Frost controls the Northern Hemisphere.) The novelette hints that though being a machine, Beta has a feminine personality. After Frost has succeeded in his millennium-long quest to become human (via recovered DNA), Beta agrees to join him in becoming human—suggesting the possibility of rebirth for the human race. The novelette has appeared in collections of Zelazny's works and in anthologies. The title is from a phrase in the poet A. E. Housman's collection A Shropshire Lad.

VoxForge

VoxForge is a free speech corpus and acoustic model repository for open source speech recognition engines. VoxForge was set up to collect transcribed speech to create a free GPL speech corpus in order to be uses with open source speech recognition engines. The speech audio files will be 'compiled' into acoustic models for use with open source speech recognition engines such as Julius, ISIP, and Sphinx and HTK (note: HTK has distribution restrictions). VoxForge has used LibriVox as a source of audio data since 2007.

Data analysis for fraud detection

Fraud represents a significant problem for governments and businesses and specialized analysis techniques for discovering fraud using them are required. Some of these methods include knowledge discovery in databases (KDD), data mining, machine learning and statistics. They offer applicable and successful solutions in different areas of electronic fraud crimes. In general, the primary reason to use data analytics techniques is to tackle fraud since many internal control systems have serious weaknesses. For example, the currently prevailing approach employed by many law enforcement agencies to detect companies involved in potential cases of fraud consists in receiving circumstantial evidence or complaints from whistleblowers. As a result, a large number of fraud cases remain undetected and unprosecuted. In order to effectively test, detect, validate, correct error and monitor control systems against fraudulent activities, businesses entities and organizations rely on specialized data analytics techniques such as data mining, data matching, the sounds like function, regression analysis, clustering analysis, and gap analysis. Techniques used for fraud detection fall into two primary classes: statistical techniques and artificial intelligence. == Statistical techniques == Examples of statistical data analysis techniques are: Data preprocessing techniques for detection, validation, error correction, and filling up of missing or incorrect data. Calculation of various statistical parameters such as averages, quantiles, performance metrics, probability distributions, and so on. For example, the averages may include average length of call, average number of calls per month and average delays in bill payment. Models and probability distributions of various business activities either in terms of various parameters or probability distributions. Computing user profiles. Time-series analysis of time-dependent data. Clustering and classification to find patterns and associations among groups of data. Data matching Data matching is used to compare two sets of collected data. The process can be performed based on algorithms or programmed loops. Trying to match sets of data against each other or comparing complex data types. Data matching is used to remove duplicate records and identify links between two data sets for marketing, security or other uses. Sounds like Function is used to find values that sound similar. The Phonetic similarity is one way to locate possible duplicate values, or inconsistent spelling in manually entered data. The ‘sounds like’ function converts the comparison strings to four-character American Soundex codes, which are based on the first letter, and the first three consonants after the first letter, in each string. Regression analysis allows you to examine the relationship between two or more variables of interest. Regression analysis estimates relationships between independent variables and a dependent variable. This method can be used to help understand and identify relationships among variables and predict actual results. Gap analysis is used to determine whether business requirements are being met, if not, what are the steps that should be taken to meet successfully. Matching algorithms to detect anomalies in the behavior of transactions or users as compared to previously known models and profiles. Techniques are also needed to eliminate false alarms, estimate risks, and predict future of current transactions or users. Some forensic accountants specialize in forensic analytics which is the procurement and analysis of electronic data to reconstruct, detect, or otherwise support a claim of financial fraud. The main steps in forensic analytics are data collection, data preparation, data analysis, and reporting. For example, forensic analytics may be used to review an employee's purchasing card activity to assess whether any of the purchases were diverted or divertible for personal use. == Artificial intelligence == Fraud detection is a knowledge-intensive activity. The main AI techniques used for fraud detection include: Data mining to classify, cluster, and segment the data and automatically find associations and rules in the data that may signify interesting patterns, including those related to fraud. Expert systems to encode expertise for detecting fraud in the form of rules. Pattern recognition to detect approximate classes, clusters, or patterns of suspicious behavior either automatically (unsupervised) or to match given inputs. Machine learning techniques to automatically identify characteristics of fraud. Neural nets to independently generate classification, clustering, generalization, and forecasting that can then be compared against conclusions raised in internal audits or formal financial documents such as 10-Q. Other techniques such as link analysis, Bayesian networks, decision theory, and sequence matching are also used for fraud detection. A new and novel technique called System properties approach has also been employed where ever rank data is available. Statistical analysis of research data is the most comprehensive method for determining if data fraud exists. Data fraud as defined by the Office of Research Integrity (ORI) includes fabrication, falsification and plagiarism. == Machine learning and data mining == Early data analysis techniques were oriented toward extracting quantitative and statistical data characteristics. These techniques facilitate useful data interpretations and can help to get better insights into the processes behind the data. Although the traditional data analysis techniques can indirectly lead us to knowledge, it is still created by human analysts. To go beyond, a data analysis system has to be equipped with a substantial amount of background knowledge, and be able to perform reasoning tasks involving that knowledge and the data provided. In effort to meet this goal, researchers have turned to ideas from the machine learning field. This is a natural source of ideas, since the machine learning task can be described as turning background knowledge and examples (input) into knowledge (output). If data mining results in discovering meaningful patterns, data turns into information. Information or patterns that are novel, valid and potentially useful are not merely information, but knowledge. One speaks of discovering knowledge, before hidden in the huge amount of data, but now revealed. The machine learning and artificial intelligence solutions may be classified into two categories: 'supervised' and 'unsupervised' learning. These methods seek for accounts, customers, suppliers, etc. that behave 'unusually' in order to output suspicion scores, rules or visual anomalies, depending on the method. Whether supervised or unsupervised methods are used, note that the output gives us only an indication of fraud likelihood. No stand alone statistical analysis can assure that a particular object is a fraudulent one, but they can identify them with very high degrees of accuracy. As a result, effective collaboration between machine learning model and human analysts is vital to the success of fraud detection applications. === Supervised learning === In supervised learning, a random sub-sample of all records is taken and manually classified as either 'fraudulent' or 'non-fraudulent' (task can be decomposed on more classes to meet algorithm requirements). Relatively rare events such as fraud may need to be over sampled to get a big enough sample size. These manually classified records are then used to train a supervised machine learning algorithm. After building a model using this training data, the algorithm should be able to classify new records as either fraudulent or non-fraudulent. Supervised neural networks, fuzzy neural nets, and combinations of neural nets and rules, have been extensively explored and used for detecting fraud in mobile phone networks and financial statement fraud. Bayesian learning neural network is implemented for credit card fraud detection, telecommunications fraud, auto claim fraud detection, and medical insurance fraud. Hybrid knowledge/statistical-based systems, where expert knowledge is integrated with statistical power, use a series of data mining techniques for the purpose of detecting cellular clone fraud. Specifically, a rule-learning program to uncover indicators of fraudulent behaviour from a large database of customer transactions is implemented. Cahill et al. (2000) design a fraud signature, based on data of fraudulent calls, to detect telecommunications fraud. For scoring a call for fraud its probability under the account signature is compared to its probability under a fraud signature. The fraud signature is updated sequentially, enabling event-driven fraud detection. Link analysis comprehends a different approach. It relates known fraudsters to other individuals, using record linkage and social network methods. This type of detection is only able to detect fra