Account verification

Account verification is the process of verifying that a new or existing account is owned and operated by a specified real individual or organization. A number of websites, for example social media websites, offer account verification services. Verified accounts are often visually distinguished by check mark icons or badges next to the names of individuals or organizations. Account verification can enhance the quality of online services by mitigating sockpuppetry, bots, trolling, spam, vandalism, fake news, disinformation and election interference. == History == Account verification was introduced by Twitter in June 2009, initially as a feature for public figures and accounts of interest: individuals in "music, acting, fashion, government, politics, religion, journalism, media, sports, business and other key interest areas". Similar verification systems were adopted by Google+ in 2011, Instagram in 2014, Pinterest in 2015, Facebook pages in October 2015 (available in the United States, Canada, the United Kingdom, Australia and New Zealand), and Facebook profiles and pages worldwide in 2018. On YouTube, users are able to submit a request for a verification badge once they obtain 100,000 or more subscribers. YouTube also has an "official artist" badge for musicians and bands. In July 2016, Twitter announced that, beyond public figures, any individual would be able to apply for account verification. This was temporarily suspended in February 2018, following a backlash over the verification of one of the organisers of the far-right Unite the Right rally, due to a perception that verification conveys "credibility" or "importance". In March 2018, during a live-stream on Periscope, Jack Dorsey, co-founder and CEO of Twitter, discussed the idea of allowing any individual to get a verified account. Twitter reopened account verification applications in May 2021 after revamping its account verification criteria. 
This time, Twitter offered notability criteria for the following account categories: government; companies, brands, and organizations; news organizations and journalists; entertainment; sports; and activists, organizers, and other influential individuals. Instagram began allowing users to request verification in August 2018. In April 2018, Mark Zuckerberg, co-founder and CEO of Facebook, announced that purchasers of political or issue-based advertisements would be required to verify their identities and locations. He also indicated that Facebook would require individuals who manage large pages to be verified. In May 2018, Kent Walker, senior vice president of Google, announced that, in the United States, purchasers of political-leaning advertisements would need to verify their identities. In November 2022, following Elon Musk's acquisition of Twitter, a blue verification check mark was included with a paid Twitter Blue monthly membership. Prior to the acquisition, Twitter offered this check mark at no charge to confirmed high-profile users. On December 19, 2022, Twitter introduced two new check mark colors: gold for accounts from official businesses and organizations, and grey for accounts from governments or multilateral organizations. The type of check mark can be confirmed by visiting the profile page, then clicking or tapping on the check mark. == Techniques == === Identity verification services === Identity verification services are third-party solutions which can be used to ensure that a person provides information which is associated with the identity of a real person. Such services may verify the authenticity of identity documents such as driver's licenses or passports, called documentary verification, or may verify identity information against authoritative sources such as credit bureaus or government data, called nondocumentary verification. === Identity documents verification === The uploading of scanned or photographed identity documents is a practice in use, for example, at Facebook. 
According to Facebook, there are two reasons that a person would be asked to send a scan of or photograph of an ID to Facebook: to show account ownership and to confirm their name. In January 2018, Facebook purchased Confirm.io, a startup that was advancing technologies to verify the authenticity of identification documentation. === Biometric verification === === Behavioral verification === Behavioral verification is the computer-aided and automated detection and analysis of behaviors and patterns of behavior to verify accounts. Behaviors to detect include those of sockpuppets, bots, cyborgs, trolls, spammers, vandals, and sources and spreaders of fake news, disinformation and election interference. Behavioral verification processes can flag accounts as suspicious, exclude accounts from suspicion, or offer corroborating evidence for processes of account verification. === Bank account verification === Identity verification is required to establish bank accounts and other financial accounts in many jurisdictions. Verifying identity in the financial sector is often required by regulation such as Know Your Customer or Customer Identification Program. Accordingly, bank accounts can be of use as corroborating evidence when performing account verification. Bank account information can be provided when creating or verifying an account or when making a purchase. === Postal address verification === Postal address information can be provided when creating or verifying an account or when making and subsequently shipping a purchase. A hyperlink or code can be sent to a user by mail, recipients entering it on a website verifying their postal address. === Telephone number verification === A telephone number can be provided when creating or verifying an account or added to an account to obtain a set of features. 
During the process of verifying a telephone number, a confirmation code is sent to a phone number specified by a user, for example in an SMS message sent to a mobile phone. As the user receives the code sent, they can enter it on the website to confirm their receipt. === Email verification === An email account is often required to create an account. During this process, a confirmation hyperlink is sent in an email message to an email address specified by a person. The email recipient is instructed in the email message to navigate to the provided confirmation hyperlink if and only if they are the person creating an account. The act of navigating to the hyperlink confirms receipt of the email by the person. The added value of an email account for purposes of account verification depends upon the process of account verification performed by the specific email service provider. === Multi-factor verification === Multi-factor account verification is account verification which simultaneously utilizes a number of techniques. === Multi-party verification === The processes of account verification utilized by multiple service providers can corroborate one another. OpenID Connect includes a user information protocol which can be used to link multiple accounts, corroborating user information. == Account verification and good standing == On some services, account verification is synonymous with good standing. Twitter reserves the right to remove account verification from users' accounts at any time without notice. 
Reasons for removal may reflect behaviors on and off Twitter and include: promoting hate and/or violence against, or directly attacking or threatening, other people on the basis of race, ethnicity, national origin, sexual orientation, gender, gender identity, religious affiliation, age, disability, or disease; supporting organizations or individuals that promote the above; inciting or engaging in the harassment of others; violence and dangerous behavior; directly or indirectly threatening or encouraging any form of physical violence against an individual or any group of people, including threatening or promoting terrorism; violent, gruesome, shocking, or disturbing imagery; self-harm or suicide; and engaging in other activity on Twitter that violates the Twitter Rules. In April 2023, blue check marks were removed from all Twitter accounts that had not subscribed to Twitter Blue.
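
The confirmation-code flow described under telephone number verification and email verification above can be sketched in a few lines. This is a minimal illustration only, not any service's actual implementation: the function names, the six-digit code format, the ten-minute validity window and the in-memory store are all assumptions made for the example.

```python
import hmac
import secrets
import time

CODE_TTL_SECONDS = 600  # hypothetical validity window (10 minutes)

# In-memory store for illustration only: contact -> (code, expiry timestamp).
_pending = {}


def issue_code(contact, now=None):
    """Create a random 6-digit code for a phone number or email address.

    A real service would send the code by SMS or email instead of
    returning it to the caller.
    """
    now = time.time() if now is None else now
    code = f"{secrets.randbelow(10**6):06d}"
    _pending[contact] = (code, now + CODE_TTL_SECONDS)
    return code


def confirm_code(contact, submitted, now=None):
    """Return True only if the submitted code matches and has not expired."""
    now = time.time() if now is None else now
    entry = _pending.get(contact)
    if entry is None:
        return False
    code, expires = entry
    if now > expires:
        del _pending[contact]  # expired codes are discarded
        return False
    # Constant-time comparison avoids leaking how many digits matched.
    if hmac.compare_digest(code, submitted):
        del _pending[contact]  # a code can be used only once
        return True
    return False
```

Entering the code proves only that the user can read messages sent to that number or address; as the text notes, how much that is worth depends on how the underlying phone or email account was itself verified.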

Cloud management

Cloud management refers to the administration and oversight of cloud computing products and services. Public clouds are managed by cloud service providers, which operate the underlying infrastructure such as servers, storage, networking, and data center facilities. Users may also opt to manage their public cloud services with a third-party cloud management tool. Users of public cloud services can generally select from three basic cloud provisioning categories: User self-provisioning: Customers purchase cloud services directly from the provider, typically through a web form or console interface. The customer pays on a per-transaction basis. Advanced provisioning: Customers contract in advance a predetermined amount of resources, which are prepared in advance of service. The customer pays a flat fee or a monthly fee. Dynamic provisioning: The provider allocates resources when the customer needs them, then decommissions them when they are no longer needed. The customer is charged on a pay-per-use basis. Managing a private cloud requires software tools to help create a virtualized pool of compute resources, provide a self-service portal for end users and handle security, resource allocation, tracking and billing. Management tools for private clouds tend to be service driven, as opposed to resource driven, because cloud environments are typically highly virtualized and organized in terms of portable workloads. In hybrid cloud environments, compute, network and storage resources must be managed across multiple domains, so a good management strategy should start by defining what needs to be managed, and where and how to do it. Policies to help govern these domains should include configuration and installation of images, access control, and budgeting and reporting. Access control often includes the use of Single sign-on (SSO), in which a user logs in once and gains access to all systems without being prompted to log in again at each of them. 
== Characteristics of Cloud Management == Cloud management combines software and technologies in a design for managing cloud environments. Software developers have responded to the management challenges of cloud computing with a variety of cloud management platforms and tools. These tools include native tools offered by public cloud providers as well as third-party tools designed to provide consistent functionality across multiple cloud providers. Administrators must balance the competing requirements of consistency across different cloud platforms and access to the native functionality of individual cloud platforms. The growing acceptance of public cloud and increased multicloud usage are driving the need for consistent cross-platform management. Rapid adoption of cloud services is introducing a new set of management challenges for the technical professionals responsible for managing IT systems and services. Cloud management platforms and tools should provide at least minimum functionality in the following categories. Functionality can be either natively provided or orchestrated via third-party integration. Provisioning and orchestration: create, modify, and delete resources as well as orchestrate workflows and management of workloads. Automation: enable cloud consumption and deployment of app services via infrastructure-as-code and other DevOps concepts. Security and compliance: manage role-based access of cloud services and enforce security configurations. Service request: collect and fulfill requests from users to access and deploy cloud resources. 
Monitoring and logging: collect performance and availability metrics as well as automate incident management and log aggregation Inventory and classification: discover and maintain pre-existing brownfield cloud resources plus monitor and manage changes Cost management and optimization: track and rightsize cloud spend and align capacity and performance to actual demand Migration, backup, and DR: enable data protection, disaster recovery, and data mobility via snapshots and/or data replication Organizations may group these criteria into key use cases including Cloud Brokerage, DevOps Automation, Governance, and Day-2 Life Cycle Operations. Enterprises with large-scale cloud implementations may require more robust cloud management tools which include specific characteristics, such as the ability to manage multiple platforms from a single point of reference, or intelligent analytics to automate processes like application lifecycle management. High-end cloud management tools should also have the ability to handle system failures automatically with capabilities such as self-monitoring, an explicit notification mechanism, and include failover and self-healing capabilities. == Multi-Cloud and Hybrid Cloud Management Challenges == Legacy management infrastructures, which are based on the concept of dedicated system relationships and architecture constructs, are not well suited to cloud environments where instances are continually launched and decommissioned. Instead, the dynamic nature of cloud computing requires monitoring and management tools that are adaptable, extensible and customizable. Cloud computing presents a number of management challenges. Companies using public clouds do not have ownership of the equipment hosting the cloud environment, and because the environment is not contained within their own networks, public cloud customers do not have full visibility or control. 
Users of public cloud services must also integrate with an architecture defined by the cloud provider, using its specific parameters for working with cloud components. Integration includes tying into the cloud APIs for configuring IP addresses, subnets, firewalls and data service functions for storage. Because control of these functions is based on the cloud provider’s infrastructure and services, public cloud users must integrate with the cloud infrastructure management. Capacity management is a challenge for both public and private cloud environments because end users have the ability to deploy applications using self-service portals. Applications of all sizes may appear in the environment, consume an unpredictable amount of resources, then disappear at any time. A possible solution is profiling the applications' impact on computational resources. The resulting performance models allow the prediction of how resource utilization changes according to application patterns. Thus, resources can be dynamically scaled to meet the expected demand. This is critical to cloud providers that need to provision resources quickly to meet growing demand from their applications. Charge-back, or pricing resource use on a granular basis, is a challenge for both public and private cloud environments. Charge-back is a challenge for public cloud service providers because they must price their services competitively while still creating profit. Users of public cloud services may find charge-back challenging because it is difficult for IT groups to assess actual resource costs on a granular basis when resources overlap within an organization and may be paid for by an individual business unit, such as electrical power. For private cloud operators, charge-back is fairly straightforward, but the challenge lies in guessing how to allocate resources as closely as possible to actual resource usage to achieve the greatest operational efficiency. Exceeding budgets can be a risk. 
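
As a rough illustration of charge-back, the sketch below prices metered usage per business unit and flags units that exceed a budget. It is a simplified sketch under stated assumptions: the resource names, unit prices and record layout are hypothetical and do not correspond to any provider's billing API.

```python
from collections import defaultdict

# Hypothetical unit prices; real prices vary by provider, region and tier.
RATES = {
    "cpu_hours": 0.05,
    "gb_storage_month": 0.02,
    "gb_egress": 0.09,
}


def charge_back(usage_records, rates=RATES):
    """Sum cost per business unit.

    Each record is a (business_unit, resource, quantity) tuple.
    """
    totals = defaultdict(float)
    for unit, resource, quantity in usage_records:
        totals[unit] += rates[resource] * quantity
    return dict(totals)


def over_budget(totals, budgets):
    """Return the business units whose cost exceeds their budget.

    Units without a budget entry are treated as unconstrained.
    """
    return sorted(
        unit for unit, cost in totals.items()
        if cost > budgets.get(unit, float("inf"))
    )
```

Shared costs that are not metered per unit, such as the electrical power mentioned above, would need an allocation rule on top of this, which is where granular charge-back becomes difficult in practice.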
Hybrid cloud environments, which combine public and private cloud services, sometimes with traditional infrastructure elements, present their own set of management challenges. These include security concerns if sensitive data lands on public cloud servers, budget concerns around overuse of storage or bandwidth and proliferation of mismanaged images. Managing the information flow in a hybrid cloud environment is also a significant challenge. On-premises clouds must share information with applications hosted off-premises by public cloud providers, and this information may change constantly. Hybrid cloud environments also typically include a complex mix of policies, permissions and limits that must be managed consistently across both public and private clouds. == Cloud Management Platforms (CMP) == CMPs provide a means for a cloud service customer to manage the deployment and operation of applications and associated datasets across multiple cloud service infrastructures, including both on-premises cloud infrastructure and public cloud service provider infrastructure. In other words, CMPs provide management capabilities for hybrid cloud and multi-cloud environments. A cloud management platform (CMP) provides broad cloud management functionality atop both public cloud provider platforms and private cloud platforms. CMPs manage cloud services and resources that are distributed across multiple cloud platforms. The value of CMPs stands in delivering the maximum level of consistency between platforms without comp

Multinomial logistic regression

In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.). Multinomial logistic regression is known by a variety of other names, including polytomous LR, multiclass LR, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model. == Background == Multinomial logistic regression is used when the dependent variable in question is nominal (equivalently categorical, meaning that it falls into any one of a set of categories that cannot be ordered in any meaningful way) and for which there are more than two categories. Some examples would be: Which major will a college student choose, given their grades, stated likes and dislikes, etc.? Which blood type does a person have, given the results of various diagnostic tests? In a hands-free mobile phone dialing application, which person's name was spoken, given various properties of the speech signal? Which candidate will a person vote for, given particular demographic characteristics? Which country will a firm locate an office in, given the characteristics of the firm and of the various candidate countries? These are all statistical classification problems. They all have in common a dependent variable to be predicted that comes from one of a limited set of items that cannot be meaningfully ordered, as well as a set of independent variables (also known as features, explanators, etc.), which are used to predict the dependent variable. 
Multinomial logistic regression is a particular solution to classification problems that use a linear combination of the observed features and some problem-specific parameters to estimate the probability of each particular value of the dependent variable. The best values of the parameters for a given problem are usually determined from some training data (e.g. some people for whom both the diagnostic test results and blood types are known, or some examples of known words being spoken). == Assumptions == The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case. As with other types of regression, there is no need for the independent variables to be statistically independent from each other (unlike, for example, in a naive Bayes classifier); however, collinearity is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if this is not the case. If the multinomial logit is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility. This allows the choice of K alternatives to be modeled as a set of K − 1 independent binary choices, in which one alternative is chosen as a "pivot" and the other K − 1 compared against it, one at a time. The IIA hypothesis is a core hypothesis in rational choice theory; however numerous studies in psychology show that individuals often violate this assumption when making choices. An example of a problem case arises if choices include a car and a blue bus. Suppose the odds ratio between the two is 1 : 1. 
Now if the option of a red bus is introduced, a person may be indifferent between a red and a blue bus, and hence may exhibit a car : blue bus : red bus odds ratio of 1 : 0.5 : 0.5, thus maintaining a 1 : 1 ratio of car : any bus while adopting a changed car : blue bus ratio of 1 : 0.5. Here the red bus option was not in fact irrelevant, because a red bus was a perfect substitute for a blue bus. If the multinomial logit is used to model choices, it may in some situations impose too much constraint on the relative preferences between the different alternatives. It is especially important to take into account if the analysis aims to predict how choices would change if one alternative were to disappear (for instance if one political candidate withdraws from a three candidate race). Other models like the nested logit or the multinomial probit may be used in such cases as they allow for violation of the IIA. == Model == === Introduction === There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression. This can make it difficult to compare different treatments of the subject in different texts. The article on logistic regression presents a number of equivalent formulations of simple logistic regression, and many of these have analogues in the multinomial logit model. 
The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that computes a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation using a dot product: $\operatorname{score}(\mathbf{X}_i, k) = \boldsymbol{\beta}_k \cdot \mathbf{X}_i,$ where $\mathbf{X}_i$ is the vector of explanatory variables describing observation $i$, $\boldsymbol{\beta}_k$ is a vector of weights (or regression coefficients) corresponding to outcome $k$, and $\operatorname{score}(\mathbf{X}_i, k)$ is the score associated with assigning observation $i$ to category $k$. In discrete choice theory, where observations represent people and outcomes represent choices, the score is considered the utility associated with person $i$ choosing outcome $k$. The predicted outcome is the one with the highest score. The difference between the multinomial logit model and numerous other methods, models, algorithms, etc. with the same basic setup (the perceptron algorithm, support vector machines, linear discriminant analysis, etc.) is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. In particular, in the multinomial logit model, the score can directly be converted to a probability value, indicating the probability of observation $i$ choosing outcome $k$ given the measured characteristics of the observation. This provides a principled way of incorporating the prediction of a particular multinomial logit model into a larger procedure that may involve multiple such predictions, each with a possibility of error. Without such means of combining predictions, errors tend to multiply. For example, imagine a large predictive model that is broken down into a series of submodels where the prediction of a given submodel is used as the input of another submodel, and that prediction is in turn used as the input into a third submodel, etc. 
If each submodel has 90% accuracy in its predictions, and there are five submodels in series, then the overall model has only $0.9^5 \approx 59\%$ accuracy. If each submodel has 80% accuracy, then overall accuracy drops to $0.8^5 \approx 33\%$. This issue is known as error propagation and is a serious problem in real-world predictive models, which are usually composed of numerous parts. Predicting probabilities of each possible outcome, rather than simply making a single optimal prediction, is one means of alleviating this issue. === Setup === The basic setup is the same as in logistic regression, the only difference being that the dependent variables are categorical rather than binary, i.e. there are $K$ possible outcomes rather than just two. The following description is somewhat shortened; for more details, consult the logistic regression article. ==== Data points ==== Specifically, it is assumed that we have a series of $N$ observed data points. Each data point $i$ (ranging from 1 to $N$) consists of a set of $M$ explanatory variables $x_{1,i}, \ldots, x_{M,i}$ (also known as independent variables, predictor variables, features, etc.), and an associated categorical outcome $Y_i$ (also known as dependent variable, response variable), which can take on one of $K$ possible values. These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to $K$. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of $N$ "experiments", although an "experiment" may consist of nothing more than gathering data. The goal of multinomial logistic regression is to construct a model that explains the relationship between the explanatory variables and the outcome, so tha
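
The linear score and its conversion to probabilities, along with the error-propagation arithmetic from the Introduction, can be sketched as follows. This is a minimal sketch rather than a fitted model: the weights are supplied by hand, not learned, and the function names are hypothetical. The score-to-probability step uses the softmax normalization that gives softmax regression its name.

```python
import math


def scores(x, betas):
    """Linear score beta_k . x for every outcome k."""
    return [sum(b_j * x_j for b_j, x_j in zip(beta, x)) for beta in betas]


def softmax(s):
    """Turn scores into probabilities that sum to 1."""
    m = max(s)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(x, betas):
    """Probability of each outcome for one observation x."""
    return softmax(scores(x, betas))


def chained_accuracy(per_stage, stages):
    """Accuracy of submodels in series, assuming independent errors."""
    return per_stage ** stages
```

Because each probability is proportional to the exponential of its own score, the ratio between any two outcome probabilities depends only on those two scores. Adding or removing a third outcome leaves that ratio unchanged, which is the independence of irrelevant alternatives property discussed under Assumptions.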

LogitBoost

In machine learning and computational learning theory, LogitBoost is a boosting algorithm formulated by Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The original paper casts the AdaBoost algorithm into a statistical framework. Specifically, if one considers AdaBoost as a generalized additive model and then applies the cost function of logistic regression, one can derive the LogitBoost algorithm. == Minimizing the LogitBoost cost function == LogitBoost can be seen as a convex optimization. Specifically, given that we seek an additive model of the form $f = \sum_{t} \alpha_t h_t$, the LogitBoost algorithm minimizes the logistic loss: $\sum_{i} \log\left(1 + e^{-y_i f(x_i)}\right)$
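
To make the objective concrete, the sketch below evaluates the logistic loss of an additive model on a labelled sample. It only computes the cost function given above and does not implement LogitBoost's fitting procedure. The function names are hypothetical, and labels are assumed to be encoded as -1 or +1 so that the exponent $-y_i f(x_i)$ is meaningful.

```python
import math


def additive_model(weak_learners, alphas):
    """Build f(x) = sum_t alpha_t * h_t(x) from weak learners h_t."""
    def f(x):
        return sum(a * h(x) for a, h in zip(alphas, weak_learners))
    return f


def logistic_loss(f, xs, ys):
    """Sum over samples of log(1 + exp(-y * f(x))), with y in {-1, +1}."""
    return sum(math.log1p(math.exp(-y * f(x))) for x, y in zip(xs, ys))
```

A model that outputs 0 everywhere has loss $\log 2$ per sample. Each correctly classified sample contributes less than $\log 2$, and its contribution shrinks towards 0 as the margin $y_i f(x_i)$ grows.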

Weighted majority algorithm (machine learning)

In machine learning, the weighted majority algorithm (WMA) is a meta-learning algorithm used to construct a compound algorithm from a pool of prediction algorithms, which could be any type of learning algorithms, classifiers, or even real human experts. The algorithm assumes that we have no prior knowledge about the accuracy of the algorithms in the pool, but that there are sufficient reasons to believe that one or more will perform well. Assume that the problem is a binary decision problem. To construct the compound algorithm, a positive weight is given to each of the algorithms in the pool. The compound algorithm then collects weighted votes from all the algorithms in the pool and gives the prediction that has the higher vote. If the compound algorithm makes a mistake, the weights of the algorithms in the pool that contributed to the wrong prediction are discounted by a certain ratio $\beta$, where $0 < \beta < 1$. It can be shown that the number of mistakes made by the compound algorithm in a given sequence of predictions from a pool of algorithms $\mathbf{A}$ is bounded above by $O(\log|\mathbf{A}| + m)$ if some algorithm in $\mathbf{A}$ makes at most $m$ mistakes. There are many variations of the weighted majority algorithm to handle different situations, like shifting targets, infinite pools, or randomized predictions. The core mechanism remains similar, with the final performance of the compound algorithm bounded by a function of the performance of the specialist (best-performing algorithm) in the pool.
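
The procedure described above can be sketched as follows for binary predictions encoded as 0 and 1. This is an illustrative sketch, not a reference implementation: the function name, the initial weight of 1 for every expert, the default $\beta = 0.5$ and the rule of breaking ties in favour of 1 are assumptions made for the example.

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Run the weighted majority vote over a sequence of rounds.

    expert_predictions[t][j] is expert j's prediction (0 or 1) in round t,
    and outcomes[t] is the true label of round t.
    Returns the number of mistakes made by the compound algorithm and the
    final expert weights.
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts  # no prior knowledge: equal positive weights
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        vote_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_0 = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_1 >= vote_0 else 0  # ties go to 1 (assumption)
        if guess != outcome:
            mistakes += 1
            # Discount only the experts that voted for the wrong answer.
            weights = [
                w * beta if p != outcome else w
                for w, p in zip(weights, preds)
            ]
    return mistakes, weights
```

Experts that are repeatedly wrong lose weight geometrically, so a consistently accurate expert eventually dominates the vote. This is the intuition behind the mistake bound stated above.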

Single particle analysis

Single particle analysis is a group of related computerized image processing techniques used to analyze images from transmission electron microscopy (TEM). These methods were developed to improve and extend the information obtainable from TEM images of particulate samples, typically proteins or other large biological entities such as viruses. Individual images of stained or unstained particles are very noisy, making interpretation difficult. Combining several digitized images of similar particles together gives an image with stronger and more easily interpretable features. An extension of this technique uses single particle methods to build up a three-dimensional reconstruction of the particle. Using cryo-electron microscopy, it has become possible to generate reconstructions with sub-nanometer, near-atomic resolution, first in the case of highly symmetric viruses, and now in smaller, asymmetric proteins as well. == Techniques == Single particle analysis can be done on both negatively stained and vitreous ice-embedded transmission electron cryomicroscopy (CryoTEM) samples. Single particle analysis methods are, in general, reliant on the sample being homogeneous, although techniques for dealing with conformational heterogeneity are being developed. Images (micrographs) are taken with an electron microscope using charge-coupled device (CCD) detectors coupled to a phosphorescent layer (in the past, they were instead collected on film and digitized using high-quality scanners). The image processing is carried out using specialized software programs, often run on multi-processor computer clusters. Depending on the sample or the desired results, various steps of two- or three-dimensional processing can be done. === Alignment and classification === Biological samples, and especially samples embedded in thin vitreous ice, are highly radiation sensitive, thus only low electron doses can be used to image the sample. 
This low dose, as well as variations in the metal stain used (if used) means images have high noise relative to the signal given by the particle being observed. By aligning several similar images to each other so they are in register and then averaging them, an image with higher signal-to-noise ratio can be obtained. As the noise is mostly randomly distributed and the underlying image features constant, by averaging the intensity of each pixel over several images only the constant features are reinforced. Typically, the optimal alignment (a translation and an in-plane rotation) to map one image onto another is calculated by cross-correlation. However, a micrograph often contains particles in multiple different orientations and/or conformations, and so to get more representative image averages, a method is required to group similar particle images together into multiple sets. This is normally carried out using one of several data analysis and image classification algorithms, such as multi-variate statistical analysis and hierarchical ascendant classification, or k-means clustering. Often data sets of tens of thousands of particle images are used, and to reach an optimal solution an iterative procedure of alignment and classification is used, whereby strong image averages produced by classification are used as reference images for a subsequent alignment of the whole data set. === Image filtering === Image filtering (band-pass filtering) is often used to reduce the influence of high and/or low spatial frequency information in the images, which can affect the results of the alignment and classification procedures. This is particularly useful in negative stain images. The algorithms make use of fast Fourier transforms (FFT), often employing Gaussian shaped soft-edged masks in reciprocal space to suppress certain frequency ranges. High-pass filters remove low spatial frequencies (such as ramp or gradient effects), leaving the higher frequencies intact. 
Low-pass filters remove high spatial frequency features and have a blurring effect on fine details. === Contrast transfer function === Due to the nature of image formation in the electron microscope, bright-field TEM images are obtained using significant underfocus. This, along with features inherent in the microscope's lens system, creates blurring of the collected images visible as a point spread function. The combined effects of the imaging conditions are known as the contrast transfer function (CTF), and can be approximated mathematically as a function in reciprocal space. Specialized image processing techniques such as phase flipping and amplitude correction / Wiener filtering can (at least partially) correct for the CTF, and allow high resolution reconstructions. === Three-dimensional reconstruction === Transmission electron microscopy images are projections of the object showing the distribution of density through the object, similar to medical X-rays. By making use of the projection-slice theorem a three-dimensional reconstruction of the object can be generated by combining many images (2D projections) of the object taken from a range of viewing angles. Proteins in vitreous ice ideally adopt a random distribution of orientations (or viewing angles), allowing a fairly isotropic reconstruction if a large number of particle images are used. This contrasts with electron tomography, where the viewing angles are limited due to the geometry of the sample/imaging set up, giving an anisotropic reconstruction. Filtered back projection is a commonly used method of generating 3D reconstructions in single particle analysis, although many alternative algorithms exist. Before a reconstruction can be made, the orientation of the object in each image needs to be estimated. Several methods have been developed to work out the relative Euler angles of each image. 
Some are based on common lines (common 1D projections and sinograms), others use iterative projection matching algorithms. The latter begin with a simple, low-resolution 3D starting model, compare the experimental images to projections of that model, and create a new 3D model from the result, bootstrapping towards a solution. Methods are also available for making 3D reconstructions of helical samples (such as tobacco mosaic virus), taking advantage of the inherent helical symmetry. Both real space methods (treating sections of the helix as single particles) and reciprocal space methods (using diffraction patterns) can be used for these samples. === Tilt methods === The specimen stage of the microscope can be tilted (typically along a single axis), allowing the single particle technique known as random conical tilt. An area of the specimen is imaged at both zero tilt and high tilt (~60-70 degrees), or in the case of the related method of orthogonal tilt reconstruction, at +45 and −45 degrees. Pairs of particles corresponding to the same object at two different tilts (tilt pairs) are selected, and by following the parameters used in subsequent alignment and classification steps a three-dimensional reconstruction can be generated relatively easily. This is because the viewing angle (defined as three Euler angles) of each particle is known from the tilt geometry. 3D reconstructions from random conical tilt suffer from missing information resulting from a restricted range of orientations. Known as the missing cone (due to its shape in reciprocal space), this causes distortions in the 3D maps. However, the missing cone problem can often be overcome by combining several tilt reconstructions. Tilt methods are best suited to negatively stained samples, and can be used for particles that adsorb to the carbon support film in preferred orientations. The phenomenon known as charging or beam-induced movement makes collecting high-tilt images of samples in vitreous ice challenging.
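The size of the missing cone described above can be estimated with a short geometric sketch. It assumes an idealized random conical tilt series: a single fixed tilt angle, all in-plane azimuths sampled evenly, and each image contributing one central slice in reciprocal space. Under those assumptions the unsampled region is a double cone of half-angle (90° minus the tilt angle) around the axis normal to the untilted specimen plane, and its share of reciprocal space, measured as a solid-angle fraction, is 1 − cos(half-angle). The function name `missing_cone_fraction` is hypothetical and chosen only for this illustration.

```python
import math


def missing_cone_fraction(tilt_deg: float) -> float:
    """Solid-angle fraction of reciprocal space left unsampled by an
    idealized conical tilt series at a fixed tilt angle (in degrees)."""
    if not 0.0 <= tilt_deg <= 90.0:
        raise ValueError("tilt angle must lie between 0 and 90 degrees")
    # Half-angle of the missing cone, measured from the untilted specimen normal.
    half_angle = math.radians(90.0 - tilt_deg)
    # A double cone of half-angle a covers a fraction 1 - cos(a) of the sphere.
    return 1.0 - math.cos(half_angle)


if __name__ == "__main__":
    for tilt in (45, 60, 70):
        fraction = missing_cone_fraction(tilt)
        print(f"tilt {tilt:2d} deg -> missing fraction {fraction:.3f}")
```

Higher tilt angles shrink the cone, which is one reason high-angle tilts are used where the specimen allows it. Real data sets deviate from this idealization, for example through uneven azimuthal sampling or specimen thickness effects at high tilt.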
=== Map visualization and fitting === Various software programs are available that allow viewing the 3D maps. These often enable the user to manually dock in protein coordinates (structures from X-ray crystallography, NMR, or a computational model such as one found in the AlphaFold Protein Structure Database) of subunits into the electron density. Several programs can also fit subunits computationally; as of the 2020s, these programs tend to produce better accuracy than manual docking because they can perform labor-intensive tasks such as: The scale of SPA-derived maps depends on knowing the pixel size (angstroms per pixel), which is not always accurate. Programs can automatically correct for this difference by using coordinate data or by using knowledge of chemical bonds. Many proteins are made up of several roughly rigid protein domains linked by flexible parts. Pre-existing coordinate data, whether experimental or computational, may not exactly match the inter-domain positioning of the cryo-EM map. Modern programs can automatically "chop" pre-existing coordinate data into individual domains and fit them in individually. For higher-resolution structures, it is pos

Evolutionary algorithm

Evolutionary algorithms (EA) reproduce essential elements of biological evolution in a computer algorithm in order to solve "difficult" problems, at least approximately, for which no exact or satisfactory solution methods are known. They are metaheuristics and population-based, bio-inspired algorithms that belong to evolutionary computation, which is itself part of the field of computational intelligence. The mechanisms of biological evolution that an EA mainly imitates are reproduction, mutation, recombination and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the quality of the solutions (see also loss function). Evolution of the population then takes place through the repeated application of the above operators. Evolutionary algorithms often perform well at approximating solutions to many types of problems because, ideally, they make no assumption about the underlying fitness landscape. Techniques from evolutionary algorithms applied to the modeling of biological evolution are generally limited to explorations of microevolution (microevolutionary processes) and planning models based upon cellular processes. In most real applications of EAs, computational complexity is a prohibitive factor, and this complexity arises mainly from fitness function evaluation. Fitness approximation is one way to overcome this difficulty. However, seemingly simple EAs can often solve complex problems; therefore, there may be no direct link between algorithm complexity and problem complexity. == Generic definition == The following is an example of a generic evolutionary algorithm:
1. Randomly generate the initial population of individuals, the first generation.
2. Evaluate the fitness of each individual in the population.
3. Check whether the goal has been reached and the algorithm can be terminated.
4. Select individuals as parents, preferably of higher fitness.
5. Produce offspring with optional crossover (mimicking reproduction).
6. Apply mutation operations on the offspring.
7. Select individuals, preferably of lower fitness, for replacement with new individuals (mimicking natural selection).
8. Return to step 2.
== Types == Similar techniques differ in genetic representation and other implementation details, and the nature of the particular applied problem. Genetic algorithm – This is the most popular type of EA. One seeks the solution of a problem in the form of strings of numbers (traditionally binary, although the best representations are usually those that reflect something about the problem being solved), by applying operators such as recombination and mutation (sometimes one, sometimes both). This type of EA is often used in optimization problems. Genetic programming – Here the solutions are in the form of computer programs, and their fitness is determined by their ability to solve a computational problem. There are many variants of genetic programming, including Cartesian genetic programming, gene expression programming, grammatical evolution, linear genetic programming and multi expression programming. Evolutionary programming – Similar to evolution strategy, but with a deterministic selection of all parents. Evolution strategy (ES) – Works with vectors of real numbers as representations of solutions, and typically uses self-adaptive mutation rates. The method is mainly used for numerical optimization, although there are also variants for combinatorial tasks. Variants include CMA-ES and natural evolution strategy. Differential evolution – Based on vector differences and is therefore primarily suited for numerical optimization problems. Coevolutionary algorithm – Similar to genetic algorithms and evolution strategies, but the created solutions are compared on the basis of their outcomes from interactions with other solutions. Solutions can either compete or cooperate during the search process.
Coevolutionary algorithms are often used in scenarios where the fitness landscape is dynamic, complex, or involves competitive interactions. Neuroevolution – Similar to genetic programming but the genomes represent artificial neural networks by describing structure and connection weights. The genome encoding can be direct or indirect. Learning classifier system – Here the solution is a set of classifiers (rules or conditions). A Michigan-LCS evolves at the level of individual classifiers whereas a Pittsburgh-LCS uses populations of classifier-sets. Initially, classifiers were only binary, but now include real, neural net, or S-expression types. Fitness is typically determined with either a strength or accuracy based reinforcement learning or supervised learning approach. Quality–Diversity algorithms – QD algorithms simultaneously aim for high-quality and diverse solutions. Unlike traditional optimization algorithms that solely focus on finding the best solution to a problem, QD algorithms explore a wide variety of solutions across a problem space and keep those that are not just high performing, but also diverse and unique. == Theoretical background == The following theoretical principles apply to all or almost all EAs. === No free lunch theorem === The no free lunch theorem of optimization states that all optimization strategies are equally effective when the set of all optimization problems is considered. Under the same condition, no evolutionary algorithm is fundamentally better than another. This can only be the case if the set of all problems is restricted. This is exactly what is inevitably done in practice. Therefore, to improve an EA, it must exploit problem knowledge in some form (e.g. by choosing a certain mutation strength or a problem-adapted coding). Thus, if two EAs are compared, this constraint is implied. 
In addition, an EA can use problem-specific knowledge by, for example, not randomly generating the entire start population, but creating some individuals through heuristics or other procedures. Another possibility to tailor an EA to a given problem domain is to involve suitable heuristics, local search procedures or other problem-related procedures in the process of generating the offspring. This form of extension of an EA is also known as a memetic algorithm. Both extensions play a major role in practical applications, as they can speed up the search process and make it more robust. === Convergence === For EAs in which, in addition to the offspring, at least the best individual of the parent generation is used to form the subsequent generation (so-called elitist EAs), there is a general proof of convergence under the condition that an optimum exists. Without loss of generality, a maximum search is assumed for the proof: From the property of elitist offspring acceptance and the existence of the optimum it follows that per generation $k$ an improvement of the fitness $F$ of the respective best individual $x'$ will occur with a probability $P > 0$. Thus: $F(x'_{1}) \leq F(x'_{2}) \leq F(x'_{3}) \leq \cdots \leq F(x'_{k}) \leq \cdots$ I.e., the fitness values represent a monotonically non-decreasing sequence, which is bounded due to the existence of the optimum. From this follows the convergence of the sequence towards the optimum. Since the proof makes no statement about the speed of convergence, it is of little help in practical applications of EAs. But it does justify the recommendation to use elitist EAs. However, when using the usual panmictic population model, elitist EAs tend to converge prematurely more than non-elitist ones.
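The monotonic behaviour of the best fitness under elitism can be illustrated with a minimal sketch of a generic elitist EA. Everything specific here is an assumption made for illustration and not taken from the text above: the toy objective (the negated sum of squares, maximized at the origin), binary tournament selection of parents, intermediate recombination, Gaussian mutation, and the parameter values. Parent selection draws from the whole population, so the sketch corresponds to a panmictic model.

```python
import random


def fitness(x):
    # Toy objective (assumption): maximum value 0 at the origin.
    return -sum(v * v for v in x)


def elitist_ea(dim=5, pop_size=20, generations=50, sigma=0.3, seed=0):
    """Run a small elitist EA; return the best individual and the
    best fitness recorded at the start of each generation and at the end."""
    rng = random.Random(seed)
    # Randomly generate the initial population.
    population = [[rng.uniform(-5, 5) for _ in range(dim)]
                  for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        # Evaluate and rank the population, best individual first.
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        elite = population[0]  # elitism: the best parent always survives
        offspring = []
        while len(offspring) < pop_size - 1:
            # Parent selection by binary tournament (favours higher fitness).
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            # Intermediate (arithmetic mean) recombination.
            child = [(a + b) / 2 for a, b in zip(p1, p2)]
            # Gaussian mutation of every gene.
            child = [v + rng.gauss(0, sigma) for v in child]
            offspring.append(child)
        # Replacement: all non-elite individuals are replaced by offspring.
        population = [elite] + offspring
    population.sort(key=fitness, reverse=True)
    best_history.append(fitness(population[0]))
    return population[0], best_history


if __name__ == "__main__":
    best, history = elitist_ea()
    # Elitism guarantees the best fitness never decreases between generations.
    assert all(a <= b for a, b in zip(history, history[1:]))
    print("best fitness after the final generation:", history[-1])
```

Because the best individual is copied unchanged into the next generation, the recorded best fitness cannot decrease, whatever the other operators do. The sketch says nothing about how quickly the sequence approaches the optimum, in line with the limitation of the proof noted above.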
In a panmictic population model, mate selection (see step 4 of the generic definition) is such that every individual in the entire population is eligible as a mate. In non-panmictic populations, selection is suitably restricted, so that the dispersal speed of better individuals is reduced compared to panmictic ones. Thus, the general risk of premature convergence of elitist EAs can be significantly reduced by suitable population models that restrict mate selection. === Virtual alphabets === With the theory of virtual alphabets, David E. Goldberg showed in 1990 that by using a representation with real numbers, an EA that uses classical recombination operators (e.g. uniform or n-point crossover) cannot reach certain areas of the search space, in contrast to a coding with binary numbers. This results in the recommendation for EAs with real representation to use arithmetic operators for recombination (e.g. arithmetic mean or intermediate recombination). With suitable operators, real-valued representations are more effective than binary ones, contrary to earlier opinion. == Comparison to other concepts == === Biological processes === A possible limitation of many evolutionary algorithms is their lack of a clear genotype–phenotype distinction. In nature, the fertilized egg cell undergoes a complex process known as embryogenesis to become a mature p