In concurrency control of databases, transaction processing (transaction management), and other transactional distributed applications, global serializability (or modular serializability) is a property of a global schedule of transactions. A global schedule is the unified schedule of all the individual database (and other transactional object) schedules in a multidatabase environment (e.g., federated database). Complying with global serializability means that the global schedule is serializable, has the serializability property, while each component database (module) has a serializable schedule as well. In other words, a collection of serializable components provides overall system serializability, which is usually incorrect. A need in correctness across databases in multidatabase systems makes global serializability a major goal for global concurrency control (or modular concurrency control). With the proliferation of the Internet, Cloud computing, Grid computing, and small, portable, powerful computing devices (e.g., smartphones), as well as increase in systems management sophistication, the need for atomic distributed transactions and thus effective global serializability techniques, to ensure correctness in and among distributed transactional applications, seems to increase. In a federated database system or any other more loosely defined multidatabase system, which are typically distributed in a communication network, transactions span multiple (and possibly distributed) databases. Enforcing global serializability in such system, where different databases may use different types of concurrency control, is problematic. Even if every local schedule of a single database is serializable, the global schedule of a whole system is not necessarily serializable. The massive communication exchanges of conflict information needed between databases to reach conflict serializability globally would lead to unacceptable performance, primarily due to computer and communication latency. Achieving global serializability effectively over different types of concurrency control has been open for several years. == The global serializability problem == === Problem statement === The difficulties described above translate into the following problem: Find an efficient (high-performance and fault tolerant) method to enforce Global serializability (global conflict serializability) in a heterogeneous distributed environment of multiple autonomous database systems. The database systems may employ different concurrency control methods. No limitation should be imposed on the operations of either local transactions (confined to a single database system) or global transactions (span two or more database systems). === Quotations === Lack of an appropriate solution for the global serializability problem has driven researchers to look for alternatives to serializability as a correctness criterion in a multidatabase environment (e.g., see Relaxing global serializability below), and the problem has been characterized as difficult and open. The following two quotations demonstrate the mindset about it by the end of the year 1991, with similar quotations in numerous other articles: "Without knowledge about local as well as global transactions, it is highly unlikely that efficient global concurrency control can be provided... Additional complications occur when different component DBMSs [Database Management Systems] and the FDBMSs [Federated Database Management Systems] support different concurrency mechanisms... It is unlikely that a theoretically elegant solution that provides conflict serializability without sacrificing performance (i.e., concurrency and/or response time) and availability exists." === Proposed solutions === Several solutions, some partial, have been proposed for the global serializability problem. Among them: Global conflict graph (serializability graph, precedence graph) checking Distributed Two-phase locking (Distributed 2PL) Distributed Timestamp ordering Tickets (local logical timestamps which define local total orders, and are propagated to determine global partial order of transactions) == Relaxing global serializability == Some techniques have been developed for relaxed global serializability (i.e., they do not guarantee global serializability; see also Relaxing serializability). Among them (with several publications each): Quasi serializability Two-level serializability Another common reason nowadays for Global serializability relaxation is the requirement of availability of internet products and services. This requirement is typically answered by large scale data replication. The straightforward solution for synchronizing replicas' updates of a same database object is including all these updates in a single atomic distributed transaction. However, with many replicas such a transaction is very large, and may span several computers and networks that some of them are likely to be unavailable. Thus such a transaction is likely to end with abort and miss its purpose. Consequently, Optimistic replication (Lazy replication) is often utilized (e.g., in many products and services by Google, Amazon, Yahoo, and alike), while global serializability is relaxed and compromised for eventual consistency. In this case relaxation is done only for applications that are not expected to be harmed by it. Classes of schedules defined by relaxed global serializability properties either contain the global serializability class, or are incomparable with it. What differentiates techniques for relaxed global conflict serializability (RGCSR) properties from those of relaxed conflict serializability (RCSR) properties that are not RGCSR is typically the different way global cycles (span two or more databases) in the global conflict graph are handled. No distinction between global and local cycles exists for RCSR properties that are not RGCSR. RCSR contains RGCSR. Typically RGCSR techniques eliminate local cycles, i.e., provide local serializability (which can be achieved effectively by regular, known concurrency control methods); however, obviously they do not eliminate all global cycles (which would achieve global serializability).
SAP Cloud Infrastructure
SAP Cloud Infrastructure is an SAP-operated IaaS cloud platform, used to run SAP’s cloud business and customer-facing deployments for SAP and non-SAP workloads. It is developed and operated with open-source technologies within SAP’s data center network, based on OpenStack and Kubernetes and supporting SAP S/4HANA and general-purpose applications. It offers compute, storage, and platform services that are accessible to SAP customers. == History == In 2012, SAP promoted aspects of cloud computing. In October 2012, SAP announced a platform as a service called the SAP Cloud Platform. In May 2013, a managed private cloud called the S/4HANA Enterprise Cloud service was announced. SAP Converged Cloud was announced in January 2015. SAP Converged Cloud was originally developed as SAP's internal standardized Infrastructure as a Service (IaaS) offering to support SAP’s cloud solutions. Originating from SAP Converged Cloud, SAP Cloud Infrastructure was developed and announced as SAP’s cloud computing offering that is provided for both SAP and customer workloads. In 2025, it had a global footprint of 15 regions and 29 data centers, encompassing more than 200,000 active VMs and over 6,000 hypervisors. In September 2025, SAP announced an expansion of its European “SAP Sovereign Cloud” portfolio, explicitly naming SAP Cloud Infrastructure (alongside SAP Sovereign Cloud On-Site) as part of the stack positioned for public sector and regulated environments. == Services and Features == SAP Cloud Infrastructure (SCI) is an infrastructure-as-a-service (IaaS) offering by SAP that provides virtual compute, storage, and networking services, together with identity, key management, and operational services. SCI follows a self-service model and is managed via APIs and a web-based user interface. === Compute === SCI provides virtual machine instances that can be provisioned from operating system images and selected in predefined sizes (“flavors”). It supports lifecycle operations such as create/modify/resize/delete, power control, and snapshots; instances can be organized into server groups to influence placement policies. === Storage === SCI provides persistent storage services including: Block storage (virtual volumes) with attach/detach to instances, online expansion, cloning, snapshots, and provisioning volumes from images or snapshots. Object storage (containers and objects) managed via API/CLI with access control lists (ACLs) and configurable redundancy options. File storage (shared file systems) with access controls, online resize, snapshots/restore, and replication across availability zones. === Networking === SCI provides software-defined networking (SDN) for tenant networks (networks, subnets, routers) and connectivity features such as floating IPs for public reachability. Network security controls include security groups and firewall policies; connectivity options include BGP-based VPN networking. === Load balancing and DNS === SCI includes managed load balancing for distributing traffic across backend instances and an authoritative DNS service (DNSaaS) with API-based management of DNS zones and records, including options for zone sharing/transfer across projects/tenants and service integrations for automated record creation. === Identity, access, and key management === SCI includes identity and access management for authentication/authorization in projects/tenants (for example token handling, role assignment, and credential management) and key/secrets management for storing and controlling access to secret material such as keys and certificates, including support for different backends (depending on configuration). === Cloud-native services === SCI includes a container image registry (image push/pull, access policies, and lifecycle controls) and an auto-scaling capability for file shares based on configurable rules. === Observability and audit === SCI includes metrics and audit logging capabilities for operational monitoring and for listing/filtering audit-relevant events across services. === Availability and service levels === SCI documentation describes availability-related features such as load balancing, storage redundancy options, and replication for file shares across availability zones. SAP cloud services are governed by contractual service-level agreements (SLA); SAP Cloud Infrastructure references an SLA supplement defining infrastructure-specific terms when referenced in order forms. === SAP cloud services === SAP cloud services can run on different underlying infrastructures, including SAP Cloud Infrastructure in addition to SAP NS2 or hyperscalers. SAP cloud solutions available on SAP Cloud Infrastructure include SAP Cloud ERP, SAP HCM, SAP Solutions for Spend Management, Supply Chain Management, Business Transformation Management, and SAP Business Technology Platform (including related analytics and business data solutions). For example, SAP HANA Cloud documentation lists SAP Cloud Infrastructure as one of the supported infrastructures alongside hyperscalers. === Sustainability === SAP describes sustainability initiatives for its data centers, including energy-efficient infrastructure (for example, advanced cooling systems and power management), renewable electricity usage where feasible, and operational practices such as recycling electronic waste and minimizing water usage. SAP also references environmental management and energy management standards such as ISO 14001 and ISO 50001 for its data center operations. SAP-owned data centers run with 100% renewable electricity and that renewable electricity has been used since 2014 to power SAP facilities including owned data centers and co-locations. == SAP Cloud Infrastructure for SAP Sovereign Cloud == SAP Sovereign Cloud is a portfolio of SAP solutions designed to help organizations adopt SAP cloud solutions such as the SAP Cloud ERP while maintaining control over data, infrastructure, and compliance in line with local laws and regulations. The portfolio offers multiple deployment options, including SAP Cloud Infrastructure and SAP Sovereign Cloud On-Site, alongside sovereign hyperscaler-based options such as via SAP NS2, and targets customers such as public-sector bodies and other highly regulated organizations. In Europe, SAP Cloud Infrastructure is an Infrastructure-as-a-Service (IaaS) deployment option within SAP Sovereign Cloud for SAP and customer / third party workloads, operated on SAP’s data center network and developed using open-source technologies, with customer data stored within the European Union. Sovereignty-related characteristics for the SAP Cloud Infrastructure include: EU footprint and ownership model: SAP-operated data centers in Germany include sites in St. Leon-Rot and Walldorf, and co-location sites in Frankfurt. EU AI Cloud: EU AI Cloud is a sovereign AI offering for Europe that provides secure, compliant environments for building and running AI, including governed access to auditable large language models from SAP and partners. It offers AI models on the SAP Cloud Infrastructure and SAP Business Technology Platform (SAP BTP), enabling deployment of AI applications and models on high-performance European infrastructure (including accelerator/GPU-based compute for AI workloads). Availability zones and secure interconnect: Three availability zones in three independent data centers in Germany, connected via SAP-owned fiber on SAP-owned property. Facility and security standards: ISO/IEC 27001 governance of delivery and operations of SAP cloud services and SAP-owned data centers. Additional facility and availability standards: EN 50600 availability class 3 (European data centre standard) and/or ISO/IEC 22237 availability class 3 (international equivalent). Technology foundation: Based on open-source cloud infrastructure framework (OpenStack) and Kubernetes, without dependencies on hyperscaler technologies. Sovereignty controls: Data sovereignty (data residency), operational sovereignty (administration and maintenance restricted to approved, security-cleared personnel), technical sovereignty (locally hosted control planes with separation via encryption or dedicated infrastructure), and legal sovereignty (use of locally based legal entities or those in approved countries). Classified information processing: Roadmap to meet high and very high requirements for handling classified or sensitive information under European regulatory and security regimes. Public-sector readiness and EU sovereignty assurance levels: Implemented to meet SEAL-3 (Digital Resilience) and SEAL-4 (Full Digital Sovereignty) of the European Commission’s Cloud Sovereignty Framework. Staffing constraints: Operations model selectable to restrict sensitive operations to vetted personnel from EU or NATO countries.
Chromosome (evolutionary algorithm)
A chromosome or genotype in evolutionary algorithms (EA) is a set of parameters which define a proposed solution of the problem that the evolutionary algorithm is trying to solve. The set of all solutions, also called individuals according to the biological model, is known as the population. The genome of an individual consists of one, more rarely of several, chromosomes and corresponds to the genetic representation of the task to be solved. A chromosome is composed of a set of genes, where a gene consists of one or more semantically connected parameters, which are often also called decision variables. They determine one or more phenotypic characteristics of the individual or at least have an influence on them. In the basic form of genetic algorithms, the chromosome is represented as a binary string, while in later variants and in EAs in general, a wide variety of other data structures are used. == Chromosome design == When creating the genetic representation of a task, it is determined which decision variables and other degrees of freedom of the task should be improved by the EA and possible additional heuristics and how the genotype-phenotype mapping should look like. The design of a chromosome translates these considerations into concrete data structures for which an EA then has to be selected, configured, extended, or, in the worst case, created. Finding a suitable representation of the problem domain for a chromosome is an important consideration, as a good representation will make the search easier by limiting the search space; similarly, a poorer representation will allow a larger search space. In this context, suitable mutation and crossover operators must also be found or newly defined to fit the chosen chromosome design. An important requirement for these operators is that they not only allow all points in the search space to be reached in principle, but also make this as easy as possible. The following requirements must be met by a well-suited chromosome: It must allow the accessibility of all admissible points in the search space. Design of the chromosome in such a way that it covers only the search space and no additional areas. so that there is no redundancy or only as little redundancy as possible. Observance of strong causality: small changes in the chromosome should only lead to small changes in the phenotype. This is also called locality of the relationship between search and problem space. Designing the chromosome in such a way that it excludes prohibited regions in the search space completely or as much as possible. While the first requirement is indispensable, depending on the application and the EA used, one usually only has to be satisfied with fulfilling the remaining requirements as far as possible. The evolutionary search is supported and possibly considerably accelerated by a fulfillment as complete as possible. == Examples of chromosomes == === Chromosomes for binary codings === In their classical form, GAs use bit strings and map the decision variables to be optimized onto them. An example for one Boolean and three integer decision variables with the value ranges 0 ≤ D 1 ≤ 60 {\displaystyle 0\leq D_{1}\leq 60} , 28 ≤ D 2 ≤ 30 {\displaystyle 28\leq D_{2}\leq 30} and − 12 ≤ D 3 ≤ 14 {\displaystyle -12\leq D_{3}\leq 14} may illustrate this: Note that the negative number here is given in two's complement. This straight forward representation uses five bits to represent the three values of D 2 {\displaystyle D_{2}} , although two bits would suffice. This is a significant redundancy. An improved alternative, where 28 is to be added for the genotype-phenotype mapping, could look like this: with D 2 = 28 + D 2 ′ = 29 {\displaystyle D_{2}=28+D'_{2}=29} . === Chromosomes with real-valued or integer genes === For the processing of tasks with real-valued or mixed-integer decision variables, EAs such as the evolution strategy or the real-coded GAs are suited. In the case of mixed-integer values, rounding is often used, but this represents some violation of the redundancy requirement. If the necessary precisions of the real values can be reasonably narrowed down, this violation can be remedied by using integer-coded GAs. For this purpose, the valid digits of real values are mapped to integers by multiplication with a suitable factor. For example, 12.380 becomes the integer 12380 by multiplying by 1000. This must of course be taken into account in genotype-phenotype mapping for evaluation and result presentation. A common form is a chromosome consisting of a list or an array of integer or real values. === Chromosomes for permutations === Combinatorial problems are mainly concerned with finding an optimal sequence of a set of elementary items. As an example, consider the problem of the traveling salesman who wants to visit a given number of cities exactly once on the shortest possible tour. The simplest and most obvious mapping onto a chromosome is to number the cities consecutively, to interpret a resulting sequence as permutation and to store it directly in a chromosome, where one gene corresponds to the ordinal number of a city. Then, however, the variation operators may only change the gene order and not remove or duplicate any genes. The chromosome thus contains the path of a possible tour to the cities. As an example the sequence 3 , 5 , 7 , 1 , 4 , 2 , 9 , 6 , 8 {\displaystyle 3,5,7,1,4,2,9,6,8} of nine cities may serve, to which the following chromosome corresponds: In addition to this encoding frequently called path representation, there are several other ways of representing a permutation, for example the ordinal representation or the matrix representation. === Chromosomes for co-evolution === When a genetic representation contains, in addition to the decision variables, additional information that influences evolution and/or the mapping of the genotype to the phenotype and is itself subject to evolution, this is referred to as co-evolution. A typical example is the evolution strategy (ES), which includes one or more mutation step sizes as strategy parameters in each chromosome. Another example is an additional gene to control a selection heuristic for resource allocation in a scheduling tasks. This approach is based on the assumption that good solutions are based on an appropriate selection of strategy parameters or on control gene(s) that influences genotype-phenotype mapping. The success of the ES gives evidence to this assumption. === Chromosomes for complex representations === The chromosomes presented above are well suited for processing tasks of continuous, mixed-integer, pure-integer or combinatorial optimization. For a combination of these optimization areas, on the other hand, it becomes increasingly difficult to map them to simple strings of values, depending on the task. The following extension of the gene concept is proposed by the EA GLEAM (General Learning Evolutionary Algorithm and Method) for this purpose: A gene is considered to be the description of an element or elementary trait of the phenotype, which may have multiple parameters. For this purpose, gene types are defined that contain as many parameters of the appropriate data type as are required to describe the particular element of the phenotype. A chromosome now consists of genes as data objects of the gene types, whereby, depending on the application, each gene type occurs exactly once as a gene or can be contained in the chromosome any number of times. The latter leads to chromosomes of dynamic length, as they are required for some problems. The gene type definitions also contain information on the permissible value ranges of the gene parameters, which are observed during chromosome generation and by corresponding mutations, so they cannot lead to lethal mutations. For tasks with a combinatorial part, there are suitable genetic operators that can move or reposition genes as a whole, i.e. with their parameters. A scheduling task is used as an illustration, in which workflows are to be scheduled that require different numbers of heterogeneous resources. A workflow specifies which work steps can be processed in parallel and which have to be executed one after the other. In this context, heterogeneous resources mean different processing times at different costs in addition to different processing capabilities. Each scheduling operation therefore requires one or more parameters that determine the resource selection, where the value ranges of the parameters depend on the number of alternative resources available for each work step. A suitable chromosome provides one gene type per work step and in this case one corresponding gene, which has one parameter for each required resource. The order of genes determines the order of scheduling operations and, therefore, the precedence in case of allocation conflicts. The exemplary gene type definition of work step 15 with two resources, for which there are four and seven alternatives respectively
Silhouette (clustering)
Silhouette is a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. It was proposed by Belgian statistician Peter Rousseeuw in 1987. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette value ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. A clustering with an average silhouette width of over 0.7 is considered to be "strong", a value over 0.5 "reasonable", and over 0.25 "weak". However, with an increasing dimensionality of the data, it becomes difficult to achieve such high values because of the curse of dimensionality, as the distances become more similar. The silhouette score is specialized for measuring cluster quality when the clusters are convex-shaped, and may not perform well if the data clusters have irregular shapes or are of varying sizes. The silhouette value can be calculated with any distance metric, such as Euclidean distance or Manhattan distance. == Definition == Assume the data have been clustered via any technique, such as k-medoids or k-means, into k {\displaystyle k} clusters. For data point i ∈ C i {\displaystyle i\in C_{i}} (data point i {\displaystyle i} in the cluster C i {\displaystyle C_{i}} ), calculate a ( i ) {\displaystyle a(i)} , the average distance that i {\displaystyle i} is from all other points in that cluster: a ( i ) = 1 | C i | − 1 ∑ j ∈ C i , i ≠ j d ( i , j ) {\displaystyle a(i)={\frac {1}{|C_{i}|-1}}\sum _{j\in C_{i},i\neq j}d(i,j)} where | C i | {\displaystyle |C_{i}|} is the number of points belonging to cluster C i {\displaystyle C_{i}} , and d ( i , j ) {\displaystyle d(i,j)} is the distance between data points i {\displaystyle i} and j {\displaystyle j} in the cluster C i {\displaystyle C_{i}} (we divide by | C i | − 1 {\displaystyle |C_{i}|-1} because the distance d ( i , i ) {\displaystyle d(i,i)} is not included in the sum). a ( i ) {\displaystyle a(i)} can be interpreted as a measure of how well i {\displaystyle i} is assigned to its cluster (the smaller the value, the better the assignment). We then define the mean dissimilarity of point i {\displaystyle i} to some cluster C j {\displaystyle C_{j}} as the mean of the distance from i {\displaystyle i} to all points in C j {\displaystyle C_{j}} (where C j ≠ C i {\displaystyle C_{j}\neq C_{i}} ). For each data point i ∈ C i {\displaystyle i\in C_{i}} , we now define b ( i ) {\displaystyle b(i)} as the average distance between i {\displaystyle i} and the points in the closest cluster (hence: "min") that i {\displaystyle i} does not belong to: b ( i ) = min j ≠ i 1 | C j | ∑ l ∈ C j d ( i , l ) {\displaystyle b(i)=\min _{j\neq i}{\frac {1}{|C_{j}|}}\sum _{l\in C_{j}}d(i,l)} The cluster with the smallest mean dissimilarity is said to be the "neighboring cluster" of i {\displaystyle i} because it is the next best fit cluster for point i {\displaystyle i} . We now define a silhouette (value) of one data point i {\displaystyle i} s ( i ) = b ( i ) − a ( i ) max { a ( i ) , b ( i ) } {\displaystyle s(i)={\frac {b(i)-a(i)}{\max\{a(i),b(i)\}}}} , if | C i | > 1 {\displaystyle |C_{i}|>1} and s ( i ) = 0 {\displaystyle s(i)=0} , if | C i | = 1 {\displaystyle |C_{i}|=1} , which can also be written as s ( i ) = { 1 − a ( i ) b ( i ) , if a ( i ) < b ( i ) 0 , if a ( i ) = b ( i ) b ( i ) a ( i ) − 1 , if a ( i ) > b ( i ) {\displaystyle s(i)={\begin{cases}1-{\frac {a(i)}{b(i)}},&{\mbox{ if }}a(i)b(i)\\\end{cases}}} From the above definition, s ( i ) {\displaystyle s(i)} is bounded to the interval [ − 1 , 1 ] {\displaystyle [-1,1]} , i.e. − 1 ≤ s ( i ) ≤ 1. {\displaystyle -1\leq s(i)\leq 1.} Note that a ( i ) {\displaystyle a(i)} is not clearly defined for clusters with size = 1, in which case we set s ( i ) = 0 {\displaystyle s(i)=0} . This choice is arbitrary, but neutral in the sense that it is at the midpoint of the bounds, -1 and 1. For s ( i ) {\displaystyle s(i)} to be close to 1 we require a ( i ) ≪ b ( i ) {\displaystyle a(i)\ll b(i)} . As a ( i ) {\displaystyle a(i)} is a measure of how dissimilar i {\displaystyle i} is to its own cluster, a small value means it is well matched. Furthermore, a large b ( i ) {\displaystyle b(i)} implies that i {\displaystyle i} is badly matched to its neighbouring cluster. Thus an s ( i ) {\displaystyle s(i)} close to 1 means that the data is appropriately clustered. If s ( i ) {\displaystyle s(i)} is close to -1, then by the same logic we see that i {\displaystyle i} would be more appropriate if it was clustered in its neighbouring cluster. An s ( i ) {\displaystyle s(i)} near zero means that the datum is on the border of two natural clusters. The mean s ( i ) {\displaystyle s(i)} over all points of a cluster is a measure of how tightly grouped all the points in the cluster are. Thus the mean s ( i ) {\displaystyle s(i)} over all data of the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of k {\displaystyle k} is used in the clustering algorithm (e.g., k-means), some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and means may be used to determine the natural number of clusters within a dataset. One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster specific. Kaufman et al. introduced the term silhouette coefficient for the maximum value of the mean s ( i ) {\displaystyle s(i)} over all data of the entire dataset, i.e., S C = max k s ~ ( k ) , {\displaystyle SC=\max _{k}{\tilde {s}}\left(k\right),} where s ~ ( k ) {\displaystyle {\tilde {s}}\left(k\right)} represents the mean s ( i ) {\displaystyle s(i)} over all data of the entire dataset for a specific number of clusters k {\displaystyle k} . The silhouette coefficient describes the best possible clustering possible for a given number of clusters, as measured by the highest average silhouette score for all points in the dataset. == Simplified and medoid silhouette == Computing the silhouette coefficient needs all O ( N 2 ) {\displaystyle {\mathcal {O}}(N^{2})} pairwise distances, making this evaluation much more costly than clustering with k-means. For a clustering with centers μ C I {\displaystyle \mu _{C_{I}}} for each cluster C I {\displaystyle C_{I}} , we can use the following simplified Silhouette for each point i ∈ C I {\displaystyle i\in C_{I}} instead, which can be computed using only O ( N k ) {\displaystyle {\mathcal {O}}(Nk)} distances: a ′ ( i ) = d ( i , μ C I ) {\displaystyle a'(i)=d(i,\mu _{C_{I}})} and b ′ ( i ) = min C J ≠ C I d ( i , μ C J ) {\displaystyle b'(i)=\min _{C_{J}\neq C_{I}}d(i,\mu _{C_{J}})} , which has the additional benefit that a ′ ( i ) {\displaystyle a'(i)} is always defined, then define accordingly the simplified silhouette and simplified silhouette coefficient s ′ ( i ) = b ′ ( i ) − a ′ ( i ) max { a ′ ( i ) , b ′ ( i ) } {\displaystyle s'(i)={\frac {b'(i)-a'(i)}{\max\{a'(i),b'(i)\}}}} S C ′ = max k 1 N ∑ i s ′ ( i ) {\displaystyle SC'=\max _{k}{\frac {1}{N}}\sum _{i}s'\left(i\right)} . If the cluster centers are medoids (as in k-medoids clustering) instead of arithmetic means (as in k-means clustering), this is also called the medoid-based silhouette or medoid silhouette. If every object is assigned to the nearest medoid (as in k-medoids clustering), we know that a ′ ( i ) ≤ b ′ ( i ) {\displaystyle a'(i)\leq b'(i)} , and hence s ′ ( i ) = b ′ ( i ) − a ′ ( i ) b ′ ( i ) = 1 − a ′ ( i ) b ′ ( i ) {\displaystyle s'(i)={\frac {b'(i)-a'(i)}{b'(i)}}=1-{\frac {a'(i)}{b'(i)}}} . == Silhouette clustering == Instead of using the average silhouette to evaluate a clustering obtained from, e.g., k-medoids or k-means, we can try to directly find a solution that maximizes the Silhouette. We do not have a closed form solution to maximize this, but it will usually be best to assign points to the nearest cluster as done by these methods. Van der Laan et al. proposed to adapt the standard algorithm for k-medoids, PAM, for this purpose and call this algorithm PAMSIL: Choose initial medoids by using PAM Compute the average silhouette of this initial solution For each pair of a medoid m and a non-medoid x swap m and x compute the average silhouette of the resulting solution remember the best swap un-swap m and x for the next iteration Perform the best swap and return to
Weka (software)
Waikato Environment for Knowledge Analysis (Weka) is a collection of machine learning and data analysis free software licensed under the GNU General Public License. It was developed at the University of Waikato, New Zealand, and is the companion software to the book "Data Mining: Practical Machine Learning Tools and Techniques". == Description == Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to these functions. The original non-Java version of Weka was a Tcl/Tk front-end to (mostly third-party) modeling algorithms implemented in other programming languages, plus data preprocessing utilities in C, and a makefile-based system for running machine learning experiments. This original version was primarily designed as a tool for analyzing data from agricultural domains, but the more recent fully Java-based version (Weka 3), for which development started in 1997, is now used in many different application areas, in particular for educational purposes and research. Advantages of Weka include: Free availability under the GNU General Public License. Portability, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform. A comprehensive collection of data preprocessing and modeling techniques. Ease of use due to its graphical user interfaces. Weka supports several standard data mining tasks, more specifically, data preprocessing, clustering, classification, regression, visualization, and feature selection. Input to Weka is expected to be formatted according the Attribute-Relational File Format and with the filename bearing the .arff extension. All of Weka's techniques are predicated on the assumption that the data is available as one flat file or relation, where each data point is described by a fixed number of attributes (normally, numeric or nominal attributes, but some other attribute types are also supported). Weka provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query. Weka provides access to deep learning with Deeplearning4j. It is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table that is suitable for processing using Weka. Another important area that is currently not covered by the algorithms included in the Weka distribution is sequence modeling. == Extension packages == In version 3.7.2, a package manager was added to allow the easier installation of extension packages. Some functionality that used to be included with Weka prior to this version has since been moved into such extension packages, but this change also makes it easier for others to contribute extensions to Weka and to maintain the software, as this modular architecture allows independent updates of the Weka core and individual extensions. == History == In 1993, the University of Waikato in New Zealand began development of the original version of Weka, which became a mix of Tcl/Tk, C, and makefiles. In 1997, the decision was made to redevelop Weka from scratch in Java, including implementations of modeling algorithms. In 2005, Weka received the SIGKDD Data Mining and Knowledge Discovery Service Award. In 2006, Pentaho Corporation acquired an exclusive licence to use Weka for business intelligence. It forms the data mining and predictive analytics component of the Pentaho business intelligence suite. Pentaho has since been acquired by Hitachi Vantara, and Weka now underpins the PMI (Plugin for Machine Intelligence) open source component. == Related tools == Auto-WEKA is an automated machine learning system for Weka. Environment for DeveLoping KDD-Applications Supported by Index-Structures (ELKI) is a similar project to Weka with a focus on cluster analysis, i.e., unsupervised methods. H2O.ai is an open-source data science and machine learning platform KNIME is a machine learning and data mining software implemented in Java. Massive Online Analysis (MOA) is an open-source project for large scale mining of data streams, also developed at the University of Waikato in New Zealand. Neural Designer is a data mining software based on deep learning techniques written in C++. Orange is a similar open-source project for data mining, machine learning and visualization based on scikit-learn. RapidMiner is a commercial machine learning framework implemented in Java which integrates Weka. scikit-learn is a popular machine learning library in Python.
Spleak
Spleak was an IM platform where users could publish and rate content. It existed in the form of six bots covering as many subject areas: CelebSpleak, SportSpleak, VoteSpleak, TVSpleak, GameSpleak, and StyleSpleak. == Overview == Users can add a "multi-Spleak" (which contains all of the different Spleak bots in one) or add the separate bots to their IM buddy lists on MSN and AIM. Users are also allowed access to Spleak online by using a CelebSpleak, SportSpleak, or VoteSpleak widget, or through the CelebSpleak and SportSpleak applications with Facebook. Spleak was an alternate reality game and is moving to its own company, Spleak Media Network. "Celebrate Spleak" was introduced throughout 2007, launched in 2008, and was forced to retire in 2009. == Key people == Spleak was co-founded by Morten Lund and Nicolaj Reffstrup. The company's chief executive officer is Morrie Eisenburg; Josh Scott is Vice President in Product and Tyler Wells is Vice President in Engineering.
ImageNets
ImageNets is an open source framework for rapid prototyping of machine vision algorithms, developed by the Institute of Automation. == Description == ImageNets is an open source and platform independent (Windows & Linux) framework for rapid prototyping of machine vision algorithms. With the GUI ImageNet Designer, no programming knowledge is required to perform operations on images. A configured ImageNet can be loaded and executed from C++ code without the need for loading the ImageNet Designer GUI to achieve higher execution performance. == History == ImageNets was developed by the Institute of Automation, University of Bremen, Germany. The software was first publicly released in 2010. Originally, ImageNets was developed for the Care-Providing Robot FRIEND but it can be used for a wide range of computer vision applications.