Correlation clustering

Correlation clustering

Clustering is the problem of partitioning data points into groups based on similarity or dissimilarity. Correlation clustering is a clustering framework in which a set of objects is partitioned into clusters based on pairwise similarity and dissimilarity information, without requiring the number of clusters to be specified in advance. == Description of the problem == In machine learning, correlation clustering (also known as cluster editing) considers settings in which pairwise similarity or dissimilarity relationships between objects are known. A standard formulation models the input as an unweighted complete graph G = ( V , E ) {\displaystyle G=(V,E)} , where each edge is labeled either + {\displaystyle +} or − {\displaystyle -} (that is, the graph is a signed graph), indicating whether the corresponding endpoints are similar or dissimilar. The goal is to find a clustering (that is, a partition of V {\displaystyle V} ) that either maximizes the number of agreements—the sum of positive edges whose endpoints lie in the same cluster and negative edges whose endpoints lie in different clusters—or minimizes the number of disagreements—the sum of positive edges whose endpoints are separated and negative edges whose endpoints lie in the same cluster. Unlike other clustering methods such as k-means, correlation clustering does not require choosing the number of clusters k {\displaystyle k} in advance. It is not always possible to find a clustering with zero disagreements. For example, consider a triangle graph containing two positive edges and one negative edge. In this case, every clustering incurs at least one disagreement. Such configurations are referred to in the literature as bad triangles. From a computational perspective, optimizing the correlation clustering objective is challenging. The (decision version of the) problem is NP-complete. A large body of subsequent work has developed approximation algorithms for correlation clustering under various assumptions, including complete or general graphs and unweighted or weighted graphs, for both minimization and maximization objectives. This problem is considered one of the fundamental combinatorial optimization problems, and many algorithmic techniques have been developed to address it. The problem has also been studied extensively across multiple disciplines. A comprehensive literature review of early correlation clustering research is provided by Wahid and Hassini. == Formal Definitions == Let G = ( V , E ) {\displaystyle G=(V,E)} be a graph with nodes V {\displaystyle V} and edges E {\displaystyle E} . A clustering of G {\displaystyle G} is a partition of its node set Π = { π 1 , … , π k } {\displaystyle \Pi =\{\pi _{1},\dots ,\pi _{k}\}} with V = π 1 ∪ ⋯ ∪ π k {\displaystyle V=\pi _{1}\cup \dots \cup \pi _{k}} and π i ∩ π j = ∅ {\displaystyle \pi _{i}\cap \pi _{j}=\emptyset } for i ≠ j {\displaystyle i\neq j} . For a given clustering Π {\displaystyle \Pi } , let δ ( Π ) = { { u , v } ∈ E ∣ { u , v } ⊈ π ∀ π ∈ Π } {\displaystyle \delta (\Pi )=\{\{u,v\}\in E\mid \{u,v\}\not \subseteq \pi \;\forall \pi \in \Pi \}} denote the subset of edges of G {\displaystyle G} whose endpoints are in different subsets of the clustering Π {\displaystyle \Pi } . Now, let w : E → R ≥ 0 {\displaystyle w\colon E\to \mathbb {R} _{\geq 0}} be a function that assigns a non-negative weight to each edge of the graph and let E = E + ∪ E − {\displaystyle E=E^{+}\cup E^{-}} be a partition of the edges into attractive ( E + {\displaystyle E^{+}} ) and repulsive ( E − {\displaystyle E^{-}} ) edges; that is, the edges are signed. The minimum disagreement correlation clustering problem is the following optimization problem: minimize Π ∑ e ∈ E + ∩ δ ( Π ) w e + ∑ e ∈ E − ∖ δ ( Π ) w e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {minimize} }}&&\sum _{e\in E^{+}\cap \delta (\Pi )}w_{e}+\sum _{e\in E^{-}\setminus \delta (\Pi )}w_{e}\;.\end{aligned}}} Here, the set E + ∩ δ ( Π ) {\displaystyle E^{+}\cap \delta (\Pi )} contains the attractive edges whose endpoints are in different components with respect to the clustering Π {\displaystyle \Pi } and the set E − ∖ δ ( Π ) {\displaystyle E^{-}\setminus \delta (\Pi )} contains the repulsive edges whose endpoints are in the same component with respect to the clustering Π {\displaystyle \Pi } . Together these two sets contain all edges that disagree with the clustering Π {\displaystyle \Pi } . Similarly to the minimum disagreement correlation clustering problem, the maximum agreement correlation clustering problem is defined as maximize Π ∑ e ∈ E + ∖ δ ( Π ) w e + ∑ e ∈ E − ∩ δ ( Π ) w e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {maximize} }}&&\sum _{e\in E^{+}\setminus \delta (\Pi )}w_{e}+\sum _{e\in E^{-}\cap \delta (\Pi )}w_{e}\;.\end{aligned}}} Here, the set E + ∖ δ ( Π ) {\displaystyle E^{+}\setminus \delta (\Pi )} contains the attractive edges whose endpoints are in the same component with respect to the clustering Π {\displaystyle \Pi } and the set E − ∩ δ ( Π ) {\displaystyle E^{-}\cap \delta (\Pi )} contains the repulsive edges whose endpoints are in different components with respect to the clustering Π {\displaystyle \Pi } . Together these two sets contain all edges that agree with the clustering Π {\displaystyle \Pi } . Instead of formulating the correlation clustering problem in terms of non-negative edge weights and a partition of the edges into attractive and repulsive edges the problem is also formulated in terms of positive and negative edge costs without partitioning the set of edges explicitly. For given weights w : E → R ≥ 0 {\displaystyle w\colon E\to \mathbb {R} _{\geq 0}} and a given partition E = E + ∪ E − {\displaystyle E=E^{+}\cup E^{-}} of the edges into attractive and repulsive edges, the edge costs can be defined by c e = { w e if e ∈ E + − w e if e ∈ E − {\displaystyle {\begin{aligned}c_{e}={\begin{cases}\;\;w_{e}&{\text{if }}e\in E^{+}\\-w_{e}&{\text{if }}e\in E^{-}\end{cases}}\end{aligned}}} for all e ∈ E {\displaystyle e\in E} . An edge whose endpoints are in different clusters is said to be cut. The set δ ( Π ) {\displaystyle \delta (\Pi )} of all edges that are cut is often called a multicut of G {\displaystyle G} . The minimum cost multicut problem is the problem of finding a clustering Π {\displaystyle \Pi } of G {\displaystyle G} such that the sum of the costs of the edges whose endpoints are in different clusters is minimal: minimize Π ∑ e ∈ δ ( Π ) c e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {minimize} }}&&\sum _{e\in \delta (\Pi )}c_{e}\;.\end{aligned}}} Similar to the minimum cost multicut problem, coalition structure generation in weighted graph games is the problem of finding a clustering such that the sum of the costs of the edges that are not cut is maximal: maximize Π ∑ e ∈ E ∖ δ ( Π ) c e . {\displaystyle {\begin{aligned}&{\underset {\Pi }{\operatorname {maximize} }}&&\sum _{e\in E\setminus \delta (\Pi )}c_{e}\;.\end{aligned}}} This formulation is also known as the clique partitioning problem. It can be shown that all four problems that are formulated above are equivalent. This means that a clustering that is optimal with respect to any of the four objectives is optimal for all of the four objectives. == Algorithms == If the graph admits a clustering with zero disagreements, then deleting all negative edges and computing the connected components of the remaining graph yields an optimal clustering. A necessary and sufficient condition for the existence of such a clustering was given by Davis: no cycle in the graph may contain exactly one negative edge. Bansal et al. discuss the NP-completeness proof and also present both a constant factor approximation algorithm and polynomial-time approximation scheme to find the clusters in this setting. Ailon et al. propose a randomized 3-approximation algorithm for the same problem. CC-Pivot(G=(V,E+,E−)) Pick random pivot i ∈ V Set C = { i } {\displaystyle C=\{i\}} , V'=Ø For all j ∈ V, j ≠ i; If (i,j) ∈ E+ then Add j to C Else (If (i,j) ∈ E−) Add j to V' Let G' be the subgraph induced by V' Return clustering C,CC-Pivot(G') The authors show that the above algorithm is a 3-approximation algorithm for correlation clustering. The best polynomial-time approximation algorithm known at the moment for this problem achieves a ~2.06 approximation by rounding a linear program, as shown by Chawla, Makarychev, Schramm, and Yaroslavtsev. Karpinski and Schudy proved existence of a polynomial time approximation scheme (PTAS) for that problem on complete graphs and fixed number of clusters. == Optimal number of clusters == In 2011, it was shown by Bagon and Galun that the optimization of the correlation clustering functional is closely related to well known discrete optimization methods. In their work they proposed a probabilistic analysis of the underlying implicit model that allows the correlation clustering functional to estimate the

Cloudlet

A cloudlet is a mobility-enhanced small-scale cloud datacenter that is located at the edge of the Internet. The main purpose of the cloudlet is supporting resource-intensive and interactive mobile applications by providing powerful computing resources to mobile devices with lower latency. It is a new architectural element that extends today's cloud computing infrastructure. It represents the middle tier of a 3-tier hierarchy: mobile device - cloudlet - cloud. A cloudlet can be viewed as a data center in a box whose goal is to bring the cloud closer. The cloudlet term was first coined by M. Satyanarayanan, Victor Bahl, Ramón Cáceres, and Nigel Davies, and a prototype implementation is developed by Carnegie Mellon University as a research project. The concept of cloudlet is also known as follow me cloud, and mobile micro-cloud. == Motivation == Many mobile services split the application into a front-end client program and a back-end server program following the traditional client-server model. The front-end mobile application offloads its functionality to the back-end servers for various reasons such as speeding up processing. With the advent of cloud computing, the back-end server is typically hosted at the cloud datacenter. Though the use of a cloud datacenter offers various benefits such as scalability and elasticity, its consolidation and centralization lead to a large separation between a mobile device and its associated datacenter. End-to-end communication then involves many network hops and results in high latencies and low bandwidth. For the reasons of latency, some emerging mobile applications require cloud offload infrastructure to be close to the mobile device to achieve low response time. In the ideal case, it is just one wireless hop away. For example, the offload infrastructure could be located in a cellular base station or it could be LAN-connected to a set of Wi-Fi base stations. The individual elements of this offload infrastructure are referred to as cloudlets. == Applications == Cloudlets aim to support mobile applications that are both resource-intensive and interactive. Augmented reality applications that use head-tracked systems require end-to-end latencies of less than 16 ms. Cloud games with remote rendering also require low latencies and high bandwidth. Wearable cognitive assistance systems combine devices such as Google Glass with cloud-based processing to guide users through complex tasks. This futuristic genre of applications is characterized as “astonishingly transformative” by the report of the 2013 NSF Workshop on Future Directions in Wireless Networking. These applications use cloud resources in the critical path of real-time user interaction. Consequently, they cannot tolerate end-to-end operation latencies of more than a few tens of milliseconds. Apple Siri and Google Now which perform compute-intensive speech recognition in the cloud, are further examples in this emerging space. == Cloudlet vs Cloud == There is significant overlap in the requirements for cloud and cloudlet. At both levels, there is the need for: (a) strong isolation between untrusted user-level computations; (b) mechanisms for authentication, access control, and metering; (c) dynamic resource allocation for user-level computations; and, (d) the ability to support a very wide range of user-level computations, with minimal restrictions on their process structure, programming languages or operating systems. At a cloud datacenter, these requirements are met today using the virtual machine (VM) abstraction. For the same reasons they are used in cloud computing today, VMs are used as an abstraction for cloudlets. Meanwhile, there are a few but important differentiators between cloud and cloudlet. === Rapid provisioning === Different from cloud data centers that are optimized for launching existing VM images in their storage tier, cloudlets need to be much more agile in their provisioning. Their association with mobile devices is highly dynamic, with considerable churn due to user mobility. A user from far away may unexpectedly show up at a cloudlet (e.g., if he just got off an international flight) and try to use it for an application such as a personalized language translator. For that user, the provisioning delay before he is able to use the application impacts usability. === VM handoff across cloudlets === If a mobile device user moves away from the cloudlet he is currently using, the interactive response will degrade as the logical network distance increases. To address this effect of user mobility, the offloaded services on the first cloudlet need to be transferred to the second cloudlet maintaining end-to-end network quality. This resembles live migration in cloud computing but differs considerably in a sense that the VM handoff happens in Wide Area Network (WAN). == OpenStack++ == Since the cloudlet model requires reconfiguration or additional deployment of hardware/software, it is important to provide a systematic way to incentivise the deployment. However, it can face a classic bootstrapping problem. Cloudlets need practical applications to incentivize cloudlet deployment. However, developers cannot heavily rely on cloudlet infrastructure until it is widely deployed. To break this deadlock and bootstrap the cloudlet deployment, researchers at Carnegie Mellon University proposed OpenStack++ that extends OpenStack to leverage its open ecosystem. OpenStack++ provides a set of cloudlet-specific APIs as OpenStack extensions. == Commercial implementations and standardization effort == By 2015 cloudlet based applications were commercially available. In 2017 the National Institute of Standards and Technology published draft standards for fog computing in which cloudlets were defined as nodes on the fog architecture.

Lemmy (social network)

Lemmy is free and open-source, social news aggregation software for running self-hosted discussion forums. These hosts, known as "instances", communicate with each other using the ActivityPub protocol. == History == Lemmy was created by the user Dessalines on GitHub in February 2019 and licensed under the Affero General Public License. In a 2020 post, Lemmy's co-creator Dessalines wrote about the origin of the name Lemmy. "It was nameless for a long time, but I wanted to keep with the fediverse tradition of naming projects after animals. I was playing that old-school game Lemmings, and Lemmy (from Motorhead) had passed away that week, and we held a few polls for names, and I went with that." According to the Fediverse statistics sites the-federation.info and fedidb.com, Lemmy had fewer than 100 instances prior to June 2023, but grew to 455 instances with approximately 48,600 monthly active users as of 22 December 2025, with the largest instances being lemmy.world and lemmy.ml, reporting about 14,144 and 1,982 monthly active users, respectively. == Description == Lemmy is made up of a network of individual installations of the Lemmy software that can intercommunicate. This departs from the centralized, monolithic structure of other social media platforms. It has been described as a federated alternative to Reddit. Users on individual instances submit posts with links, text, or pictures to user-created forums for discussion called "communities". Discussion is in the form of threaded comments. Posts and comments can be upvoted or downvoted though the ability to downvote can be disabled by the admins of each instance. Communities are local to each instance, however users may subscribe to communities, create posts and leave comments across instances. Moderation is conducted by the administrators of each instance and moderators of specific communities. Community names begin with c/ in the URL (e.g lemmy.ml/c/simpleliving) and are mentionable using the !community@instance format. On each instance, a front page presents the user with popular posts from several communities. These posts can then be filtered according to origin: posts from the instance the user is on, or from all federated instances. It can also be made to only show posts from communities the user has subscribed to. Lemmy instances are generally supported by donations. == Relations with other social networks == ActivityPub is the protocol used to allow Lemmy instances to operate as a federated social network. It allows users to interact with compatible platforms such as Kbin and Mastodon. In June 2023, following the announcement of Reddit API service changes intended to reduce the use of third-party Reddit clients, community members discussed relocating to Lemmy and other Reddit competitors. Reddit banned a user for promoting switching to Lemmy along with the r/LemmyMigration subreddit as a whole, leading to a Streisand effect after it garnered attention on sites like Hacker News. The ban was reversed a day later. == Third-party software == Prominent third-party Reddit clients Sync and Boost which had shut down due to changes to the pricing of Reddit's API began working on Lemmy clients, with them later relaunching as Sync for Lemmy and Boost for Lemmy. Multiple other apps and browser clients have also been developed.

SeaTable

SeaTable is a no-code platform that allows users to develop and implement business processes. The cloud collaboration service SeaTable is marketed by the GmbH of the same name with headquarters in Mainz and additional offices in Berlin and Beijing, and developed by the same company as Seafile. == History == SeaTable is a collaborative database and low-code application platform developed as part of a joint venture between Seafile Ltd., a software company based in Guangzhou, China, and SeaTable GmbH, a German firm headquartered in Mainz. Founded in 2020, the project represents the international expansion of Seafile, a Chinese developer originally known for its file synchronization and sharing software. While SeaTable's cloud services and European client operations are managed by the German entity, the platform itself is developed in China by Seafile's engineering team. This cross-border structure, described by TechCrunch as an “unconventional path” for a Chinese startup expanding abroad, reflects Seafile's effort to maintain its product development in China while addressing growing scrutiny in Western markets over data governance and corporate control. In 2021, an innovation project led by the Cyber Innovation Hub at the IT School of the German Armed Forces started to evaluate the possibilities of a large-scale deployment at the German Armed Forces. The evaluation project is currently still ongoing. In 2022, SeaTable is optimizing its database backend to allow millions of records within one base in the future. The focus of development is increasingly on automation and visualization. In 2025, SeaTable introduced AI-powered automations with version 6. The update enabled the integration of large language models (LLMs) for text analysis and automated decision-making. SeaTable operates a self-hosted LLM on servers provided by Hetzner (Germany), while self-hosted deployments can connect to any compatible model. == Features == SeaTable combines the traditional capabilities of a spreadsheet such as Excel and supplements them with a wide range of functions for process automation and visualization as well as a fully comprehensive API. SeaTable is not a pure cloud solution, but can alternatively be installed on a private server and operated completely autonomously. In this way, the owner retains full control over their own data. The installation is done via Docker on a Linux server. == Security and privacy == While most no-code platforms exist only as SaaS solutions, SeaTable describes itself as a data-sparse European solution. While initially the SeaTable Cloud was hosted on Amazon AWS, the move to the German data centers of Swiss provider Exoscale then took place in May 2021. This was followed by the replacement of the Freshdesk cloud ticketing system with a self-hosted Zammad instance, and since April 2022 SeaTable has completely dispensed with all tracking cookies on its website.

Availability zone

In cloud computing, an availability region is a group of data centres that are located in the same geographical region. Availability regions comprise multiple availability zones, which are groups of data centres that are located far enough from each other to prevent large-scale outages in the event of failure of a single zone, whilst still being close enough to each other to enable low-latency connections. Distributed systems spanning multiple availability zones allow for high availability, even in the event of catastrophic failure, such as natural disasters. Services offering distinct availability zones include Amazon Web Services, Microsoft Azure and Google Cloud.

Model-based clustering

In statistics, cluster analysis is the algorithmic grouping of objects into homogeneous groups based on numerical measurements. Model-based clustering based on a statistical model for the data, usually a mixture model. This has several advantages, including a principled statistical basis for clustering, and ways to choose the number of clusters, to choose the best clustering model, to assess the uncertainty of the clustering, and to identify outliers that do not belong to any group. == Model-based clustering == Suppose that for each of n {\displaystyle n} observations we have data on d {\displaystyle d} variables, denoted by y i = ( y i , 1 , … , y i , d ) {\displaystyle y_{i}=(y_{i,1},\ldots ,y_{i,d})} for observation i {\displaystyle i} . Then model-based clustering expresses the probability density function of y i {\displaystyle y_{i}} as a finite mixture, or weighted average of G {\displaystyle G} component probability density functions: p ( y i ) = ∑ g = 1 G τ g f g ( y i ∣ θ g ) , {\displaystyle p(y_{i})=\sum _{g=1}^{G}\tau _{g}f_{g}(y_{i}\mid \theta _{g}),} where f g {\displaystyle f_{g}} is a probability density function with parameter θ g {\displaystyle \theta _{g}} , τ g {\displaystyle \tau _{g}} is the corresponding mixture probability where ∑ g = 1 G τ g = 1 {\displaystyle \sum _{g=1}^{G}\tau _{g}=1} . Then in its simplest form, model-based clustering views each component of the mixture model as a cluster, estimates the model parameters, and assigns each observation to cluster corresponding to its most likely mixture component. === Gaussian mixture model === The most common model for continuous data is that f g {\displaystyle f_{g}} is a multivariate normal distribution with mean vector μ g {\displaystyle \mu _{g}} and covariance matrix Σ g {\displaystyle \Sigma _{g}} , so that θ g = ( μ g , Σ g ) {\displaystyle \theta _{g}=(\mu _{g},\Sigma _{g})} . This defines a Gaussian mixture model. The parameters of the model, τ g {\displaystyle \tau _{g}} and θ g {\displaystyle \theta _{g}} for g = 1 , … , G {\displaystyle g=1,\ldots ,G} , are typically estimated by maximum likelihood estimation using the expectation-maximization algorithm (EM); see also EM algorithm and GMM model. Bayesian inference is also often used for inference about finite mixture models. The Bayesian approach also allows for the case where the number of components, G {\displaystyle G} , is infinite, using a Dirichlet process prior, yielding a Dirichlet process mixture model for clustering. === Choosing the number of clusters === An advantage of model-based clustering is that it provides statistically principled ways to choose the number of clusters. Each different choice of the number of groups G {\displaystyle G} corresponds to a different mixture model. Then standard statistical model selection criteria such as the Bayesian information criterion (BIC) can be used to choose G {\displaystyle G} . The integrated completed likelihood (ICL) is a different criterion designed to choose the number of clusters rather than the number of mixture components in the model; these will often be different if highly non-Gaussian clusters are present. === Parsimonious Gaussian mixture model === For data with high dimension, d {\displaystyle d} , using a full covariance matrix for each mixture component requires estimation of many parameters, which can result in a loss of precision, generalizabity and interpretability. Thus it is common to use more parsimonious component covariance matrices exploiting their geometric interpretation. Gaussian clusters are ellipsoidal, with their volume, shape and orientation determined by the covariance matrix. Consider the eigendecomposition of a matrix Σ g = λ g D g A g D g T , {\displaystyle \Sigma _{g}=\lambda _{g}D_{g}A_{g}D_{g}^{T},} where D g {\displaystyle D_{g}} is the matrix of eigenvectors of Σ g {\displaystyle \Sigma _{g}} , A g = diag { A 1 , g , … , A d , g } {\displaystyle A_{g}={\mbox{diag}}\{A_{1,g},\ldots ,A_{d,g}\}} is a diagonal matrix whose elements are proportional to the eigenvalues of Σ g {\displaystyle \Sigma _{g}} in descending order, and λ g {\displaystyle \lambda _{g}} is the associated constant of proportionality. Then λ g {\displaystyle \lambda _{g}} controls the volume of the ellipsoid, A g {\displaystyle A_{g}} its shape, and D g {\displaystyle D_{g}} its orientation. Each of the volume, shape and orientation of the clusters can be constrained to be equal (E) or allowed to vary (V); the orientation can also be spherical, with identical eigenvalues (I). This yields 14 possible clustering models, shown in this table: It can be seen that many of these models are more parsimonious, with far fewer parameters than the unconstrained model that has 90 parameters when G = 4 {\displaystyle G=4} and d = 9 {\displaystyle d=9} . Several of these models correspond to well-known heuristic clustering methods. For example, k-means clustering is equivalent to estimation of the EII clustering model using the classification EM algorithm. The Bayesian information criterion (BIC) can be used to choose the best clustering model as well as the number of clusters. It can also be used as the basis for a method to choose the variables in the clustering model, eliminating variables that are not useful for clustering. Different Gaussian model-based clustering methods have been developed with an eye to handling high-dimensional data. These include the pgmm method, which is based on the mixture of factor analyzers model, and the HDclassif method, based on the idea of subspace clustering. The mixture-of-experts framework extends model-based clustering to include covariates. == Example == We illustrate the method with a dateset consisting of three measurements (glucose, insulin, sspg) on 145 subjects for the purpose of diagnosing diabetes and the type of diabetes present. The subjects were clinically classified into three groups: normal, chemical diabetes and overt diabetes, but we use this information only for evaluating clustering methods, not for classifying subjects. The BIC plot shows the BIC values for each combination of the number of clusters, G {\displaystyle G} , and the clustering model from the Table. Each curve corresponds to a different clustering model. The BIC favors 3 groups, which corresponds to the clinical assessment. It also favors the unconstrained covariance model, VVV. This fits the data well, because the normal patients have low values of both sspg and insulin, while the distributions of the chemical and overt diabetes groups are elongated, but in different directions. Thus the volumes, shapes and orientations of the three groups are clearly different, and so the unconstrained model is appropriate, as selected by the model-based clustering method. The classification plot shows the classification of the subjects by model-based clustering. The classification was quite accurate, with a 12% error rate as defined by the clinical classification. Other well-known clustering methods performed worse with higher error rates, such as single-linkage clustering with 46%, average link clustering with 30%, complete-linkage clustering also with 30%, and k-means clustering with 28%. == Outliers in clustering == An outlier in clustering is a data point that does not belong to any of the clusters. One way of modeling outliers in model-based clustering is to include an additional mixture component that is very dispersed, with for example a uniform distribution. Another approach is to replace the multivariate normal densities by t {\displaystyle t} -distributions, with the idea that the long tails of the t {\displaystyle t} -distribution would ensure robustness to outliers. However, this is not breakdown-robust. A third approach is the "tclust" or data trimming approach which excludes observations identified as outliers when estimating the model parameters. == Non-Gaussian clusters and merging == Sometimes one or more clusters deviate strongly from the Gaussian assumption. If a Gaussian mixture is fitted to such data, a strongly non-Gaussian cluster will often be represented by several mixture components rather than a single one. In that case, cluster merging can be used to find a better clustering. A different approach is to use mixtures of complex component densities to represent non-Gaussian clusters. == Non-continuous data == === Categorical data === Clustering multivariate categorical data is most often done using the latent class model. This assumes that the data arise from a finite mixture model, where within each cluster the variables are independent. === Mixed data === These arise when variables are of different types, such as continuous, categorical or ordinal data. A latent class model for mixed data assumes local independence between the variable. The location model relaxes the local independence assumption. The clustMD approach assumes that the observed variables are manifestations of underlying continuous Gaussian latent

OpenIO

OpenIO offered object storage for a wide range of high-performance applications. OpenIO was founded in 2015 by Laurent Denel (CEO), Jean-François Smigielski (CTO) and five other co-founders; it leveraged open source software, developed since 2006, based on a grid technology that enabled dynamic behaviour and supported heterogenous hardware. In October 2017 OpenIO was completed a $5 million funding rounds. In July 2020 OpenIO had been acquired by OVH and withdrawn from the market to become the core technology of OVHcloud object storage offering. == Software == OpenIO is a software-defined object store that supports S3 and can be deployed on-premises, cloud-hosted or at the edge, on any hardware mix. It has been designed from the beginning for performance and cost-efficiency at any scale, and it has been optimized for Big Data, HPC and AI. OpenIO stores objects within a flat structure within a massively distributed directory with indirections, which allows the data query path to be independent of the number of nodes and the performance not to be affected by the growth of capacity. Servers are organized as a grid of nodes massively distributed, where each node takes part in directory and storage services, which ensures that there is no single point of failure and that new nodes are automatically discovered and immediately available without the need to rebalance data. The software is built on top of a technology that ensures optimal data placement based on real-time metrics and allows the addition or removal of storage devices with automatic performance and load impact optimization. For data protection OpenIO has synchronous and asynchronous replication with multiple copies, and an erasure coding implementation based on Reed-Solomon that can be deployed in one data center or geo-distributed or stretched clusters. The software has a feature that catches all events that occur in the cluster and can pass them up in the stack or to applications running on OpenIO nodes. This enables event-driven computing directly into the storage infrastructure. The open source code is available on Github and it is licensed under AGPL3 for server code and LGPL3 for client code. == Performance == OpenIO claimed in 2019 to have reached 1.372 Tbit/s write speed (171 GB/s) on a cluster of 350 physical machines. The benchmark scenario, conducted under production conditions with standard hardware (commodity servers with 7200 rpm HDDs), consisted in backing up a 38 PB Hadoop datalake via the DistCp command. This level of performance marked, according to analysts, the arrival of a new generation of object storage technologies oriented toward high performance and hyper-scalability.