Distributed file system for cloud

Distributed file system for cloud

A distributed file system for cloud is a file system that allows many clients to have access to data and supports operations (create, delete, modify, read, write) on that data. Each data file may be partitioned into several parts called chunks. Each chunk may be stored on different remote machines, facilitating the parallel execution of applications. Typically, data is stored in files in a hierarchical tree, where the nodes represent directories. There are several ways to share files in a distributed architecture: each solution must be suitable for a certain type of application, depending on how complex the application is. Meanwhile, the security of the system must be ensured. Confidentiality, availability and integrity are the main keys for a secure system. Users can share computing resources through the Internet thanks to cloud computing which is typically characterized by scalable and elastic resources – such as physical servers, applications and any services that are virtualized and allocated dynamically. Synchronization is required to make sure that all devices are up-to-date. Distributed file systems enable many big, medium, and small enterprises to store and access their remote data as they do local data, facilitating the use of variable resources. == Overview == === History === Today, there are many implementations of distributed file systems. The first file servers were developed by researchers in the 1970s. Sun Microsystem's Network File System became available in the 1980s. Before that, people who wanted to share files used the sneakernet method, physically transporting files on storage media from place to place. Once computer networks started to proliferate, it became obvious that the existing file systems had many limitations and were unsuitable for multi-user environments. Users initially used FTP to share files. FTP first ran on the PDP-10 at the end of 1973. Even with FTP, files needed to be copied from the source computer onto a server and then from the server onto the destination computer. Users were required to know the physical addresses of all computers involved with the file sharing. === Supporting techniques === Modern data centers must support large, heterogenous environments, consisting of large numbers of computers of varying capacities. Cloud computing coordinates the operation of all such systems, with techniques such as data center networking (DCN), the MapReduce framework, which supports data-intensive computing applications in parallel and distributed systems, and virtualization techniques that provide dynamic resource allocation, allowing multiple operating systems to coexist on the same physical server. === Applications === Cloud computing provides large-scale computing thanks to its ability to provide the needed CPU and storage resources to the user with complete transparency. This makes cloud computing particularly suited to support different types of applications that require large-scale distributed processing. This data-intensive computing needs a high performance file system that can share data between virtual machines (VM). Cloud computing dynamically allocates the needed resources, releasing them once a task is finished, requiring users to pay only for needed services, often via a service-level agreement. Cloud computing and cluster computing paradigms are becoming increasingly important to industrial data processing and scientific applications such as astronomy and physics, which frequently require the availability of large numbers of computers to carry out experiments. == Architectures == Most distributed file systems are built on the client-server architecture, but other, decentralized, solutions exist as well. === Client-server architecture === Network File System (NFS) uses a client-server architecture, which allows sharing of files between a number of machines on a network as if they were located locally, providing a standardized view. The NFS protocol allows heterogeneous clients' processes, probably running on different machines and under different operating systems, to access files on a distant server, ignoring the actual location of files. Relying on a single server results in the NFS protocol suffering from potentially low availability and poor scalability. Using multiple servers does not solve the availability problem since each server is working independently. The model of NFS is a remote file service. This model is also called the remote access model, which is in contrast with the upload/download model: Remote access model: Provides transparency, the client has access to a file. He sends requests to the remote file (while the file remains on the server). Upload/download model: The client can access the file only locally. It means that the client has to download the file, make modifications, and upload it again, to be used by others' clients. The file system used by NFS is almost the same as the one used by Unix systems. Files are hierarchically organized into a naming graph in which directories and files are represented by nodes. === Cluster-based architectures === A cluster-based architecture ameliorates some of the issues in client-server architectures, improving the execution of applications in parallel. The technique used here is file-striping: a file is split into multiple chunks, which are "striped" across several storage servers. The goal is to allow access to different parts of a file in parallel. If the application does not benefit from this technique, then it would be more convenient to store different files on different servers. However, when it comes to organizing a distributed file system for large data centers, such as Amazon and Google, that offer services to web clients allowing multiple operations (reading, updating, deleting,...) to a large number of files distributed among a large number of computers, then cluster-based solutions become more beneficial. Note that having a large number of computers may mean more hardware failures. Two of the most widely used distributed file systems (DFS) of this type are the Google File System (GFS) and the Hadoop Distributed File System (HDFS). The file systems of both are implemented by user level processes running on top of a standard operating system (Linux in the case of GFS). ==== Design principles ==== ===== Goals ===== Google File System (GFS) and Hadoop Distributed File System (HDFS) are specifically built for handling batch processing on very large data sets. For that, the following hypotheses must be taken into account: High availability: the cluster can contain thousands of file servers and some of them can be down at any time A server belongs to a rack, a room, a data center, a country, and a continent, in order to precisely identify its geographical location The size of a file can vary from many gigabytes to many terabytes. The file system should be able to support a massive number of files The need to support append operations and allow file contents to be visible even while a file is being written Communication is reliable among working machines: TCP/IP is used with a remote procedure call RPC communication abstraction. TCP allows the client to know almost immediately when there is a problem and a need to make a new connection. ===== Load balancing ===== Load balancing is essential for efficient operation in distributed environments. It means distributing work among different servers, fairly, in order to get more work done in the same amount of time and to serve clients faster. In a system containing N chunkservers in a cloud (N being 1000, 10000, or more), where a certain number of files are stored, each file is split into several parts or chunks of fixed size (for example, 64 megabytes), the load of each chunkserver being proportional to the number of chunks hosted by the server. In a load-balanced cloud, resources can be efficiently used while maximizing the performance of MapReduce-based applications. ===== Load rebalancing ===== In a cloud computing environment, failure is the norm, and chunkservers may be upgraded, replaced, and added to the system. Files can also be dynamically created, deleted, and appended. That leads to load imbalance in a distributed file system, meaning that the file chunks are not distributed equitably between the servers. Distributed file systems in clouds such as GFS and HDFS rely on central or master servers or nodes (Master for GFS and NameNode for HDFS) to manage the metadata and the load balancing. The master rebalances replicas periodically: data must be moved from one DataNode/chunkserver to another if free space on the first server falls below a certain threshold. However, this centralized approach can become a bottleneck for those master servers, if they become unable to manage a large number of file accesses, as it increases their already heavy loads. The load rebalance problem is NP-hard. In order to get a large number of chunkservers to work in collaboration, and to

Lazy learning

(Not to be confused with the lazy learning regime, see Neural tangent kernel). In machine learning, lazy learning is a learning method in which generalization of the training data is, in theory, delayed until a query is made to the system, as opposed to eager learning, where the system tries to generalize the training data before receiving queries. The primary motivation for employing lazy learning, as in the K-nearest neighbors algorithm, used by online recommendation systems ("people who viewed/purchased/listened to this movie/item/tune also ...") is that the data set is continuously updated with new entries (e.g., new items for sale at Amazon, new movies to view at Netflix, new clips at YouTube, new music at Spotify or Pandora). Because of the continuous update, the "training data" would be rendered obsolete in a relatively short time especially in areas like books and movies, where new best-sellers or hit movies/music are published/released continuously. Therefore, one cannot really talk of a "training phase". Lazy classifiers are most useful for large, continuously changing datasets with few attributes that are commonly queried. Specifically, even if a large set of attributes exist - for example, books have a year of publication, author/s, publisher, title, edition, ISBN, selling price, etc. - recommendation queries rely on far fewer attributes - e.g., purchase or viewing co-occurrence data, and user ratings of items purchased/viewed. == Advantages == The main advantage gained in employing a lazy learning method is that the target function will be approximated locally, such as in the k-nearest neighbor algorithm. Because the target function is approximated locally for each query to the system, lazy learning systems can simultaneously solve multiple problems and deal successfully with changes in the problem domain. At the same time they can reuse a lot of theoretical and applied results from linear regression modelling (notably PRESS statistic) and control. It is said that the advantage of this system is achieved if the predictions using a single training set are only developed for few objects. This can be demonstrated in the case of the k-NN technique, which is instance-based and function is only estimated locally. == Disadvantages == Theoretical disadvantages with lazy learning include: The large space requirement to store the entire training dataset. In practice, this is not an issue because of advances in hardware and the relatively small number of attributes (e.g., as co-occurrence frequency) that need to be stored. Particularly noisy training data increases the case base unnecessarily, because no abstraction is made during the training phase. In practice, as stated earlier, lazy learning is applied to situations where any learning performed in advance soon becomes obsolete because of changes in the data. Also, for the problems for which lazy learning is optimal, "noisy" data does not really occur - the purchaser of a book has either bought another book or hasn't. Lazy learning methods are usually slower to evaluate. In practice, for very large databases with high concurrency loads, the queries are not postponed until actual query time, but recomputed in advance on a periodic basis - e.g., nightly, in anticipation of future queries, and the answers stored. This way, the next time new queries are asked about existing entries in the database, the answers are merely looked up rapidly instead of having to be computed on the fly, which would almost certainly bring a high-concurrency multi-user system to its knees. Larger training data also entail increased cost. Particularly, there is the fixed amount of computational cost, where a processor can only process a limited amount of training data points. There are standard techniques to improve re-computation efficiency so that a particular answer is not recomputed unless the data that impact this answer has changed (e.g., new items, new purchases, new views). In other words, the stored answers are updated incrementally. This approach, used by large e-commerce or media sites, has long been used in the Entrez portal of the National Center for Biotechnology Information (NCBI) to precompute similarities between the different items in its large datasets: biological sequences, 3-D protein structures, published-article abstracts, etc. Because "find similar" queries are asked so frequently, the NCBI uses highly parallel hardware to perform nightly recomputation. The recomputation is performed only for new entries in the datasets against each other and against existing entries: the similarity between two existing entries need not be recomputed. == Examples of Lazy Learning Methods == K-nearest neighbors, which is a special case of instance-based learning. Local regression. Lazy naive Bayes rules, which are extensively used in commercial spam detection software. Here, the spammers keep getting smarter and revising their spamming strategies, and therefore the learning rules must also be continually updated.

Knowledge Engineering Environment

Knowledge Engineering Environment (KEE) is a frame-based development tool for expert systems. It was developed and sold by IntelliCorp, and was first released in 1983. It ran on Lisp machines, and was later ported to Lucid Common Lisp with the CLX library, an X Window System (X11) interface for Common Lisp. This version was available on several different UNIX workstations. On KEE, several extensions were offered: Simkit, a frame-based simulation library KEEconnection, database connection between the frame system and relational databases In KEE, frames are called units. Units are used for both individual instances and classes. Frames have slots and slots have facets. Facets can describe, for example, a slot's expected values, its working value, or its inheritance rule. Slots can have multiple values. Behavior can be implemented using a message passing model. KEE provides an extensive graphical user interface (GUI) to create, browse, and manipulate frames. KEE also includes a frame-based rule system. In the KEE knowledge base, rules are frames. Both forward chaining and backward chaining inference are available. KEE supports non-monotonic reasoning through the concepts of worlds. Worlds allow providing alternative slot-values of frames. Through an assumption-based truth or reason maintenance system, inconsistencies can be detected and analyzed. ActiveImages allows graphical displays to be attached to slots of Units. Typical examples are buttons, dials, graphs, and histograms. The graphics are also implemented as Units via KEEPictures, a frame-based graphics library.

Regulation of artificial intelligence in the United States

The United States federal government and state governments have developed some regulation of artificial intelligence, including executive orders, federal laws, and state laws. Federal agencies have also developed some sector-specific regulations related to AI. At the federal level, the Biden administration released an October 2023 executive order about AI safety and security, Executive Order 14110, with directives related to AI development and deployment. President Trump revoked that executive order in January 2025 and issued Executive Order 14179. In December 2025, President Trump signed Executive Order 14365, an executive order directing federal agencies to develop a unified national approach to AI policy, evaluate state AI laws for potential conflicts, challenge them through legal action, and condition certain federal funding on state compliance, while exempting state laws related to child safety, data center infrastructure, and state government procurement. In 2025, Congress passed legislation targeting AI-generated deepfakes, the TAKE IT DOWN Act. Several U.S. states have enacted laws related to artificial intelligence. Some are already in effect, including in California. Other states have AI-related legislation coming into effect in 2026 and 2027. In 2025 and 2026, the Trump administration mentioned the patchwork nature of state legislation as a motivation for its push for unified national legislation regulating AI. The administration has criticized state lawmakers, threatened to sue states, and issued letters to discourage them from regulating AI companies and products; some states have continued to propose and enact related laws. Discussions about regulating AI have included topics such as the timeliness of regulating AI, the nature of the federal regulatory framework to govern and promote AI, including what agency should lead, the regulatory and governing powers of that agency, and how to update regulations in the face of rapidly changing technology, as well as the roles of state governments and courts. == Federal government == === Obama administration (2009–2017) === As early as 2016, the Obama administration had begun to focus on the risks and regulations for artificial intelligence. In an October 2016 report titled Preparing For the Future of Artificial Intelligence, the National Science and Technology Council set a precedent to allow researchers to continue to develop new AI technologies with few restrictions. The report stated that "the approach to regulation of AI-enabled products to protect public safety should be informed by assessment of the aspects of risk". The first National Artificial Intelligence Research And Development Strategic Plan was published in October 2016. === First Trump administration (2017–2021) === On August 13, 2018, Section 1051 of the Fiscal Year 2019 John S. McCain National Defense Authorization Act (P.L. 115-232) established the National Security Commission on Artificial Intelligence "to consider the methods and means necessary to advance the development of artificial intelligence, machine learning, and associated technologies to comprehensively address the national security and defense needs of the United States." Steering on regulating security-related AI is provided by the National Security Commission on Artificial Intelligence. The Artificial Intelligence Initiative Act (S.1558) is a proposed bill that would establish a federal initiative designed to accelerate research and development on AI for, inter alia, the economic and national security of the United States. On January 7, 2019, following an Executive Order on Maintaining American Leadership in Artificial Intelligence, the White House's Office of Science and Technology Policy released a draft Guidance for Regulation of Artificial Intelligence Applications, which includes ten principles for United States agencies when deciding whether and how to regulate AI. In response, the National Institute of Standards and Technology released a position paper, and the Defense Innovation Board issued recommendations on the ethical use of AI. A year later, the administration called for comments on regulation in another draft of its Guidance for Regulation of Artificial Intelligence Applications. Other specific agencies working on the regulation of AI included the Food and Drug Administration, which created pathways to regulate the incorporation of AI in medical imaging. The National Science and Technology Council also published an updated National Artificial Intelligence Research and Development Strategic Plan in 2019, which received public scrutiny and recommendations to further improve it towards enabling Trustworthy AI. === Biden administration (2021–2025) === In March 2021, the National Security Commission on Artificial Intelligence released their final report. In the report, they stated, "Advances in AI, including the mastery of more general AI capabilities along one or more dimensions, will likely provide new capabilities and applications. Some of these advances could lead to inflection points or leaps in capabilities. Such advances may also introduce new concerns and risks and the need for new policies, recommendations, and technical advances to assure that systems are aligned with goals and values, including safety, robustness and trustworthiness." In June 2022, Senators Rob Portman and Gary Peters introduced the Global Catastrophic Risk Management Act. The bipartisan bill "would also help counter the risk of artificial intelligence... from being abused in ways that may pose a catastrophic risk". On October 4, 2022, President Joe Biden unveiled a new AI Bill of Rights, which outlines five protections Americans should have in the AI age: 1. Safe and Effective Systems, 2. Algorithmic Discrimination Protection, 3.Data Privacy, 4. Notice and Explanation, and 5. Human Alternatives, Consideration, and Fallback. The bill was formally published in October 2022 by the Office of Science and Technology Policy (OSTP), a U.S. government office that advises the President on science and technology policy matters. In July 2023, the Biden administration secured voluntary commitments from seven companies – Amazon, Anthropic, Google, Inflection, Meta, Microsoft, and OpenAI – to manage the risks associated with AI. The companies committed to ensure AI products undergo both internal and external security testing before public release; to share information on the management of AI risks with the industry, governments, civil society, and academia; to prioritize cybersecurity and protect proprietary AI system components; to develop mechanisms to inform users when content is AI-generated, such as watermarking; to publicly report on their AI systems' capabilities, limitations, and areas of use; to prioritize research on societal risks posed by AI, including bias, discrimination, and privacy concerns; and to develop AI systems to address societal challenges, ranging from cancer prevention to climate change mitigation. In September 2023, eight additional companies – Adobe, Cohere, IBM, Nvidia, Palantir, Salesforce, Scale AI, and Stability AI – subscribed to these voluntary commitments. In January 2023, the National Institute of Standards and Technology (NIST) released the Artificial Intelligence Risk Management Framework (AI RMF 1.0), providing voluntary guidance for organizations to identify, assess, and manage risks associated with AI systems. The Biden administration, in October 2023 signaled that they would release an executive order leveraging the federal government's purchasing power to shape AI regulations, hinting at a proactive governmental stance in regulating AI technologies. On October 30, 2023, President Biden released Executive Order 14110 on Safe, Secure, and Trustworthy Artificial Intelligence. The Executive Order includes directives on standards for critical infrastructure, AI-enhanced cybersecurity, and federally funded biological synthesis projects. The Executive Order provides the authority to various agencies and departments of the US government, including the Energy and Defense departments, to apply existing consumer protection laws to AI development. The Executive Order builds on the Administration's earlier agreements with AI companies to instate new initiatives to "red-team" or stress-test AI dual-use foundation models, especially those that have the potential to pose security risks, with data and results shared with the federal government. The Executive Order also recognizes AI's social challenges, and calls for companies building AI dual-use foundation models to be wary of these societal problems. For example, the Executive Order states that AI should not "worsen job quality", and should not "cause labor-force disruptions". Additionally, Biden's Executive Order mandates that AI must "advance equity and civil rights", and cannot disadvantage marginalized groups. It also called for foundation models to include "watermarks" to help the publi

StepFun

Shanghai Jieyue Xingchen Intelligent Technology Co., Ltd, known as StepFun, is an artificial intelligence (AI) company based in Shanghai, China. It has been dubbed one of China's "AI Tiger" companies by investors. == Background == StepFun was founded in April 2023 by former Microsoft employees. Investors include Tencent, Qiming Venture Partners and Shanghai State-owned Capital Investment. In July 2025 at the World Artificial Intelligence Conference, StepFun announced the "Model-Chip Ecosystem Innovation Alliance" which consisted of Chinese developers of large language models (LLMs) and AI chip manufacturers. This included companies such as Huawei, Biren Technology, Moore Threads and Enflame. Another second alliance named the "Shanghai General Chamber of Commerce AI Committee" was also established that included StepFun, SenseTime, MiniMax, MetaX and Iluvatar CoreX. On 25 February 2026, it was reported that StepFun was seeking an initial public offering on the Hong Kong Stock Exchange. StepFun focuses on multimodal models which are designed to understand multiple types of input data such as text, video and audio. == Products == In July 2024 at the World Artificial Intelligence Conference, StepFun officially launched Step-2, a trillion-parameter LLM, along with the Step-1.5V multimodal model and the Step-1X image generation model. In February 2025, StepFun and Geely jointly announced the open-sourcing of two multimodal large models to global developers. They were Step-Video-T2V and Step-Audio. In July 2025, StepFun released Step 3. The Model-Chip Ecosystem Innovation Alliance aimed to optimize Step 3 for domestic chips. In April 2025, Step-R1-V-Mini was released. It is a multimodal reasoning model designed for visual interpretation and image understanding. In February 2026, Step-3.5-Flash, a mixture-of-experts model with 196 billion parameters and 11 billion active parameters was released under the free and open-source Apache 2.0 license. It supports tool use and a 256k token context window. == Models ==

Ave!Comics

Ave!Comics Production is a privately owned French company editing comics on smartphones, tablets and computers. It was founded in 2008 and it is a subsidiary of Aquafadas, a software development company in digital publishing owned by Kobo Inc. AveComics is a comic book store for digital comic books that can be used on computers, tablets, and smartphones.(iOS, Android) Readers can buy and read comic books, manga and graphic novels in French, English and Spanish. AveComics uses a technology created by Aquafadas for comics transformation, distribution and reading, based around its AVE format. The AveComics application was also a finalist in the BlackBerry Innovation Awards 2009, in the "Entertainment" category. == Company history == Aquafadas, a company working on creative software for Flash, HTML5, photo, and video editing, created the application MyComics to allow the reading of comics on mobile in 2006. This application was made available in 2008, to enable the reading and storing of comics on iPhone and iPod Touch. A reading system adapted to low resolution screens was also available. In October of the same year, the company launched a comics library on both devices, in partnership with the Angoulême International Comics Festival, Fnac and SNCF. This library included the official selection of the festival, and was downloaded over 150 000 times. In December 2008 "The Adventures of Lucky Luke n°3", at Lucky Comics was published on both devices. The comic made a 50 000 € turnover. In April 2009, "Les Blondes" 10th volume was the top-selling comic for 10 months on the AppStore. After, in August 2009, the AveComics application was launched on iPhone, iPod Touch and BlackBerry. The company's website was launched in September when more than 100 titles were available on smartphones and computers. == Catalogue == AveComics works with over 80 international publishers including Glénat, Marsu Productions, Delcourt, Casterman, Soleil, Ubisoft, Les Humanoïdes Associés and Mad Fabrik. Comics such as "Assassin's Creed", "Talisman", "Titeuf", and "Seoul District" are sold by the company. == Award == Grand Prix Software Venture Capital - Senate 2008.

Semantic knowledge management

In computer science, semantic knowledge management is a set of practices that seeks to classify content so that the knowledge it contains may be immediately accessed and transformed for delivery to the desired audience, in the required format. This classification of content is semantic in its nature – identifying content by its type or meaning within the content itself and via external, descriptive metadata – and is achieved by employing XML technologies. The specific outcomes of these practices are: Maintain content for multiple audiences together in a single document Transform content into various delivery formats without re-authoring Search for content more effectively Involve more subject-matter experts in the creation of content without reducing quality Reduce production costs for delivery formats Reduce the manual administration of getting the right knowledge to the right people Reduce the cost and time to localize content == Notable semantic knowledge management systems == Learn eXact Thinking Cap LCMS Thinking Cap LMS Xyleme LCMS iMapping