Colossus is a supercomputer developed by xAI. Construction began in 2024 in Memphis, Tennessee; the system became operational in July 2024. It is currently the world's largest AI supercomputer. Colossus's primary purpose is to train the company's chatbot, Grok. In addition, Colossus provides computing support to the social-media platform X and to other projects of Elon Musk, such as SpaceX. In 2025, it expanded to neighboring Southaven, Mississippi across the Tennessee–Mississippi border. As of May 6, 2026, Anthropic has agreed to rent all compute capacity at the Colossus 1 data center. == Background == Colossus was launched in September 2024 at a former Electrolux site in South Memphis to train the AI language model Grok. Within 19 days of the project's conception, xAI was ready to begin construction. The site was chosen because the abandoned Electrolux building could be repurposed to expedite construction and its proximity to a nearby wastewater treatment facility provided a water source. As of February 2025, xAI plans to build an $80 million facility to process additional wastewater for use at the supercomputer. === xAI === Musk incorporated xAI in March 2023 with the stated purpose of understanding the "nature of the universe". The team includes former members of OpenAI, DeepMind, Microsoft, and Tesla. Musk was one of the founding members of the company OpenAI, investing up to US$45 million in 2015. He left OpenAI in 2018, reportedly to avoid conflicts of interest with Tesla. It has also been reported that he had made a bid for leadership at OpenAI and left when his proposal was rejected. The exact reasons for his departure from the company are unclear. Both Dell Technologies and Supermicro partnered with xAI to build the supercomputer. It was originally powered by 100,000 Nvidia graphics processing units (GPUs) and was constructed in 122 days. 
Three months after the first 100,000 GPUs were deployed, xAI announced that it had increased the system to 200,000 GPUs and that it intended to continue increasing the computer's processing power to 1 million GPUs. As of April 2025, xAI claimed Colossus was the largest AI training platform in the world. == Choice of location == xAI selected Memphis, in southwestern Tennessee, as the site for Colossus in part because an existing industrial facility allowed the project to proceed more quickly than constructing a new data center. Elon Musk was initially told that building a data center would take 18–24 months. The company instead searched for a vacant facility and selected the former Electrolux factory in Memphis. Electrolux opened the facility in 2012 and operated it for about eight years before closing it in 2020 after relocating operations to Springfield, Tennessee. The building covered 785,000 sq ft (72,900 m²) and had been purchased by Phoenix Investors in December 2023 for $35 million. Because the structure was already in place, work on the supercomputer could begin immediately rather than waiting for a new facility to be constructed. According to Forbes, xAI considered seven or eight other sites before selecting Memphis, and Musk finalized the decision to build in Memphis in about a week. The decision was finalized in March 2024, after which construction began. xAI publicly announced in June 2024 that Colossus would be built in Memphis. The building itself was not the only reason xAI selected Memphis. According to the Greater Memphis Chamber, the company chose the city because of its "reliable power grid, ability to create a water recycling facility, proximity to the Mississippi River and ample land". The city was also able to provide the large amounts of electricity and water needed to operate the supercomputer. At full capacity, the system was expected to require 150 megawatts of electricity and millions of gallons of water per day. 
The project also relied on partnerships with local and regional organizations including Memphis Light, Gas and Water (MLGW), Tennessee Valley Authority (TVA), the City of Memphis, and Shelby County. The city also provided financial incentives for the project. == Environmental impact == AI data centers consume large amounts of energy. At the site of Colossus in South Memphis, the grid connection was only 8 MW, so xAI applied to temporarily set up more than a dozen gas turbines (Voltagrid’s 2.5 MW units and Solar Turbines’ 16 MW SMT-130s), which would steadily burn methane gas from a 16-inch natural gas main. Aerial imagery in April 2025 showed that 35 gas turbines had been set up, with a combined capacity of 422 MW. These turbines have been estimated to generate about "72 megawatts, which is approximately 3% of the (TVA) power grid". The higher number of gas turbines and the resulting emissions require xAI to have a major source permit. In Memphis, xAI was able to avoid some environmental rules in the construction of Colossus, for example by operating the on-site methane gas turbines without permits because they are "portable". The Shelby County Health Department told NPR that "it only regulates gas-burning generators if they're in the same location for more than 364 days". However, in a January 2026 ruling, the EPA revised its New Source Performance Standard and announced that large methane gas turbines require permits even for temporary operations. In November 2024, the grid connection was upgraded to 150 MW, and some turbines were removed. Along with high electricity needs, the expected water demand is over five million gallons of water per day. While xAI has stated that it plans to work with MLGW on a wastewater treatment facility and the installation of 50 megawatts of large battery storage facilities, there are currently no concrete plans in place aside from a one-page factsheet shared by MLGW. 
== Community response == The plan to build Colossus in Memphis was unknown to residents, City Council members, and environmental agencies. Many did not find out about the project until the day before, or the day of, as they watched the announcement on the local news. Keshaun Pearson, president of Memphis Community Against Pollution, stated that there is a historical lack of transparency and communication surrounding environmental issues in Memphis. Some community members in Memphis have expressed concern about the potential for additional air and water pollution caused by the supercomputer. In a letter to the Shelby County Health Department, the Southern Environmental Law Center stated the emissions from the turbines make the facility "...likely the largest industrial emitter of NOx in Memphis..." This is due to data supplied by the manufacturer showing that "...xAI emits between 1,200 and 2,000 tons of smog-forming nitrogen oxides (NOx)..." At a public Shelby County Commissioner's hearing on April 9, 2025, residents living near the site of Colossus voiced complaints about air quality, noting that they have chronic respiratory issues related to living in a polluted section of Memphis. One woman said she smells "everything but the right thing and the right thing is the clean air." Other residents voiced frustration that Brent Mayo, the senior xAI official responsible for building out xAI's infrastructure, did not attend the meeting to discuss community concerns. Keshaun Pearson also stated that "We're getting more and more days a year where it is unhealthy for us to go outside." People living near the site of Colossus have said they were not offered the opportunity for a public review of the plans, nor were they provided with information on how their community could potentially benefit. The community is also concerned about the strain on the power grid. Memphis's peak demand is around 3 GW. 
In November 2024, TVA approved xAI's request for more than 100 megawatts of power for Colossus, to be supplied through MLGW. In December 2022, MLGW imposed (then rescinded) rolling blackouts during several days of extreme cold that strained the power grid. In a letter to the TVA, the SELC "urged the agency to 'prioritize Memphis families' access to reliable power over the 'secondary purpose' of serving xAI". == Current progress == In early December 2024, Ted Townsend detailed how Colossus had doubled its processing capability. When it first went online in September 2024, it was using "100,000 Nvidia H100 processing chips". This initial launch demonstrated Colossus to be the largest supercomputer globally. The maximum power consumption increased from 150 to 250 MW. As of June 2025, the supercomputer consists of 150,000 H100 GPUs, 50,000 H200 GPUs, and 30,000 GB200 GPUs. Another 110,000 GB200 GPUs are to be brought online at a second data center, also in the Memphis area. This expansion has already been discussed and will form the second phase of the project. xAI also plans to increase Colossus to 1 million GPUs. The supercomputer currently utilizes gas turbines for power, alongside 168 Tesla Megapack battery storage units. xAI is also looking to add more
Product-family engineering
Product-family engineering (PFE), also known as product-line engineering (PLE), is based on the ideas of "domain engineering" developed by the Software Engineering Institute; the term was coined by James Neighbors in his 1980 dissertation at the University of California, Irvine. Software product lines are quite common in our daily lives, but before a product family can be successfully established, an extensive process has to be followed. This process is known as product-family engineering. Product-family engineering can be defined as a method that creates an underlying architecture of an organization's product platform. It provides an architecture that is based on commonality as well as planned variabilities. The various product variants can be derived from the basic product family, which creates the opportunity to reuse components and to differentiate the products in the family. Product-family engineering is conceptually similar to the widespread use of vehicle platforms in the automotive industry. Product-family engineering is a relatively new approach to the creation of new products, recently evolving to Model-Based Product Line Engineering (MBPLE), which emphasizes the centrality of a model-centric approach in PLE. It focuses on the process of engineering new products in such a way that it is possible to reuse product components and apply variability with decreased costs and time. Product-family engineering is all about reusing components and structures as much as possible, according to ISO/IEC 26550:2015 and the more recent ISO/IEC 26580:2021, which introduced the concept of feature-based Product Line Engineering. Several studies have reported that using a product-family engineering approach for product development can have several benefits, including higher productivity, higher quality, faster time-to-market, and lower labor needs. The Nokia case mentioned below also illustrates these benefits. 
In 2025 the publishing of the book Model-Based Product Line Engineering (MBPLE): The feature-based path to product lines success by Marco Forlingieri, Tim Weilkiens and Hugo Guillermo Chalé-Gongora formalized the foundation of the discipline, including best practices and new industrial cases. == Overall process == The product family engineering process consists of several phases. The three main phases are: Phase 1: Product management Phase 2: Domain engineering Phase 3: Product engineering The process has been modeled on a higher abstraction level. This has the advantage that it can be applied to all kinds of product lines and families, not only software. The model can be applied to any product family. Figure 1 (below) shows a model of the entire process. Below, the process is described in detail. The process description contains elaborations of the activities and the important concepts being used. All concepts printed in italic are explained in Table 1. === Phase 1: product management === The first phase is the starting up of the whole process. In this phase some important aspects are defined especially with regard to economic aspects. This phase is responsible for outlining market strategies and defining a scope, which tells what should and should not be inside the product family. ==== Evaluate business visioning ==== During this first activity all context information relevant for defining the scope of the product line is collected and evaluated. It is important to define a clear market strategy and take external market information into account, such as consumer demands. The activity should deliver a context document that contains guidelines, constraints and the product strategy. ==== Define product line scope ==== Scoping techniques are applied to define which aspects are within the scope. This is based upon the previous step in the process, where external factors have been taken into account. 
The output is a product portfolio description, which includes a list of current and future products and also a product roadmap. It can be argued whether phase 1, product management, is part of the product-family-engineering process, because it could be seen as an individual business process that is more focused on the management aspects instead of the product aspect. However phase 2 needs some important input from this phase, as a large piece of the scope is defined in this phase. So from this point of view it is important to include the product-management phase (phase 1) into the entire process as a base for the domain-engineering process. === Phase 2: domain engineering === During the domain-engineering phases, the variable and common requirements are gathered for the whole product line. The goal is to establish a reusable platform. The output of this phase is a set of common and variable requirements for all products in the product line. ==== Analyze domain requirements ==== This activity includes all activities for analyzing the domain with regard to concept requirements. The requirements are categorized and split up into two new activities. The output is a document with the domain analysis. As can be seen in Figure 1 the process of defining common requirements is a parallel process with defining variable requirements. Both activities take place at the same time. ==== Define common requirements ==== Includes all activities for eliciting and documenting the common requirements of the product line, resulting in a document with reusable common requirements. ==== Define variable requirements ==== Includes all activities for eliciting and documenting the variable requirements of the product line, resulting in a document with variable requirements. ==== Design domain ==== This process step consists of activities for defining the reference architecture of the product line. This generates an abstract structure for all products in the product line. 
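The split between common and variable requirements described in the domain-engineering phase can be illustrated with a small sketch. It assumes, purely for illustration, that each planned product is described by a set of requirement names; the product names and requirements below are hypothetical and are not taken from the source. A requirement is treated as common if every product needs it and as variable otherwise.

```python
from typing import Dict, Set, Tuple


def split_requirements(products: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Split requirements into common (needed by every product)
    and variable (needed by only some products)."""
    if not products:
        return set(), set()
    requirement_sets = list(products.values())
    common = set.intersection(*requirement_sets)
    variable = set.union(*requirement_sets) - common
    return common, variable


# Hypothetical requirement sets for three planned products in one product line.
planned_products = {
    "basic": {"make_calls", "send_sms"},
    "camera": {"make_calls", "send_sms", "take_photos"},
    "music": {"make_calls", "send_sms", "play_music"},
}

common, variable = split_requirements(planned_products)
print("common:", sorted(common))
print("variable:", sorted(variable))
```

In practice the categorization is an analysis activity rather than a set operation, but the sketch shows the intended outcome: one reusable set of common requirements and one set of planned variabilities.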
==== Implement domain ==== During this step a detailed design of the reusable components and the implementation of these components are created. ==== Test domain ==== Validates and verifies the reusability of components. Components are tested against their specifications. After successful testing of all components in different use cases and scenarios, the domain engineering phase has been completed. === Phase 3: product engineering === In the final phase, a product X is engineered. Product X uses the commonalities and variability from the domain engineering phase, so it is derived from the platform established in that phase. It takes all common requirements and similarities from the preceding phase and adds its own variable requirements. Using the base from the domain engineering phase and the individual requirements of the product engineering phase, a complete and new product can be built. After the product has been fully tested and approved, product X can be delivered. ==== Define product requirements ==== Develops the product requirements specification for the individual product and reuses the requirements from the preceding phase. ==== Design product ==== All activities for producing the product architecture. This step makes use of the reference architecture from the step "design domain": it selects and configures the required parts of the reference architecture and incorporates product-specific adaptations. ==== Build product ==== During this process the product is built, using selections and configurations of the reusable components. ==== Test product ==== During this step the product is verified and validated against its specifications. A test report gives information about all tests that were carried out, which provides an overview of possible errors in the product. If the product is not accepted in the next step, the process loops back to "build product"; in Figure 1 this is indicated as "[unsatisfied]". 
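The derivation of a product from the platform, as described in the product engineering phase, can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed implementation: the `Platform` class, the `derive_product` function, and all component names are hypothetical. The product takes every common component, a validated selection of the platform's optional components, and its own product-specific additions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Platform:
    """Reusable platform produced by domain engineering (hypothetical model)."""
    common: FrozenSet[str]    # components every product receives
    optional: FrozenSet[str]  # planned variability offered by the platform


def derive_product(
    platform: Platform,
    selected: Set[str],
    specific: FrozenSet[str] = frozenset(),
) -> Set[str]:
    """Build a product from the common components, a selection of optional
    components, and product-specific adaptations."""
    unknown = selected - platform.optional
    if unknown:
        # A selection outside the planned variability cannot be reused from the platform.
        raise ValueError(f"not offered by the platform: {sorted(unknown)}")
    return set(platform.common) | selected | set(specific)


platform = Platform(
    common=frozenset({"call_stack", "messaging"}),
    optional=frozenset({"camera", "music_player"}),
)

product_x = derive_product(platform, {"camera"}, frozenset({"custom_case"}))
print(sorted(product_x))
```

Rejecting selections that the platform does not offer mirrors the idea that variability is planned during domain engineering rather than invented per product.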
==== Deliver and support product ==== The final step is the acceptance of the final product. If it has been successfully tested and approved as complete, it can be delivered. If the product does not satisfy the specifications, it has to be rebuilt and tested again. The next figure shows the overall process of product-family engineering as described above. It is a full process overview with all concepts attached to the different steps. == Process data diagram == On the left side, the entire process is drawn from top to bottom. All activities on the left side are linked to the concepts on the right side through dotted lines. Every concept has a number, which reflects the association with other concepts. == List of concepts == The concepts are explained in the list below. Most concept definitions are extracted from Pohl, Böckle, & van der Linden (2005); some new definitions have also been added. Table 1: List of concepts == Example == There are some good examples of the use of product family engineering, which were quite successful. The abstract model of product family engineering allows different kinds of uses; most of them are related to the consumer electronics m
Sunrise Calendar
Sunrise is a discontinued electronic calendar application for mobile and desktop. The service was launched in 2013 by designers Pierre Valade and Jeremy Le Van. In October 2015, Microsoft announced that it had merged the Sunrise Calendar team into the larger Microsoft Outlook team, where they would work closely with the Microsoft Outlook Mobile service. == History == The first iteration of Sunrise launched in 2012 and was a daily email digest of appointments, events and birthdays. Sunrise was launched initially as an iPhone application on February 19, 2013. In June 2013, Sunrise raised $2.2 million (~$2.91 million in 2024) in venture funding from Resolute.vc, NextView Ventures, Lerer Hippeau Ventures, SV Angel, and angel investors such as Loïc Le Meur, Dave Morin, and Fabrice Grinda. In May 2014, Sunrise launched on Android as well as on the web via a web application. In July 2014, Sunrise announced it had raised $6 million (~$7.81 million in 2024) in a Series A round from Balderton Capital. Bernard Liautaud joined the board. On February 11, 2015, Sunrise Atelier, Inc. was acquired by Microsoft for US$100 million (~$129 million in 2024). On October 28, 2015, Microsoft announced that Sunrise would be discontinued, and its functionality merged into Outlook Mobile. Microsoft later stated that the app would permanently cease functioning on August 31, 2016, but the shutdown was delayed to September 13, 2016, to coincide with an update to Outlook Mobile that incorporated aspects of Sunrise into its calendar interface. == Features == Sunrise allowed users to connect with Google Calendar, iCloud calendar and Exchange Server. The following third-party services featured integration with Sunrise: Foursquare, GitHub, TripIt, Asana, Evernote, Google Tasks, Trello, Songkick, and Wunderlist. Users could sign in and use the Sunrise web app in a web browser, with no downloads required. 
A native Sunrise app could also be downloaded for OS X 10.9 and later, iOS 8.0 and later (both iPhone and iPad) as well as Android phones and tablets. In May 2015, Sunrise launched Meet, a keyboard for Android and iOS that lets users select available time slots in their calendar to schedule one-to-ones.
MetroHero
MetroHero is a semi-defunct real-time transit tracking and performance analysis application for the Washington Metro rapid transit system. Originally available on iOS, Android, and the web, it allows users to view live maps of all trains on a specific line, summary statistics relating to real-time system performance, and user feedback on current Metro conditions. The app launched in 2015, followed by ARIES for Transit, a related project from the same developers, and continued functioning until its original developers shut it down in 2023. Afterwards, forks of the application went live to allow for its continued public use, and the Washington Metropolitan Area Transit Authority (WMATA), Metro's operator, announced that it would launch a similar app. The app has been described by local news media as popular and well-liked among Washington, D.C.-area residents. == History and main development == MetroHero was initially developed by James and Jennifer Pizzurro, who both attended George Washington University and studied computer science. They said that they were inspired to create the app after experiencing train delays and searching for an app to track a train after boarding; such an app did not exist for the Washington Metro. The development of the app was not endorsed by WMATA, but it did use publicly available data from the agency. MetroHero launched as an Android application in September 2015, followed by the release of an iOS-compatible web app in December of that year. A standalone iOS app launched in April 2018, but the web app remained supported. By April 2018, MetroHero had approximately 13,000 monthly active users. James Pizzurro has stated that the app's intended audience was regular Metro commuters who wanted to communicate with each other about active problems, as opposed to tourists and riders who only wanted train time data. 
Throughout the application's development, the Pizzurros had been advocates for Metro's transparency with riders and the community by providing more high-quality data and taking on the feedback of developers. In particular, they criticized Metro's reluctance to uniquely identify individual train trips and its decision to obscure data under certain circumstances, which have posed problems for MetroHero's data collection. In addition to their work on MetroHero, the app's developers led or participated in other initiatives related to transit in the Greater Washington area. In 2019, MetroHero partnered with a local transit group to analyze Metrobus data and publish a "Metrobus Report Card", along with proposed goals and recommendations based on the report's findings. Based on this experience, MetroHero's developers began a sister project, the Adherence + Reliability + Integrity Evaluation System for Transit (ARIES for Transit), which displays data and issues grades for Washington- and Baltimore-area transit systems. Separately, James Pizzurro used MetroHero data to inform Rail Transit OPS, an independent Metro oversight group, and assist in its documentation of Metro system incidents. == Application == The MetroHero application uses several interfaces, including an overall dashboard and a live map, to display data to its users. On the dashboard, system-wide train summary data, such as the number of operating trains and headway adherence, is visible. The map offers a visual representation of all trains' positions throughout the system, filtered by line. Individual stations and trains can be selected to see ratings and comments provided by other users, including both positive and negative notes like cleanliness and crowdedness. Additionally, a list of train wait times is given, along with aggregate data like average wait time. Any train delays or service incidents are visible in the app. MetroHero uses several data sources for the various components of its application. 
Train positions and other operational data are provided by WMATA as part of its initiative to release open data for third-party developers. However, MetroHero's developers noted that the Metro-provided information is sometimes inaccurate and incomplete, thereby limiting the accuracy of MetroHero. The app also collects crowdsourced data from its users, who can report conditions in train cars and stations and add to reports sent by other people. Additionally, MetroHero parses data from Twitter feeds to learn about system incidents, including delays and fires. In addition to the web app, Android app, and iOS app, MetroHero's initial developers maintained automated social media accounts that alerted customers about Metro service; these accounts were discontinued upon the original app's eventual shutdown. MetroHero also hosts archived performance data for later review, a feature that is sometimes used after major incidents. == Shutdown and future == In February 2023, James Pizzurro announced that MetroHero would be shut down on July 1, 2023, citing "positive changes ... in the app landscape and in WMATA's data management and communication" and the costs and time associated with maintaining the app. Shortly before the application's end date, the Pizzurros shared MetroHero's source code on GitHub, which prompted others to fork the code and begin maintaining new instances of MetroHero to succeed the original app. The original website went offline on July 1, as planned. Historically, WMATA has not offered its own real-time map or similar service, citing other apps from third parties which accomplished the same task. However, on June 30, 2023, Randy Clarke, WMATA's general manager, announced that Metro would begin offering a similar service as MetroHero did. The app, initially named MetroMeter, was planned to begin operating in early July and would provide real-time information on trains, headways, and service schedules. 
Metro also noted its intentions to extend this service to Metrobus and MetroAccess. On July 20, Metro announced that the app had been renamed to MetroPulse and launched it in beta. MetroHero's other project, ARIES for Transit, was not affected by the shutdown. == Reception == MetroHero was generally well-received and has been recognized for its usage among Washington-area commuters. DCist called it one of the "most praised" Metro tracking apps, and WMATA publicly acknowledged its popularity when announcing its decision to establish MetroPulse. Chris Barnes, a member of the Metro Riders' Advisory Council, said that the app is considered important among riders because it fulfills a need for riders to have reliable and transparent transit information, albeit somewhat hindered by flaws in WMATA's data.
Mastodon (social network)
Mastodon is a free and open-source software platform for decentralized social networking with microblogging features similar to Twitter. It operates as a federated network of independently managed servers that communicate using the ActivityPub protocol, allowing users to connect across different instances within the Fediverse. Each Mastodon instance establishes its own moderation policies and content guidelines, distinguishing it from centrally controlled social media platforms. First released in 2016 by Eugen Rochko, Mastodon has positioned itself as an alternative to mainstream social media, particularly for users seeking decentralized, community-driven spaces. The platform has experienced multiple surges in adoption, most notably following the Twitter acquisition by Elon Musk in 2022, as users sought alternatives to Twitter. It is part of a broader shift toward decentralized social networks, including Bluesky and Lemmy. Mastodon emphasizes user privacy and moderation flexibility, offering features such as granular post visibility controls, content warning options, and local community-driven moderation. The software is written in Ruby on Rails and Node.js, with a web interface built using React and Redux. It is interoperable with other ActivityPub-based platforms, such as Threads, and supports various third-party applications on desktop and mobile devices. == Functionality == Users post short-form status messages, historically known as "toots", for others to see and interact with. On a standard Mastodon instance, these messages can include up to 500 text-based characters, greater than Twitter's 280-character limit. Some instances support even longer messages. Images, audio files, videos or polls can also be added to a message. Users join a specific Mastodon server, rather than a single centralized website or application. 
The servers are connected as nodes in a network, and each server can administer its own rules, account privileges, and whether to share messages to and from other servers. Users can communicate and follow each other across connected Mastodon servers with usernames similar in format to full email addresses. Since version 2.9.0, Mastodon's web user interface has offered a single-column mode for new users by default. In advanced mode, the interface approximates the microblogging interface of TweetDeck. === Privacy === Mastodon includes a number of specific privacy features. Each message has a variety of privacy options available, and users can choose whether the message is public or private. Messages can be displayed publicly on a global feed, known as a timeline, or can be shared only with the user's followers. Messages can also be marked as unlisted, which keeps them off timelines, or as direct between users. Users can also mark their accounts as completely private. In the timeline, messages can display with an optional content warning feature, which requires readers to click on the hidden main body of the message to reveal it. Mastodon servers have used this feature to hide spoilers, trigger warnings, and not safe for work (NSFW) content, though some accounts use the feature to hide links and thoughts others might not want to read. Mastodon aggregates messages in local and federated timelines in real time. The local timeline shows messages from users on a single server, while the federated timeline shows messages across all participating Mastodon servers. === Content moderation === In early 2017, journalists such as Sarah Jeong distinguished Mastodon from Twitter for its approach to combating harassment. Mastodon uses community-based moderation, in which each server can limit or filter out undesirable types of content, while Twitter uses a single, global policy on content moderation. Servers can choose to limit or filter out messages with disparaging content. 
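The addressing and visibility rules described above can be made concrete with a short sketch. It is a simplified model and not Mastodon's own code: the `Handle` type, the helper functions, and the example server name are hypothetical, and the four visibility names are assumed to correspond to the public, unlisted, followers-only, and direct options mentioned in the text.

```python
from enum import Enum
from typing import NamedTuple


class Visibility(Enum):
    PUBLIC = "public"      # shown on timelines
    UNLISTED = "unlisted"  # viewable, but kept off timelines
    PRIVATE = "private"    # shared only with the author's followers
    DIRECT = "direct"      # shared only between the users involved


class Handle(NamedTuple):
    user: str
    server: str


def parse_handle(text: str) -> Handle:
    """Parse an email-like username such as '@alice@example.social'."""
    user, sep, server = text.lstrip("@").partition("@")
    if not (user and sep and server):
        raise ValueError(f"expected user@server, got {text!r}")
    return Handle(user, server)


def on_local_timeline(author: Handle, visibility: Visibility, this_server: str) -> bool:
    """A message reaches a server's local timeline only if it is public
    and its author is registered on that same server."""
    return visibility is Visibility.PUBLIC and author.server == this_server


alice = parse_handle("@alice@example.social")
print(alice)
print(on_local_timeline(alice, Visibility.PUBLIC, "example.social"))
print(on_local_timeline(alice, Visibility.UNLISTED, "example.social"))
```

Keeping the server name as part of the handle is what lets two users on different servers share the same local username without conflict.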
The founder of Mastodon, Eugen Rochko, believes that small, closely related communities deal with unwanted behavior more effectively than a large company's small safety team. In Move Slowly and Build Bridges, Robert W. Gehl argues that predominantly white participation has shaped Mastodon in ways that affect how reports of racism are received and limit its ability to replicate Black Twitter on Twitter. Users can also block and report others to administrators, much like on Twitter. Instance administrators can block other instances from interacting with their own, an action called defederation. By posting toots hashtagged with #fediblock, some instance administrators and users alert others of issues requiring moderation. === Searching === Mastodon by default allows searching for hashtags and mentioned accounts in the Fediverse. Server administrators can optionally enable Elasticsearch to search the full-text of public posts that have opted in to being indexed. == Versions == In September 2018, with the release of version 2.5 with redesigned public profile pages, Mastodon marked its 100th release. Mastodon 2.6 was released in October 2018, introducing the possibilities of verified profiles and live, in-stream link previews for images and videos. Version 2.7, in January 2019, made it possible to search for multiple hashtags at once, instead of searching for just a single hashtag, with more robust moderation capabilities for server administrators and moderators, while accessibility, such as contrast for users with sight issues, was improved. The ability for users to create and vote in polls, as well as a new invitation system to manage registrations was integrated in April 2019. Mastodon 2.8.1, released in May 2019, made images with content warnings blurred instead of completely hidden. In version 2.9 in June 2019, an optional single-column view was added. 
This view became the default displayed to new users, with a user "preferences" option to switch to a multiple-column view. In August 2020, Mastodon 3.2 was released. It included a redesigned audio player with custom thumbnails and the ability to add personal notes to one's profile. In July 2021, an official client for iOS devices was released. According to the project's then CEO, Eugen Rochko, the release was part of an effort to attract new users. Mastodon 4.0 was released in November 2022, adding support for translating posts, editing posts, and following hashtags. Mastodon 4.5 was released in November 2025. Among other features, it introduced quote posts, which had previously been rejected due to concerns about toxicity and harassment. To mitigate these issues, Mastodon's quote post feature lets users decide whether and by whom their posts can be quoted. == Software == Mastodon is published as free and open-source software under the Affero GPL license, allowing anyone to use the software or modify it as they wish. Servers can be run by any individual or organization, and users can join these servers as they wish. The server software itself is powered by Ruby on Rails and Node.js, with its web client written in React.js and Redux. The only database software supported is PostgreSQL, with Redis used for job processing and various actions that Mastodon needs to process. The service is interoperable with the fediverse, a collection of social networking services that use the ActivityPub protocol to communicate with each other; previous versions also supported OStatus. Client apps for interacting with the Mastodon API are available for desktop computer operating systems, including Windows, macOS and the Linux family of operating systems, as well as mobile phones running iOS and Android.
The API is open for anyone to use, allowing clients to be built for any operating system that can connect to the internet. === Integration with Fediverse === Mastodon uses the ActivityPub protocol for federation; this allows users to communicate between independent Mastodon instances and other ActivityPub-compatible services. Thus, Mastodon is generally considered to be a part of the Fediverse. Some services that use the ActivityPub protocol allow searching all posts on all instances, as long as users opt in. For similar reasons, only hashtags can appear in a Mastodon instance's trending topics, not arbitrary popular words. Trending topics vary between instances, since individual instances are aware of different subsets of posts from the whole fediverse. === Security concerns === While Mastodon's decentralized structure is one of its most distinctive features, it also poses additional security challenges. Since many Mastodon instances are run by volunteers, some security experts are concerned about data security and responsiveness to new threats and vulnerabilities across the network, considering the difficulty of configuring and maintaining an instance as well as uneven skill levels among administrators. Administrators of an instance also have access to the private information of any users that are either registered with that instance or have federated
LRE Map
The LRE Map (Language Resources and Evaluation) is a large, freely accessible database of resources dedicated to natural language processing. The original feature of the LRE Map is that its records are collected during the submission process of several major natural language processing conferences. The records are then cleaned and gathered into a global database called the "LRE Map". The LRE Map is intended to be an instrument for collecting information about language resources and to become, at the same time, a community for users, a place to share and discover resources, discuss opinions, provide feedback, discover new trends, etc. It is an instrument for discovering, searching and documenting language resources, here intended in a broad sense, as both data and tools. The large amount of information contained in the Map can be analyzed in many different ways. For instance, the LRE Map can provide information about the most frequent type of resource, the most represented language, the applications for which resources are used or are being developed, the proportion of new resources vs. already existing ones, or the way in which resources are distributed to the community. == Context == Several institutions worldwide maintain catalogues of language resources (ELRA, LDC, NICT Universal Catalogue, ACL Data and Code Repository, OLAC, LT World, etc.). However, it has been estimated that only 10% of existing resources are known, either through distribution catalogues or via direct publicity by providers (web sites and the like). The rest remains hidden, emerging only briefly when a resource is presented in a research paper or report at a conference. Even then, a resource may remain in the background simply because the focus of the research is not on the resource per se. == History == The LRE Map originated under the name "LREC Map" during the preparation of the LREC 2010 conference.
More specifically, the idea was discussed within the FLaReNet project, and in collaboration with ELRA and the Institute of Computational Linguistics of CNR in Pisa, the Map was put in place at LREC 2010. The LREC organizers asked the authors to provide some basic information about all the resources (in a broad sense, i.e. including tools, standards and evaluation packages), either used or created, described in their papers. All these descriptors were then gathered in a global matrix called the LREC Map. The same methodology and requirements for authors have since been applied and extended to other conferences, namely COLING-2010, EMNLP-2010, RANLP-2011, LREC 2012, LREC 2014 and LREC 2016. After this generalization to other conferences, the LREC Map was renamed the LRE Map. == Size and content == The size of the database increases over time. The data collected amount to 4776 entries. Each resource is described according to the following attributes: Resource type, e.g. lexicon, annotation tool, tagger/parser. Resource production status, e.g. newly created finished, existing-updated. Resource availability, e.g. freely available, from data center. Resource modality, e.g. speech, written, sign language. Resource use, e.g. named entity recognition, language identification, machine translation. Resource language, e.g. English, 23 European Union languages, official languages of India. == Uses == The LRE Map is a very important tool for charting the NLP field. Unlike other studies based on subjective scoring, the LRE Map is built from factual records. The Map has great potential for many uses, in addition to being an information-gathering tool: It is a great instrument for monitoring the evolution of the field (useful for funders), if applied in different contexts and times. It can be seen as a huge joint effort, the beginning of an even larger cooperative action not just among a few leaders but among all the researchers.
It is also an "educational" means towards the broad acknowledgment of the need for meta-research activities with the active involvement of many. It is also instrumental in introducing the new notion of "citation of resources" that could provide an award and a means of scholarly recognition for researchers engaged in resource creation. It is used to help organize conferences in the field, such as LREC. == Derived matrices == The data were then cleaned and sorted by Joseph Mariani (CNRS-LIMSI IMMI) and Gil Francopoulo (CNRS-LIMSI IMMI + Tagmatica) in order to compute the various matrices of the final FLaReNet reports. One of them, the matrix for written data at LREC 2010, shows that English is the most studied language. French and German come second, followed by Italian and Spanish. == Future == The LRE Map has been extended to the Language Resources and Evaluation Journal and to other conferences.
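As a rough illustration of the analyses this article mentions (the most frequent resource type, the most represented language), the following sketch counts attribute values over a handful of records. The records and the field names `type`, `language` and `availability` are invented for illustration and do not reproduce actual LRE Map data or its schema.

```python
from collections import Counter

# Hypothetical records; the field names mirror the attributes listed in the
# article, but the values are invented for illustration only.
records = [
    {"type": "lexicon", "language": "English", "availability": "freely available"},
    {"type": "tagger/parser", "language": "English", "availability": "freely available"},
    {"type": "lexicon", "language": "French", "availability": "from data center"},
    {"type": "annotation tool", "language": "German", "availability": "freely available"},
]


def most_common_value(rows, attribute):
    """Return the (value, count) pair that occurs most often for one attribute."""
    counts = Counter(row[attribute] for row in rows)
    return counts.most_common(1)[0]


print(most_common_value(records, "type"))      # ('lexicon', 2)
print(most_common_value(records, "language"))  # ('English', 2)
```

The same function can be applied to any other attribute, such as `availability`, to see how resources are distributed to the community.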
List of publications in data science
This is a list of publications in data science, generally organized by order of use in a data analysis workflow. See the list of publications in statistics for more research-based and fundamental publications; this list is more applied, business-oriented, and cross-disciplinary. General article inclusion criteria are: Papers from notable practitioners or notable professors, either with a Wikipedia page or reference to their notability; Common knowledge all data professionals should know, with references validating this claim; Highly cited applied statistics and machine learning publications; Discussion-facilitating papers on the field of data science as a whole (for example, the Attention Is All You Need paper is arguably a landmark paper that could be added here, but it is specific to generative artificial intelligence and not relevant to all data practitioners). Some reasons why a particular publication might be regarded as important: Topic creator – A publication that created a new topic; Breakthrough – A publication that changed scientific knowledge significantly; Influence – A publication which has significantly influenced the world or has had a massive impact on the teaching of data science. When possible, a reference is used to validate the inclusion of the publication in this list. == History == Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author) Author: Leo Breiman Publication data: Online version: https://projecteuclid.org/journals/statistical-science/volume-16/issue-3/Statistical-Modeling--The-Two-Cultures-with-comments-and-a/10.1214/ss/1009213726.pdf Description: Describes two cultures of statistics, one using a parsimonious and generative stochastic model, while the other uses an algorithmic model with no known mechanism for how the data is generated. Breiman argues that while statistics has traditionally favored the stochastic model, there is value in expanding the methods that statisticians can use to study phenomena.
Importance: Influenced the philosophies of statisticians just before the increased use of machine learning and deep learning methods. A 20-year retrospective on this article stated that "Breiman's words are perhaps more relevant than ever". Notable statisticians at the time wrote opinion pieces about the publication. Although critical of the publication overall, David Cox wrote that it "contains enough truth and exposes enough weaknesses to be thought-provoking." Bradley Efron commented that it is a "stimulating paper". Emanuel Parzen also commented that "Breiman alerts us to systematic blunders (leading to wrong conclusions) that have been committed applying current statistical practice of data modeling". Data Scientist: The Sexiest Job of the 21st Century Author: Thomas H. Davenport and DJ Patil Publication data: Online version: hbr.org/2022/07/is-data-scientist-still-the-sexiest-job-of-the-21st-century Description: Describes the then-new company role termed "data scientist": what data scientists do, how an organization might recruit one, and how to work with one effectively. Importance: This publication has influenced the data community. It was discussed near the time of its publication in 2012 by outlets such as IEEE Spectrum, and nearly a decade later it was still being cited in pieces asking the same question the title poses. In a retrospective response to their own publication 10 years earlier, authors Davenport and Patil reflected that the role of a data scientist has "become better institutionalized, the scope of the job has been redefined, the technology it relies on has made huge strides, and the importance of non-technical expertise, such as ethics and change management, has grown".
50 Years of Data Science Author: David Donoho Publication data: Online version: https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734 Description: Retrospective discussion paper on the history and origins of data science, with commentary from a number of notable statisticians. Importance: This has been described as "the first in the field to present such a comprehensive and in-depth survey and overview", and helps to define a field that has many definitions. The Composable Data Management System Manifesto Author: Pedro Pedreira, Orri Erling, Konstantinos Karanasos, Scott Schneider, Wes McKinney, Satya R Valluri, Mohamed Zait, Jacques Nadeau Publication data: Online version: https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf Description: A vision paper advocating for a paradigm shift in how data management systems are designed, using standard, composable, interoperable tools rather than siloed software tools. Importance: A paradigm-shifting view on how future data science software tools should be designed for more efficient workflows, the principles of which "will be especially crucial for addressing fragmentation, improving interoperability, and promoting user-centricity as data ecosystems grow increasingly complex". == Data collection and organization == Tidy Data Author: Hadley Wickham Publication data: Online version: https://www.jstatsoft.org/article/view/v059i10/ https://vita.had.co.nz/papers/tidy-data.pdf Description: Describes a framework for data cleaning that is summarized in the quote, "each variable is a column, each observation is a row, and each type of observational unit is a table". This provides a standard data structure around which data analysis tools can be consistently built. Importance: Cited over 1,500 times, this effort for tidy data has been described by David Donoho as having "more impact on today's practice of data analysis than many highly regarded theoretical statistics articles".
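The tidy-data principle quoted above ("each variable is a column, each observation is a row") can be illustrated with a small sketch. The table contents, the column names and the `to_tidy` helper are hypothetical and written in plain Python for illustration; they are not taken from Wickham's paper or from any particular library.

```python
# Hypothetical "wide" table: one row per patient, one column per treatment.
wide = [
    {"patient": "A", "treatment_a": 12, "treatment_b": 7},
    {"patient": "B", "treatment_a": 9, "treatment_b": 4},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows so each output row holds exactly one observation."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            tidy_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy_rows


tidy = to_tidy(wide, "patient", "treatment", "result")
for row in tidy:
    print(row)
```

In the reshaped form, `patient`, `treatment` and `result` are each a column and every row is a single measurement, so a tool can filter or group by `treatment` without knowing the original column layout.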
In the context of data visualization, this publication is said to support "efficient exploration and prototyping because variables can be assigned different roles in the plot without modifying anything about the original dataset". Data Organization in Spreadsheets Author: Karl W. Broman and Kara H. Woo Publication data: Online version: https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989 Description: This article offers practical recommendations for organizing data in spreadsheets, like Microsoft Excel and Google Sheets, to reduce errors and lower the barrier for later analyses due to limitations in spreadsheets or quirks in the software. Importance: Influences the teaching of both data and non-data practitioners to create more analysis-friendly spreadsheets, and has been described as outlining "spreadsheet best practices". == Data visualizations == Quantitative Graphics in Statistics: A Brief History Author: James R. Beniger and Dorothy L. Robyn Publication data: Online version: https://www.jstor.org/stable/2683467 Description: Outlines the history and evolution of quantitative graphics in statistics, going through spatial organization (17th and 18th centuries), discrete comparison (18th and 19th centuries), continuous distribution (19th century), and multivariate distribution and correlation (late 19th and 20th centuries). Importance: Helps data practitioners who are still learning put into perspective how recent commonly used graphics are. In the later publication "Graphical Methods in Statistics" (1979), Stephen Fienberg writes that his work "owes much to the work of Beniger and Robyn". == Practice == Data Science for Business Author: Foster Provost and Tom Fawcett Publication data: Online version: N/A Description: Broadly outlines principles of data science and data-analytic thinking for businesses.
Importance: Cited over 3,000 times, it is "highly recommended for students" and is also recommended for its "relevance to senior management leaders who want to build and lead a team of data scientists and implement data science in solving complex business problems". == Tooling == Hidden Technical Debt in Machine Learning Systems Author: D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-François Crespo, Dan Dennison Publication data: Online version: https://proceedings.neurips.cc/paper_files/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf Description: This paper argues that it is "dangerous to think of [complex machine learning] quick wins as coming for free" and gives an overview of risk factors to account for when implementing a machine learning system. Importance: All authors worked for Google. The article has been cited over 2,000 times and has helped practitioners who might otherwise quickly implement a machine learning tool without understanding its long-term maintenance. A few useful things to know about machine learning Author: Pedro Domingos Publication data: Online version: https://dl.acm.org/doi/10.1145/2347736.2347755 https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf Description: The purpose of this paper is to distill inaccessible "folk knowledge" to effectively implement machine learning projects because "machin