Transdermal optical imaging

Transdermal optical imaging

Transdermal optical imaging, also known as transdermal optical imagery or TOI, is a method of detecting blood flow of the face by measuring hemoglobin concentration using a digital video camera. Because of the translucent property of skin, light can travel beneath the skin and re-emit. The re-emitted light from underneath the skin is affected by chromophores, mainly hemoglobin and melanin, which differ in color. The color difference allows TOI machine learning software to separate the images into layers, which are known as bitplanes. It extracts signals rich in hemoglobin and signals rich in melanin, then discards the melanin-rich signals to obtain a recording of hemoglobin changes under the skin. Transdermal optical imaging has been proposed as an alternative to cuff-based methods of measuring blood pressure because it is able to measure heart rate accurately in a "contactless and non-invasive" way. Transdermal optical imaging may be able to detect hidden emotions using the patterns of blood flow in the face.

Open Syllabus Project

The Open Syllabus Project (OSP) is an online open-source platform that catalogs and analyzes millions of college syllabi. Founded by researchers from the American Assembly at Columbia University, the OSP has amassed the most extensive collection of searchable syllabi. Since its beta launch in 2016, the OSP has collected over 7 million course syllabi from over 80 countries, primarily by scraping publicly accessible university websites. The project is directed by Joe Karaganis. == History == The OSP was formed by a group of data scientists, sociologists, and digital-humanities researchers at the American Assembly, a public-policy institute based at Columbia University. The OSP was partly funded by the Sloan Foundation and the Arcadia Fund. Joe Karaganis, former vice-president of the American Assembly, serves as the project director of the OSP. The project builds on prior attempts to archive syllabi, such as H-Net, MIT OpenCourseWare, and historian Dan Cohen's defunct Syllabus Finder website (Cohen now sits on the OSP's advisory board). The OSP became a non-profit and independent of the American Assembly in November 2019. In January 2016, the OSP launched a beta version of their "Syllabus Explorer," which they had collected data for since 2013. The Syllabus Explorer allows users to browse and search texts from over one million college course syllabi. The OSP launched a more comprehensive version 2.0 of the Syllabus Explorer in July 2019. The newer version includes an interactive visualization that displays texts as dots on a knowledge map. As of 2022, the OSP has collected over 7 million course syllabi. The Syllabus Explorer represents the "largest collection of searchable syllabi ever amassed." == Methodology == The OSP has collected syllabi data from over 80 countries dating to 2000. The syllabi stem from over 4,000 worldwide institutions. Most of the OSP's data originates from the United States. Canada, Australia, and the U.K also have large datasets. The OSP primarily collects syllabi by scraping publicly accessible university websites. The OSP also allows syllabi submissions from faculty, students, and administrators. The OSP developers use machine learning and natural language processing to extract metadata from such syllabi. Since only metadata is collected, no individual syllabus or personal identifying information is found in the OSP database. The OSP classifies the syllabi into 62 subject fields – corresponding to the U.S. Department of Education's Classification of Instructional Programs (CIP). Additionally, the OSP assigns each text a "teaching score" from 0–100. This score represents the text's percentile rank among citations in the total citation count and is a numerical indicator of the relative frequency of which a particular work is taught. The OSP also has data on which texts are most likely to be assigned together. The developers behind the OSP admit that the database is incomplete and likely contains "a fair number of errors." Karaganis estimates that 80–100 million syllabi exist in the United States alone. The OSP is unable to access syllabi behind private course-management software like Blackboard. == Notable findings == === Anthropology === Using data from the OSP, anthropologist Laurence Ralph uncovered that black anthropologists are "woefully under-represented in (if not erased from) most anthropology syllabi." Black authors wrote less than 1 percent of the top 1,000 assigned works. === Economics === The database indicates Greg Mankiw is the most frequently cited author for college economics courses. === English literature === The OSP found that Mary Shelley's Frankenstein was the most widely taught novel in college courses. Additionally, the majority of novels published after 1945 taught in English classes were historical fiction. === Female writers === The most read female writer on college campuses is Kate L. Turabian for her A Manual for Writers of Research Papers, Theses, and Dissertations . Turabian is followed by Diana Hacker, Toni Morrison, Jane Austen, and Virginia Woolf. === Film === The most assigned film according to the OSP is the 1929 Soviet documentary film, Man with a Movie Camera. English filmmaker Alfred Hitchcock is the most assigned director in college courses. === History === Historians George Brown Tindall and David Emory Shi's America: A Narrative History is the number one assigned textbook for history, followed by Anne Moody's memoir, Coming of Age in Mississippi. === Philosophy === The most assigned texts in the field of philosophy include Aristotle's Nicomachean Ethics, John Stuart Mill's Utilitarianism, and Plato's Republic. Plato's Republic was also the second most assigned text in universities in the English-speaking world (only behind Strunk and White's Elements of Style). === Physics === David Halliday's et al. Fundamentals of Physics is the number one ranked physics textbook in the OSP's database. === Political science === Data from the OSP indicates that the dominant political science texts are written almost exclusively by white men and scholars based in the West. In the top 200 most-frequently assigned works, 15 are authored by at least one woman. === Public administration === American president Woodrow Wilson's article "The Study of Administration" was the most frequently assigned text in public affairs and administration syllabi. == Reception == According to William Germano et al., the OSP is a "fascinating resource but is also prone to misrepresenting or at least distracting us from the most important business of a syllabus: communicating with students." Historian William Caferro remarks that the OSP is a "tacit experience of sharing, but a useful one." English professor Bart Beaty writes that, "Despite the many reservations about the completeness of its data, the OSP provides a rare opportunity for scholars to move beyond the anecdotal in discussions of canon-formation in teaching." Media theorist Elizabeth Losh opines that "big data approaches", like the OSP, may "raise troubling questions for instructors about informed consent, pedagogical privacy, and quantified metrics."

Collabora Online

Collabora Online (often abbreviated as COOL) is an open-source online office suite developed by Collabora, based on LibreOffice Online, the web-based edition of the LibreOffice office suite. It enables real-time collaborative editing of documents, spreadsheets, presentations, and vector graphics in a web browser. Optional applications are available for offline use on Android, ChromeOS, iOS, iPadOS, Linux distributions, macOS, and Windows. It supports the OpenDocument format and is compatible with other major formats, including those used by Microsoft Office. The Document Foundation (TDF), the nonprofit organization behind LibreOffice, states that a majority of the LibreOffice software development is done by its partners like Collabora. Collabora Online is an open-source alternative to proprietary cloud office platforms such as Google Workspace and Microsoft 365. Unlike these services, it can be self-hosted or hosted by third-party providers. The platform is marketed particularly toward enterprises and public institutions seeking greater digital sovereignty and independence from U.S.-based "big tech" companies. Collabora also develops Collabora Office, a standalone desktop and mobile app suite based on LibreOffice. Although Collabora Online has increasingly taken on a central role, both products may be used in parallel, similar to Microsoft Office and Microsoft 365. In November 2025, Collabora released Collabora Office Desktop and renamed the previous product Collabora Office Classic. The new product shares code with Collabora Online and brings the same user interface to the desktop on Linux, Windows and MacOS. A separate version, the Collabora Online Development Edition (CODE), is offered free of charge and is recommended for individuals, small teams, and developers. CODE provides early access to new features and serves as a testing and development platform for open-source community contributors. As TDF does not offer a free version of LibreOffice Online, CODE represents the primary freely available option for organizations and individuals interested in deploying LibreOffice in a web-based, collaborative setting. == Applications == Collabora Online includes several applications for document editing, available through the web-based interface and optional desktop and mobile apps: Collabora Writer – A word processor based on LibreOffice Writer, comparable to Microsoft Word and Google Docs. It supports WYSIWYG editing, styles, formatting tools, comment threads, and change tracking. Collabora Calc – A spreadsheet editor based on LibreOffice Calc, similar to Microsoft Excel and Google Sheets. Features include pivot tables, formulas, data validation, conditional formatting, advanced sorting and filtering, charts, and support for up to 16,000 columns. Compatible with some macros written in VBA. Collabora Impress – A presentation program based on LibreOffice Impress, comparable to Microsoft PowerPoint and Google Slides. It supports master slides, transitions, speaker notes, and multimedia elements. Collabora Draw is not a separate application, most of the functionality of the Draw application is now integrated in Writer and Impress – vector graphics editor based on LibreOffice Draw, comparable to Microsoft Visio and Google Drawings. == Features == Collabora Online can be accessed from modern web browsers without the need for plug-ins or add-ons. It supports real-time collaborative editing of word processing documents, spreadsheets, presentations, and vector graphics. Collaboration features include commenting, version tracking with document comparison and restoration, and integration with communication tools such as chat or video calls. These functions are often enabled through integration with enterprise open-source cloud platforms like Nextcloud, ownCloud, Seafile, EGroupware, GroupOffice and others. Collabora Online can also be embedded or integrated into a variety of third-party applications. Although client apps are not required to use the web-based suite, optional applications are available for offline use on Android, ChromeOS, iOS, iPadOS, Linux distributions, macOS, and Windows. These apps share the same LibreOffice-based core as the server version, ensuring document compatibility across platforms. Development of the LibreOffice core benefits both the online server and the client applications simultaneously. The mobile apps offer touch-optimized interfaces that adapt to different screen sizes and can be used offline, with optional integration into cloud storage services. Collabora Online supports OpenDocument formats (ODF; .odt, .odp, .ods, .odg) in accordance with ISO/IEC 26300. It is also compatible with Microsoft Office formats, including Office Open XML (.docx, .pptx, .xlsx) and legacy binary formats (.doc, .ppt, .xls). Additional supported formats include PDF, PNG, CSV, TSV, RTF, EPUB, and others. The suite can import a range of formats supported by LibreOffice, including Microsoft Visio and Publisher files, Apple Keynote, Numbers, and Pages files, as well as legacy formats used by Lotus 1-2-3, Microsoft Works, and Quattro Pro. The core of Collabora Online is written in C++ and utilizes LibreOfficeKit, a programming interface that enables reuse of much of LibreOffice's existing code for document saving, loading, and rendering. Collabora Online operates on the principle that documents remain on the server, with users viewing tile-rendered images of the document and sending their edits back to the server. The user interface is implemented in JavaScript. For file access and authentication with file hosting services, Collabora Online uses Microsoft's WOPI protocol, allowing compatibility with any service supporting Microsoft 365 integration. == Server == The server component can be self-hosted or deployed through third-party enterprise open-source cloud platforms, allowing organizations to maintain control over data and infrastructure. It is available for various Linux distributions and as a Docker image. The server enables features such as in-browser document editing, file synchronization, and real-time communication. These third-party cloud platforms typically offer additional functionality comparable to services such as Dropbox, Google Workspace, Microsoft 365, or Zoom, including file sharing, calendars, email, contacts, chat, and video conferencing. Collabora Online can be integrated into these applications, as well as with other services such as learning management systems and enterprise content platforms, through open APIs and an SDK. == Reception == Various online and print publications have discussed Collabora Online. In December 2016 the technology website Softpedia mentioned the availability of collaborative editing in version 2.0 and the integration with ownCloud, Nextcloud, and other file synchronization and sharing solutions. In June 2020, ZDNET reported that Collabora Online would be included as the standard office suite in Nextcloud version 19, noting that direct document editing was added to the native video conferencing software Talk. The technology blog OMG! Ubuntu! covered the release of Collabora's Android and iOS apps, emphasizing their offline functionality. In September 2020, Linux Magazine compared Collabora Online with OnlyOffice, noting the flexibility and platform independence of both tools and highlighting Collabora's extensive feature set derived from LibreOffice. === Digital sovereignty === Collabora Online's open-source design and support for self-hosting have made it notable in discussions about digital sovereignty—the ability of users and organizations to control their own data. This is particularly relevant in Europe, where concerns about dependence on U.S.-based "big tech" companies and data privacy have grown in recent years. On 10th June 2025, Microsoft executives under oath in the French Senate admitted that they cannot guarantee data sovereignty and would be compelled to pass French (and by implication the wider European Union) information to the US administration if requested via a warrant or subpoena. The Cloud Act is a law that gives the US government authority to obtain digital data held by US-based tech corporations, irrespective of whether that data is stored on servers at home or on foreign soil. A 2020 briefing by the European Parliament highlighted risks associated with reliance on major technology companies that collect and exploit user data. Legal decisions such as the Schrems II ruling have further underscored these concerns. Several European government agencies have adopted private cloud solutions using Collabora Online and related platforms to enhance data security and maintain control over sensitive information. == History == The former LibreOffice development team from SUSE joined Collabora in September 2013, forming the subsidiary Collabora Productivity. In 2015 Collabora and IceWarp announced the development of an enterprise-ready version of LibreOffice Online to compete wi

CEITON

CEITON is a web-based software system for facilitating and automating business processes such as planning, scheduling, and payroll using workflow technologies. The system is used by several media companies such as MDR, Yle, RAI and Red Bull Media House. In December 2018, the first CEITON User Group Meeting took place in Leipzig, Germany. == Architecture == The software runs on a server (on premises) or in the cloud and is scalable on parallel servers. Data security is warranted by role-based access control (RBAC). The software is used via web-browsers and not dependent on particular system software. == Structure and Features == CEITON combines the two classical approaches of production planning and control and workflow management. === Project Management === The scheduling system plans, manages, bills, and analyzes projects or tasks. It manages human and technical resources, material, and locations on a single GUI. The system uses a gantt chart to assign tasks to be done to available and eligible resources (i.e. staff), automatically or by drag-and-drop. The scheduling module includes material management, resource management/ human resource management, integration of freelancers, clients and suppliers, long-term budget planning, time-tracking, shift scheduling, quality management, delivery and logistics, document management, archive, analysis and controlling, business reporting, as well as all accounting and documentation processes. === Workflow === The workflow management system module coordinates business processes. Processes are defined once as a workflow and then repeatedly executed. Human resources are automatically assigned to steps (tasks) and integrated in workflow forms. Systems are integrated with an EAI/SOAP module, allowing data exchange with arbitrary external systems which are also involved in the business process. It also features a 3-D workflow overview in which the status of each project step can be determined by its color in the overview. === Process Management === For project and order processing management, business processes are designed as workflows, and coordinate communication automatically. Different user interfaces for staff, customers or suppliers can be created so each gets only relevant information. Different workflow forms are associated with different log-ins. The main application for the system is knowledge-based business processes, in which many people are involved and virtual results are produced, e.g. in research, or development of media products, such as TV and movies. Broadcasters and media companies such as MDR and Yle use CEITON to control their production processes for products and services and coordinate complex workflows with all kinds of resources. === Integrations === An integrated EAI module allows CEITON to integrate every external system in any business process without programming, using SOAP and similar technologies. Aspera and FileCatalyst were integrated for faster data transfer, yet complex ERP systems and numerous SAP modules have also been integrated, for example, to extract working times to payroll. === Mobile Working === Since Version 7, released in 2015, CEITON includes a time-tracking module allowing employees to enter their times from mobile devices such as tablets running Android, iPhones etc. == History == Ceiton Technologies (SME tech firm), the company developing CEITON, was founded in Leipzig, Germany in 2000, staffing solutions for the Bureau of Internal Revenue in Manila, Philippines, were implemented in 2000 together with the Deutsche Gesellschaft für Technische Zusammenarbeit of the German government. The first version (1.0) of the software was released in July 2001. The product was originally developed for German broadcasting companies. CEITON is named after the Japanese concept Seiton, one of the principles of Japanese workplace design methodology known as 5S. Since version 7, released in 2015, CEITON includes a time-tracking module allowing employees to enter their times from mobile devices such as tablets running Android, iPhones etc. In May 2005 CEITON won the IQ innovation award, sponsored by Siemens, in the category Excellent innovation in the IT-sector. Since 2007, CEITON has been present at the broadcast trade fairs NAB in Las Vegas and IBC in Amsterdam. In 2020, the company celebrated its 20th anniversary.

ISPConfig

ISPConfig is an open source hosting control panel for Linux, licensed under BSD license and developed by the company ISPConfig UG. The ISPConfig project was started in autumn 2005 by Till Brehm from the German company projektfarm GmbH. == Overview == Using the dashboard, administrators have the ability to manage websites, email addresses, MySQL and MariaDB as well as PostgreSQL (since version 3.3) databases, FTP accounts, Shell accounts and DNS records through a web-based interface. The software has 4 login levels: administrator, reseller, client, and email-user, each with a different set of permissions. == Operating Systems == ISPConfig is only available on Linux, with CentOS, Debian, and Ubuntu being among the supported distributions. == Features == The following services and features are supported: Management of a single or multiple servers from one control panel. Web server management for Apache HTTP Server and Nginx. Mail server management (with virtual mail users) with spam and antivirus filter using Postfix (software) and Dovecot (software). DNS server management (BIND, Powerdns). Configuration mirroring and clusters. Administrator, reseller, client and mail-user login. Virtual server management for OpenVZ Servers. Website statistics using Webalizer and AWStats

Serverless computing

Serverless computing is "a cloud service category where the customer can use different cloud capability types without the customer having to provision, deploy and manage either hardware or software resources, other than providing customer application code or providing customer data. Serverless computing represents a form of virtualized computing", according to ISO/IEC 22123-2. Serverless computing is a broad ecosystem that includes the cloud provider, function as a service (FaaS), managed services, tools, frameworks, engineers, stakeholders, and other interconnected elements. == Overview == Serverless is a misnomer in the sense that servers are still used by cloud service providers to execute code for developers. The definition of serverless computing has evolved over time, leading to varied interpretations. According to Ben Kehoe, serverless represents a spectrum rather than a rigid definition. Emphasis should shift from strict definitions and specific technologies to adopting a serverless mindset, focusing on leveraging serverless solutions to address business challenges. Serverless computing does not eliminate complexity but shifts much of it from the operations team to the development team. However, this shift is not absolute, as operations teams continue to manage aspects such as identity and access management (IAM), networking, security policies, and cost optimization. Additionally, while breaking down applications into finer-grained components can increase management complexity, the relationship between granularity and management difficulty is not strictly linear. There is often an optimal level of modularization where the benefits outweigh the added management overhead. According to Yan Cui, serverless techniques should be adopted only when they help to deliver customer value faster. And while adopting, organizations should take small steps and de-risk along the way. == Challenges == Serverless applications are prone to fallacies of distributed computing. In addition, they are prone to the following fallacies: Versioning is simple Compensating transactions always work Observability is optional === Monitoring and debugging === Monitoring and debugging serverless applications can present unique challenges due to their distributed, event-driven nature and proprietary environments. Traditional tools may fall short, making it difficult to track execution flows across services. However, modern solutions such as distributed tracing tools (e.g., AWS X-Ray, Datadog), centralized logging, and cloud-agnostic observability platforms are mitigating these challenges. Emerging technologies like OpenTelemetry, AI-powered anomaly detection, and serverless-specific frameworks are further improving visibility and root cause analysis. While challenges persist, advancements in monitoring and debugging tools are steadily addressing these limitations. === Security === According to OWASP, serverless applications are vulnerable to variations of traditional attacks, insecure code, and some serverless-specific attacks (like denial of wallet). So, the risks have changed and attack prevention requires a shift in mindset. === Vendor lock-in === Serverless computing is provided as a third-party service. Applications and software that run in the serverless environment are by default locked to a specific cloud vendor. This issue is exacerbated in serverless computing, as with its increased level of abstraction, public vendors only allow customers to upload code to a FaaS platform without the authority to configure underlying environments. More importantly, when considering a more complex workflow that includes backend-as-a-service (BaaS), a BaaS offering can typically only natively trigger a FaaS offering from the same provider. This makes the workload migration in serverless computing virtually impossible. Therefore, considering how to design and deploy serverless workflows from a multi-cloud perspective could mitigate this. == High-performance computing == Serverless computing may not be ideal for certain high-performance computing (HPC) workloads due to resource limits often imposed by cloud providers, including maximum memory, CPU, and runtime restrictions. For workloads requiring sustained or predictable resource usage, bulk-provisioned servers can sometimes be more cost-effective than the pay-per-use model typical of serverless platforms. However, serverless computing is increasingly capable of supporting specific HPC workloads, particularly those that are highly parallelizable and event-driven, by leveraging its scalability and elasticity. The suitability of serverless computing for HPC continues to evolve with advancements in cloud technologies. == Anti-patterns == The grain of sand anti-pattern refers to the creation of excessively small components (e.g., functions) within a system, often resulting in increased complexity, operational overhead, and performance inefficiencies. Lambda pinball is a related anti-pattern that can occur in serverless architectures when functions (e.g., AWS Lambda, Azure functions) excessively invoke each other in fragmented chains, leading to latency, debugging and testing challenges, and reduced observability. These anti-patterns are associated with the formation of a distributed monolith. These anti-patterns are often addressed through the application of clear domain boundaries, which distinguish between public and published interfaces. Public interfaces are technically accessible interfaces, such as methods, classes, API endpoints, or triggers, but they do not come with formal stability guarantees. In contrast, published interfaces involve an explicit stability contract, including formal versioning, thorough documentation, a defined deprecation policy, and often support for backward compatibility. Published interfaces may also require maintaining multiple versions simultaneously and adhering to formal deprecation processes when breaking changes are introduced. Fragmented chains of function calls are often observed in systems where serverless components (functions) interact with other resources in complex patterns, sometimes described as spaghetti architecture or a distributed monolith. In contrast, systems exhibiting clearer boundaries typically organize serverless components into cohesive groups, where internal public interfaces manage inter-component communication, and published interfaces define communication across group boundaries. This distinction highlights differences in stability guarantees and maintenance commitments, contributing to reduced dependency complexity. Additionally, patterns associated with excessive serverless function chaining are sometimes addressed through architectural strategies that emphasize native service integrations instead of individual functions, a concept referred to as the functionless mindset. However, this approach is noted to involve a steeper learning curve, and integration limitations may vary even within the same cloud vendor ecosystem. Reporting on serverless databases presents challenges, as retrieving data for a reporting service can either break the bounded contexts, reduce the timeliness of the data, or do both. This applies regardless of whether data is pulled directly from databases, retrieved via HTTP, or collected in batches. Mark Richards refers to this as the reach-in reporting anti-pattern. A possible alternative to this approach is for databases to asynchronously push the necessary data to the reporting service instead of the reporting service pulling it. While this method requires a separate contract between services and the reporting service and can be complex to implement, it helps preserve bounded contexts while maintaining a high level of data timeliness. == Principles == Adopting DevSecOps practices can help improve the use and security of serverless technologies. In serverless applications, the distinction between infrastructure and business logic is often blurred, with applications typically distributed across multiple services. To maximize the effectiveness of testing, integration testing is emphasized for serverless applications. Additionally, to facilitate debugging and implementation, orchestration is used within the bounded context, while choreography is employed between different bounded contexts. Ephemeral resources are typically kept together to maintain high cohesion. However, shared resources with long spin-up times, such as AWS RDS clusters and landing zones, are often managed in separate repositories, deployment pipeline, and stacks.

Apache Parquet

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem inspired by Google Dremel interactive ad-hoc query system for analysis of read-only nested data. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides data compression and encoding schemes with enhanced performance to handle complex data in bulk. == History == The open-source project to build Apache Parquet began as a joint effort between Twitter and Cloudera using the record shredding and assembly algorithm as described in Google's Dremel. Parquet was designed as an improvement on the Trevni columnar storage format created by Doug Cutting, the creator of Hadoop. The name 'parquet' (lit. 'small compartment') refers to a style of decorative flooring and was chosen to "evoke the bottom layer of a database with an interesting layout". The first version, Apache Parquet 1.0, was released in July 2013. Since April 27, 2015, Apache Parquet has been a top-level Apache Software Foundation (ASF)-sponsored project. == Features == Apache Parquet is implemented using the record-shredding and assembly algorithm, which accommodates the complex data structures that can be used to store data. The values in each column are stored in contiguous memory locations, providing the following benefits: Column-wise compression is efficient in storage space Encoding and compression techniques specific to the type of data in each column can be used Queries that fetch specific column values need not read the entire row, thus improving performance Apache Parquet is implemented using the Apache Thrift framework, which increases its flexibility; it can work with a number of programming languages like C++, Java, Python, PHP, etc. As of August 2015, Parquet supports the big-data-processing frameworks including Apache Hive, Apache Drill, Apache Impala, Apache Crunch, Apache Pig, Cascading, Presto and Apache Spark. It is one of the external data formats used by the pandas Python data manipulation and analysis library. == Compression and encoding == In Parquet, compression is performed column by column, which enables different encoding schemes to be used for text and integer data. This strategy also keeps the door open for newer and better encoding schemes to be implemented as they are invented. Parquet supports various compression formats: snappy, gzip, LZO, brotli, zstd, and LZ4. === Dictionary encoding === Parquet has an automatic dictionary encoding enabled dynamically for data with a small number of unique values (i.e. below 105) that enables significant compression and boosts processing speed. === Bit packing === Storage of integers is usually done with dedicated 32 or 64 bits per integer. For small integers, packing multiple integers into the same space makes storage more efficient. === Run-length encoding (RLE) === To optimize storage of multiple occurrences of the same value, run-length encoding is used, which is where a single value is stored once along with the number of occurrences. Parquet implements a hybrid of bit packing and RLE, in which the encoding switches based on which produces the best compression results. This strategy works well for certain types of integer data and combines well with dictionary encoding. == Cloud Storage and Data Lakes == Parquet is widely used as the underlying file format in modern cloud-based data lake architectures. Cloud storage systems such as Amazon S3, Azure Data Lake Storage, and Google Cloud Storage commonly store data in Parquet format due to its efficient columnar representation and retrieval capabilities. Data lakehouse frameworks—including Apache Iceberg, Delta Lake, and Apache Hudi —build an additional metadata layer on top of Parquet files to support features such as schema evolution, time-travel queries, and ACID-compliant transactions. In these architectures, Parquet files serve as the immutable storage layer while the table formats manage data versioning and transactional integrity. == Comparison == Apache Parquet is comparable to RCFile and Optimized Row Columnar (ORC) file formats — all three fall under the category of columnar data storage within the Hadoop ecosystem. They all have better compression and encoding with improved read performance at the cost of slower writes. In addition to these features, Apache Parquet supports limited schema evolution, i.e., the schema can be modified according to the changes in the data. It also provides the ability to add new columns and merge schemas that do not conflict. Apache Arrow is designed as an in-memory complement to on-disk columnar formats like Parquet and ORC. The Arrow and Parquet projects include libraries that allow for reading and writing between the two formats. == Implementations == Known implementations of Parquet include: