HDCRS IEEE GRSS Working Group: Research Topics
Supercomputing and Distributed Computing
High Performance Computing
High performance computing (HPC) is a multidisciplinary field combining hardware technologies and architecture, operating systems, programming tools, software, and end-user problems and algorithms. It engages a class of electronic digital machines referred to as “supercomputers” to perform a wide array of computational problems or “applications” (alternatively “workloads”) as fast as possible. The action of performing an application on a supercomputer is widely termed “supercomputing” and is synonymous with HPC.
The purpose of HPC is to derive answers to questions that cannot be adequately addressed by empiricism or theory alone, or even by widely available commercial computers (e.g., laptops, desktop computers, enterprise servers). Historically, supercomputers have been applied to science and engineering, and the methodology has been described as the “third pillar of science” alongside and complementing both experimentation (empiricism) and mathematics (theory). But the range of problems that supercomputers can tackle extends far beyond classical scientific and engineering studies to include challenges in socioeconomics, big-data management and learning, process control, and national security.
Performance and Metrics
There is no single measure of performance that fully reflects all aspects of the quality of computer operation. A “metric” is a quantifiable, observable operational parameter of a supercomputer. Multiple perspectives and related metrics are routinely applied to characterize the behavioral properties and capabilities of an HPC system. Two basic measures are employed individually or in combination and in differing contexts to formulate the values used to represent the quality of a supercomputer. These two fundamental measures are “time” and “number of operations” performed, both under prescribed conditions.
For HPC the most widely used metric is “floating-point operations per second” or “flops”. A floating-point operation is an addition or multiplication of two real (or floating-point) numbers represented in some machine-readable and manipulatable form. Because supercomputers are so “powerful”, describing their capability would require phrases like “a trillion or quadrillion operations per second”. The field therefore adopts the same system of notation as science and engineering, using the Greek prefixes kilo, mega, giga, tera, and peta to represent 1000, 1 million, 1 billion, 1 trillion, and 1 quadrillion, respectively. The first supercomputers barely achieved 1 kiloflops (Kflops). Today’s fastest supercomputer (Fig. 1: Fugaku) exhibits a peak performance on the order of 450 petaflops.
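The relationship between the two basic measures (operations and time) and the prefixed flops notation can be sketched in a few lines of Python. This is only an illustration: the function names `flops_rate` and `format_flops` are hypothetical, and the one-decimal rounding is an arbitrary choice.

```python
# SI prefixes named in the text, ordered from largest to smallest.
PREFIXES = [
    ("peta", 1e15),
    ("tera", 1e12),
    ("giga", 1e9),
    ("mega", 1e6),
    ("kilo", 1e3),
]


def flops_rate(operations: float, seconds: float) -> float:
    """Return floating-point operations per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return operations / seconds


def format_flops(rate: float) -> str:
    """Express a flops rate using the largest prefix that keeps the value >= 1."""
    for name, factor in PREFIXES:
        if rate >= factor:
            return f"{rate / factor:.1f} {name}flops"
    return f"{rate:.1f} flops"


if __name__ == "__main__":
    # 2e12 operations completed in 4 seconds.
    print(format_flops(flops_rate(2.0e12, 4.0)))
    # The peak figure quoted in the text, written out in plain flops.
    print(format_flops(450e15))
```

Keeping the rate computation separate from the formatting makes it easy to compare measured and peak figures in the same units.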
The true capability of a supercomputer is its ability to perform real work, to achieve useful results toward an end goal such as simulating a particular physical phenomenon (e.g., colliding neutron stars to determine resulting electromagnetic burst signatures). A better measure than flops is how long a given problem takes to complete. But because there are thousands, perhaps millions, of such problems, this measure is not broadly useful on its own. Thus the HPC community selects specific problems around which to standardize. Such standardized application programs are “benchmarks”.
One particularly widely used supercomputer benchmark is “Linpack” or, more precisely, the “High-Performance Linpack” (HPL), which solves a set of linear equations in dense matrix form. A benchmark gives a means of comparative evaluation between two independent systems by measuring their respective times to perform the same calculation. Thus a second way to measure performance is time to completion of a fixed problem. The HPC community has selected HPL as a means of ranking supercomputers, as represented by the “TOP500 list” begun in 1993 (Fig. 2). But other benchmarks are also employed to stress certain aspects of a supercomputer or represent a certain class of programs.
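The idea of timing a fixed dense linear solve can be illustrated with a toy, single-threaded sketch in pure Python. It is not HPL: the solver is plain Gaussian elimination with partial pivoting, the function names are hypothetical, and the operation count of (2/3)n^3 + 2n^2 is a commonly quoted estimate for LU-based solves that is assumed here rather than taken from the HPL source.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    a = [row[:] for row in a]  # work on copies so the inputs stay intact
    b = b[:]
    for k in range(n):
        # Partial pivoting: use the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
            b[i] -= factor * b[k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def benchmark(n, seed=0):
    """Time one solve of a random n x n system; return (seconds, flops, residual)."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = time.perf_counter() - start
    # Largest absolute residual |a x - b|, used to check the answer is valid.
    residual = max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n)
    )
    operations = (2.0 / 3.0) * n**3 + 2.0 * n**2  # assumed operation count
    return elapsed, operations / elapsed, residual


if __name__ == "__main__":
    seconds, rate, residual = benchmark(120)
    print(f"time: {seconds:.4f} s, rate: {rate:.3e} flops, residual: {residual:.2e}")
```

Running the same `benchmark(n)` on two machines gives both measures discussed above: time to completion of a fixed problem and a derived flops rate.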
Cloud computing offers flexible dynamic IT infrastructures, quality of service (QoS) guaranteed computing environments and configurable software services. According to the definition of the National Institute of Standards and Technology (NIST), cloud computing is regarded as “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction”. Cloud computing has five essential characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service. Based on the definition of NIST, cloud computing has three service models and four deployment models.
Cloud Service Model
Cloud computing employs a service-driven model and the services of clouds can be grouped into three categories:
(1) Infrastructure as a Service (IaaS): adopts virtualization technologies to provide consumers with on-demand provisioning of infrastructural resources (e.g., networks, storage, virtual servers) on a pay-as-you-go basis. IaaS helps consumers avoid the expense and complexity of buying and managing physical servers and other datacenter infrastructure. Consumers are able to quickly scale infrastructural resources up and down on demand and pay only for what they use.
(2) Platform as a Service (PaaS): cloud providers manage and deliver a broad collection of middleware services (including development tools, libraries and database management systems). Consumers adopt PaaS to create and deploy applications without the expense and complexity of buying and managing software licenses, the underlying application infrastructure and middleware, or the development tools and other resources.
(3) Software as a Service (SaaS): a model for the distribution of software where customers access software over the Internet on a pay-as-you-go basis. Normally, consumers access software using a thin client via a web browser.
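The pay-as-you-go idea that runs through these service models can be made concrete with a small arithmetic sketch. Everything here is hypothetical: the function names, the hourly demand profile, and the price are illustrative assumptions, and the fixed-capacity comparison simply charges the same hourly price for peak capacity in every hour.

```python
def pay_as_you_go_cost(servers_per_hour, price_per_server_hour):
    """Cost when capacity follows demand and only used server-hours are billed."""
    return sum(servers_per_hour) * price_per_server_hour


def fixed_capacity_cost(servers_per_hour, price_per_server_hour):
    """Cost when peak capacity is provisioned for every hour (assumed same price)."""
    peak = max(servers_per_hour)
    return peak * len(servers_per_hour) * price_per_server_hour


if __name__ == "__main__":
    demand = [2, 2, 10, 2]  # hypothetical number of servers needed in each hour
    price = 0.5             # hypothetical price per server-hour
    print("pay-as-you-go:", pay_as_you_go_cost(demand, price))
    print("fixed at peak:", fixed_capacity_cost(demand, price))
```

The gap between the two figures grows with how bursty the demand is, which is the situation where scaling resources up and down on demand pays off.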
Cloud Deployment Models
A cloud deployment model describes how a cloud is set up to provide specific services based on the demands of its users. Deployment models differ in parameters such as storage size, accessibility, and proprietorship. There are four common cloud deployment models:
(1) Public Cloud: service providers offer their resources as services to the general public or a large industry group. To ensure the quality of cloud services, Service Level Agreements (SLAs) specify a number of requirements between a cloud service provider and a cloud service consumer. However, public clouds lack fine-grained control over data, network and security settings.
(2) Private Cloud: private clouds are designed for exclusive use by a particular institution, organization or enterprise. Compared with public clouds, private clouds offer the highest degree of control over the performance, reliability, and security of the services (applications, storage, and other resources) provided by service providers.
(3) Community Cloud: community clouds are built and operated specifically for a particular group whose members have similar cloud requirements (security, compliance, jurisdiction, etc.).
(4) Hybrid Cloud: Hybrid clouds are a combination of two or more clouds (public, private or community) to offer the benefits of multiple deployment models. Hybrid clouds offer more flexibility than both public and private clouds.
Security in the Cloud
A data security lifecycle includes five stages: create, store, use, archive, and destroy. In the create stage, data is created by a client or server in the cloud. In the store stage, generated or uploaded data is stored in the cloud across a number of machines. During the use stage, data is searched and extracted from the cloud environment. In the archive stage, rarely used data is moved to another place in the cloud. In the destroy stage, users with the required permissions can delete data.
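The five lifecycle stages can be modeled as a small state machine. The sketch below is illustrative only: the class and function names are hypothetical, and the table of allowed transitions is an assumption, since the text names the stages but does not specify which moves between them are legal.

```python
from enum import Enum


class Stage(Enum):
    CREATE = "create"
    STORE = "store"
    USE = "use"
    ARCHIVE = "archive"
    DESTROY = "destroy"


# Assumed transition table; the source text does not define one.
ALLOWED = {
    Stage.CREATE: {Stage.STORE},
    Stage.STORE: {Stage.USE, Stage.ARCHIVE, Stage.DESTROY},
    Stage.USE: {Stage.USE, Stage.ARCHIVE, Stage.DESTROY},
    Stage.ARCHIVE: {Stage.USE, Stage.DESTROY},
    Stage.DESTROY: set(),
}


class DataObject:
    """Tracks which lifecycle stage a piece of cloud data is in."""

    def __init__(self, name):
        self.name = name
        self.stage = Stage.CREATE
        self.history = [Stage.CREATE]

    def advance(self, new_stage, may_delete=False):
        """Move to new_stage if the transition is allowed."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot go from {self.stage.value} to {new_stage.value}"
            )
        # Deletion requires an explicit permission, as in the destroy stage.
        if new_stage is Stage.DESTROY and not may_delete:
            raise PermissionError("delete permission required")
        self.stage = new_stage
        self.history.append(new_stage)


if __name__ == "__main__":
    obj = DataObject("example-dataset")  # hypothetical object name
    for stage in (Stage.STORE, Stage.USE, Stage.ARCHIVE):
        obj.advance(stage)
    obj.advance(Stage.DESTROY, may_delete=True)
    print([s.value for s in obj.history])
```

Encoding the transitions in a table keeps the policy in one place, so a different lifecycle policy only requires editing `ALLOWED`.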
Based on the data security lifecycle, three aspects need to be taken into consideration when talking about security in a cloud: confidentiality, integrity, and availability. Confidentiality means that valuable data in the cloud can only be accessed by authorized parties or systems. As the number of parties or systems in the cloud grows, so does the number of access points, which greatly increases the threat to data stored or archived in the cloud. In the cloud, resources and services are provided in a pay-as-you-go fashion; integrity protects the resources and services that consumers have paid for from unauthorized deletion, modification or fabrication. Availability describes the ability of a cloud to provide resources and services to consumers.
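Two of these aspects lend themselves to a short sketch using Python's standard library. Integrity is illustrated with a keyed hash (HMAC-SHA256) that detects unauthorized modification, and availability with a simple uptime fraction. Both are one possible interpretation, not a prescribed mechanism: the function names, the key, and the sample data are hypothetical, and the uptime-fraction definition of availability is an assumption.

```python
import hashlib
import hmac


def integrity_tag(data: bytes, key: bytes) -> str:
    """Return an HMAC-SHA256 tag for data under a secret key."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def is_unmodified(data: bytes, key: bytes, expected_tag: str) -> bool:
    """Check data against a previously computed tag (constant-time compare)."""
    return hmac.compare_digest(integrity_tag(data, key), expected_tag)


def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Fraction of time the service was up (assumed definition)."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("total observation time must be positive")
    return uptime_seconds / total


if __name__ == "__main__":
    key = b"hypothetical-secret-key"
    original = b"scene metadata v1"
    tag = integrity_tag(original, key)
    print(is_unmodified(original, key, tag))
    print(is_unmodified(b"scene metadata v2", key, tag))
    print(availability(99.0, 1.0))
```

A keyed hash is used rather than a plain hash so that a party without the key cannot recompute a valid tag after altering the data.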
Access to Computing Resources
Partnership for Advanced Computing in Europe (PRACE)
PRACE provides access to distributed, persistent, pan-European, world-class HPC computing and data management systems and services. The main “Tier-0” systems offered by PRACE Hosting Members can be found on the HPC Systems page. Access to these systems is available through Calls for Proposals for the four different access types offered by PRACE.
Extreme Science and Engineering Discovery Environment (XSEDE)
XSEDE is a powerful and robust collection of integrated advanced digital resources and services, giving researchers access to large supercomputers and related services such as storage and science gateways.
European Grid Infrastructure (EGI)
EGI is a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The EGI links centres in different European countries to support international research in many scientific disciplines.
Online Courses and Tutorials
Morris Riedel, Cloud Computing and Big Data, University of Iceland.
Petros Koumoutsakos and Sergio Martin, High Performance Computing for Science and Engineering, ETH Zürich.
Podcasts and Videos
High Performance Computing
A. Plaza and C. I. Chang, “High Performance Computing in Remote Sensing”, Routledge, 2007.
G. Hager and G. Wellein, “Introduction to High Performance Computing for Scientists and Engineers”, Chapman and Hall/CRC, 2010.
T. Sterling, M. Anderson and M. Brodowicz, “High Performance Computing: Modern Systems and Practices”, Morgan Kaufmann, 2017.
T. Ben-Nun and T. Hoefler, “Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis”, Association for Computing Machinery, 2019.
K. Hwang, J. Dongarra and G. Fox, “Distributed and Cloud Computing”, Morgan Kaufmann, 2011.
O. Terzo and L. Mossucca, “Cloud Computing with e-Science Applications”, CRC Press, 2015.
I. Foster and D. B. Gannon, “Cloud Computing for Science and Engineering”, The MIT Press, 2017.
K. Hwang, “Cloud Computing for Machine Learning and Cognitive Applications”, The MIT Press, 2017.
L. Wang, J. Yan and Y. Ma, “Cloud Computing in Remote Sensing”, Chapman and Hall/CRC, 2019.