The intent of this group is to support and engage users of the REPACSS resource by sharing updates about system status, events, and opportunities. The group will also be used to respond to user questions and provide general support. Communication will primarily occur via the REPACSS email platform (repacss.support@ttu.edu), with messages sent as needed for announcements, notifications, and user inquiries.
 

Goals

The focus of the REPACSS project is on improving data center and infrastructure control to provide adaptability to emergent conditions and the ability to adjust workloads to match data center load conditions, including the availability and cost of electrical power. REPACSS also features advanced remote management capabilities and automation tools for managing scientific workflows, specifically targeted for adoption at scale by other resource facilities and industry.

The REmotely-managed Power Aware Computing Systems and Services (REPACSS) resource is a high-performance computing (HPC) cluster, supported by multiple forms of energy, developed to support research into advanced data center control for running scalable scientific workflows and data-intensive research in remotely managed settings. The CPU infrastructure comprises 110 AMD EPYC 9754 compute nodes with access to high-speed cluster-wide storage. Each CPU compute node offers 256 cores and 1.5TB of DDR5 memory, supported by 1.92TB of local NVMe swap and temporary storage for high-speed checkpoint/restore and local ephemeral use. The CPU nodes are interconnected with the rest of the cluster and with storage by NVIDIA ConnectX-7 NDR InfiniBand adapters running at 200 Gbps per card, with two InfiniBand cards per node. Hammerspace storage provides nearly 3PB of combined NVMe and HDD capacity, supporting large-scale data throughput. All nodes are controlled and provisioned through high-bandwidth Dell PowerSwitch S5248-ON and S5232-ON Ethernet switches at 25 Gbps per node. The cluster supports intelligent workload placement and adaptive scheduling tools that align computational activity with low-cost energy availability, matching as much of the workload as possible to inexpensive power.
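For capacity planning, the published node specifications imply roughly 6 GB of memory per core and 400 Gbps of aggregate InfiniBand bandwidth per node. A minimal sketch of that arithmetic, using only the figures quoted above:

# Back-of-the-envelope capacity figures for a REPACSS CPU node,
# derived from the specifications quoted above.
CORES_PER_NODE = 256          # cores per CPU compute node
MEM_TB_PER_NODE = 1.5         # DDR5 memory per node
IB_GBPS_PER_CARD = 200        # ConnectX-7 NDR link rate per card
IB_CARDS_PER_NODE = 2
NODES = 110

mem_gb_per_core = MEM_TB_PER_NODE * 1024 / CORES_PER_NODE
agg_ib_gbps = IB_GBPS_PER_CARD * IB_CARDS_PER_NODE
total_cores = CORES_PER_NODE * NODES

print(f"Memory per core:     {mem_gb_per_core:.1f} GB")   # ~6.0 GB
print(f"InfiniBand per node: {agg_ib_gbps} Gbps")          # 400 Gbps
print(f"Total CPU cores:     {total_cores}")               # 28,160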

The REPACSS system is designed to support researchers in developing more efficient science and engineering workflows that exploit the cost savings achievable by matching workloads to energy availability. The platform supports a range of workloads, including small- to mid-scale jobs (from one to several thousand cores), making it well suited to computational fluid dynamics, weather and climate modeling, long-tail science, data analytics, and AI/ML model training using H100 GPUs. It is designed to be representative of a large class of similar resources that combine a hybrid compute architecture, modern accelerators, and multi-tiered high-throughput storage suited for deep learning, data-driven discovery, and complex simulations, with the added feature that the facility is designed to be operated almost entirely remotely. REPACSS is particularly recommended for researchers interested in exploring adaptive energy-aware computing, workload-energy scheduling trade-offs, and renewable-powered compute scenarios. It is also a strong fit for academic-industry collaboration and education-focused HPC, including undergraduate and broader engagement programs focused on advanced data center control and associated methods. The resource is ideal for users looking to apply REPACSS methods to their own facilities to reduce cost and improve the uptime of their computational science while supporting cutting-edge computing capabilities.
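To illustrate the workload-energy scheduling trade-off described above, the sketch below greedily places deferrable jobs into the cheapest forecast hours of a day. It is a conceptual illustration only: the hourly price series and job list are hypothetical, and this is not the actual REPACSS placement tooling.

# Conceptual sketch: greedy placement of deferrable jobs into the
# cheapest forecast hours. Prices and jobs are hypothetical; this is
# not the REPACSS scheduler itself.
def schedule(jobs, prices):
    """Assign each job a start hour that minimizes its energy cost.

    jobs:   list of (name, duration_hours) tuples
    prices: forecast price for each hour of the planning window
    """
    free = [True] * len(prices)              # hours still unreserved
    plan = {}
    for name, dur in sorted(jobs, key=lambda j: -j[1]):  # longest first
        best_start, best_cost = None, float("inf")
        for start in range(len(prices) - dur + 1):
            window = range(start, start + dur)
            if all(free[h] for h in window):
                cost = sum(prices[h] for h in window)
                if cost < best_cost:
                    best_start, best_cost = start, cost
        if best_start is not None:
            for h in range(best_start, best_start + dur):
                free[h] = False              # reserve the chosen window
            plan[name] = (best_start, best_cost)
    return plan

# Hypothetical day-ahead prices ($/MWh): cheapest overnight and midday.
prices = [22, 18, 15, 14, 16, 25, 40, 55, 48, 35, 28, 20,
          17, 16, 19, 30, 52, 70, 65, 50, 38, 30, 26, 24]
jobs = [("cfd_run", 6), ("ml_train", 4), ("postproc", 2)]

for name, (start, cost) in schedule(jobs, prices).items():
    print(f"{name}: start hour {start:2d}, relative energy cost {cost}")

A production scheduler would also weigh queue fairness, node availability, and job deadlines; the point here is only how a price forecast can shift deferrable work into low-cost windows.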

Associated Resources

The REPACSS GPU subsystem complements the CPU partition described above. The GPU nodes feature dual-socket Intel Xeon Gold 6448Y processors, 512GB of RAM, and four H100 GPUs connected as two H100-NVL pairs per node. The GPU nodes are interconnected with the rest of the cluster and with storage by NVIDIA ConnectX-7 NDR InfiniBand adapters running at 200 Gbps per card, with two InfiniBand cards per node, and they share the Hammerspace storage, the Dell PowerSwitch Ethernet management fabric, and the energy-aware workload placement and scheduling tools described above.
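As a quick sanity check inside a GPU job, a sketch like the following confirms that all four H100s on a node are visible. It assumes PyTorch is available in the user's environment; partition and module names are site-specific and not shown here.

# Minimal sketch: confirm GPU visibility on a REPACSS GPU node.
# Assumes PyTorch is installed; this is illustrative, not site tooling.
import torch

if torch.cuda.is_available():
    n = torch.cuda.device_count()          # expect 4 on a full GPU node
    for i in range(n):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
else:
    print("No CUDA devices visible; check the job's GPU allocation.")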


Members get updates about announcements, events, and outages.


Upcoming Events

No upcoming events.

Announcements

Coordinators