The number of AI-for-science use cases is rising, and more and more scientists and engineers are training their own deep learning models. This increases the demand for HPC resources and computing clusters, since training better models requires more, and distributed, resources, either to reduce training time or to fit larger models. But are our training workloads performant? Are we using our resources efficiently? A one-size-fits-all solution is not a good strategy when datasets, model architectures, and AI libraries are so diverse and so tailored to each case, and new solutions appear every month.
The purpose of this training is to provide the skills and knowledge needed to understand how the training of a deep neural network uses the resources of a computer. The training will guide students in gaining insight into the execution using HPC performance tools and, by making informed decisions, applying optimizations and adopting more advanced libraries.


Marc Clascà has been a research engineer at the Barcelona Supercomputing Center since 2020. His research interests include programming models, performance tools, performance analysis, and specialized hardware and accelerators. He works on HPC parallel performance analysis in the Best Practices for Performance and Programmability (BePPP) group, which aims to provide the scientific community with best practices for programming portable and performant codes. His current research focuses on analyzing the performance of scientific applications and AI workloads that use GPUs, with the aim of deriving new efficiency metrics and analysis methodologies. This includes exploring the potential of GPU-specific tracing and visualization tools, and understanding the new programming models and communication patterns used in LLM training and inference.