Neural Language, Speech, and Multimodal Models

Training details

Location

UPC North Campus

Start Date

5 October 2026

End Date

18 December 2026

Target Audience

Student-Focused

Teaching language(s)

English / Spanish

Organizing institution

Universitat Politècnica de Catalunya

Delivery mode

Hybrid

Level

Advanced

Capacity or seats limit

Industrial domains

Agriculture, Energy

Topics / Keywords

Natural Language Processing, Word Embeddings, RNNs, Attention, Transformers, Machine Translation, Large Language Models, Speech Recognition, Text-to-Speech, Self-Supervised Learning, Multimodal AI.

This 3 ECTS course provides a comprehensive overview of modern neural models for natural language processing, speech, and multimodal AI. Starting from foundational concepts such as word embeddings and recurrent neural networks, the course progresses through attention mechanisms and transformer architectures, culminating in large language models, self-supervised speech representations, and multimodal systems.

Through a combination of theory, practical examples, and model analysis, participants will gain a solid understanding of how state-of-the-art NLP and speech systems are built, trained, and applied. The course emphasizes conceptual clarity, architectural understanding, and informed model selection rather than low-level implementation from scratch.

The course is delivered in a hybrid format and concludes with projects focused on analyzing or applying modern pretrained models to realistic tasks.

What You Will Learn

Learning objectives
- Explain the evolution of neural models for language and speech processing
- Describe and compare tokenization and word representation methods
- Analyze RNN, attention-based, and transformer architectures
- Explain the principles behind pretrained language and translation models
- Describe self-supervised learning approaches for speech
- Compare modern ASR and TTS systems

Learning outcomes
- Interpret the architecture and behavior of modern NLP and speech models
- Select appropriate pretrained models for language and speech tasks
- Explain the trade-offs between different neural architectures
- Analyze the strengths and limitations of large language models
- Understand end-to-end pipelines for speech recognition and synthesis
- Reason about multimodal AI systems and their use cases

Agenda

UNIT 1 — Foundations of Neural Language Models (Weeks 1–3)

Week 1 — Tokenization and Word Representations

word2vec: CBOW and Skip-gram
Subword tokenization methods
Byte Pair Encoding (BPE) and modern tokenizers
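
As an illustration of the Week 1 topics, the following is a minimal sketch of how BPE merges can be learned from word frequencies. The function names (count_pairs, merge_pair, learn_bpe), the toy corpus, and the "</w>" end-of-word marker are assumptions made for this example; they are not course materials, and production tokenizers add many details omitted here.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(vocab, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Greedily learn up to `num_merges` merges from a word-frequency dict."""
    # Start from characters, with a hypothetical end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        # Most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges, vocab


# Toy corpus (assumed for illustration only).
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_corpus, num_merges=3)
print(merges)
```

On this toy corpus the frequent suffix shared by "newest" and "widest" is merged first, which shows how subword units emerge from frequency statistics alone.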

Week 2 — Recurrent Neural Networks for Language Modeling

Language modeling as a fundamental NLP task
RNNs, LSTMs, and GRUs
Training and evaluation of neural language models
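
One common way to evaluate a language model is perplexity, the exponential of the average negative log-probability the model assigns to each token. The sketch below assumes the per-token probabilities are already available as a list; the function name and the example values are hypothetical.

```python
import math


def perplexity(token_probs):
    """Return exp of the mean negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that spreads probability uniformly over 4 choices at each step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is more confident about the observed tokens.
confident = perplexity([0.9, 0.8, 0.95, 0.7])
print(uniform, confident)
```

Lower perplexity means the model assigned higher probability to the observed text; a uniform guess over k options gives a perplexity of k.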

Week 3 — Attention and Neural Machine Translation

Machine Translation
Encoder–decoder architectures
Motivation for attention mechanisms
Attention-based neural machine translation

UNIT 2 — Transformers and Large Language Models (Weeks 4–5)

Week 4 — Transformer Architectures

Self-attention and multi-head attention
Transformer encoder and decoder blocks
Efficiency, scalability, and parallelization
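
To make the Week 4 material concrete, here is a minimal pure-Python sketch of scaled dot-product attention, the core operation described in Vaswani et al. (2017). It is a simplification under stated assumptions: a single head, no masking, and the input vectors used directly as queries, keys, and values, whereas a real transformer layer first applies learned projection matrices. The function names and the toy input are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Three toy token vectors; self-attention uses the same sequence for Q, K, and V.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = scaled_dot_product_attention(x, x, x)
print(attn)
```

Each query is processed independently of the others, which is why the computation parallelizes well across sequence positions, in contrast to the step-by-step recurrence of an RNN.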

Week 5 — Pretrained Models and Large Language Models

From transformers to large language models
Encoder-based vs. autoregressive models
BERT, GPT, T5, and mBART
Prompting, in-context learning, and limitations of LLMs

UNIT 3 — Speech Models and Self-Supervised Learning (Weeks 6–7)

Week 6 — Self-Supervised Speech Representations

The speech signal
Self-supervised learning paradigms
wav2vec and HuBERT

Week 7 — Neural Speech Recognition and Text-to-Speech

End-to-end ASR systems
Whisper and large-scale weak supervision
Conformer architectures
Neural TTS pipelines
WaveNet and GAN-based vocoders (HiFi-GAN)

UNIT 4 — Multimodal Models and Integration (Weeks 8–9)

Week 8 — Multimodal Learning and Multimodal LLMs

Multimodal data: text, speech, vision
Joint embedding spaces
Multimodal transformers
Architecture patterns for multimodal LLMs
Capabilities, risks, and limitations

Week 9 — Final Project and Presentation

Project development and mentoring
Application of language and speech models
Final presentation and discussion

Instructor name(s)

Javier Hernando

Instructor's biography

Javier Hernando received the M.S. and Ph.D. degrees in telecommunication engineering from the Technical University of Catalonia (UPC), Barcelona, Spain, in 1988 and 1993, respectively. He is currently a Full Professor and the Director of the UPC Research Center for Language and Speech. He is also Head of Research on Speech and Speaker Technologies at the Barcelona Supercomputing Center (BSC). During the 2002-03 academic year, he worked as a consultant at the Panasonic Speech Technology Laboratory in Santa Barbara, California. His research focuses on digital signal processing and machine learning techniques and their application to automatic speech and speaker recognition. His results have been published in more than two hundred journal articles and conference papers.

Course Description

Neural models for language and speech have undergone a rapid evolution, culminating in large-scale pretrained and multimodal systems that underpin many modern AI applications. This course is designed to provide participants with a structured understanding of this evolution, focusing on the architectural and conceptual breakthroughs that enabled current state-of-the-art models.

Starting from classic word representations and recurrent neural networks, the course introduces attention mechanisms and transformer architectures, which form the backbone of modern NLP and speech systems. Participants will explore large language models, multilingual and pretrained approaches, and recent advances in speech representation learning, automatic speech recognition, and text-to-speech synthesis.

The final part of the course addresses multimodal models that integrate language, speech, and other modalities, highlighting current research directions and practical considerations. By the end of the course, participants will be equipped to understand, analyze, and responsibly apply modern neural language and speech models in research and industry contexts.

Prerequisites

Comfortable programming in Python (functions, basic data structures, and use of standard libraries)
Basic knowledge of linear algebra (vectors, matrices, basic operations)
Introductory understanding of probability and statistics
Basic familiarity with machine learning concepts (training/validation, supervised vs. unsupervised learning)

Certificate/badge details

Certificate of Achievement

Required readings or materials

Jurafsky, D. & Martin, J. – Speech and Language Processing (3rd ed., draft) https://web.stanford.edu/~jurafsky/slp3/
Vaswani et al. (2017) – Attention Is All You Need https://arxiv.org/abs/1706.03762
Devlin et al. (2019) – BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805
Baevski et al. (2020) – wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations https://arxiv.org/abs/2006.11477