Neural Language, Speech, and Multimodal Models

Training details

Location

UPC North Campus

Start Date

05/10/2026

End Date

18/12/2026

Target Audience

Student-Focused

Teaching language(s)

English / Spanish

Organizing institution

Universitat Politècnica de Catalunya

Delivery mode

Hybrid

Level

Advanced

Capacity or seats limit

30

Industrial domains

Agriculture, Energy

Topics / Keywords

Natural Language Processing, Word Embeddings, RNNs, Attention, Transformers, Machine Translation, Large Language Models, Speech Recognition, Text-to-Speech, Self-Supervised Learning, Multimodal AI.

This 3 ECTS course provides a comprehensive overview of modern neural models for natural language processing, speech, and multimodal AI. Starting from foundational concepts such as word embeddings and recurrent neural networks, the course progresses through attention mechanisms and transformer architectures, culminating in large language models, self-supervised speech representations, and multimodal systems.

Through a combination of theory, practical examples, and model analysis, participants will gain a solid understanding of how state-of-the-art NLP and speech systems are built, trained, and applied. The course emphasizes conceptual clarity, architectural understanding, and informed model selection rather than low-level implementation from scratch.

The course is delivered in a hybrid format and concludes with projects focused on analyzing or applying modern pretrained models to realistic tasks.

What You Will Learn

  • Learning objectives
    • Explain the evolution of neural models for language and speech processing
    • Describe and compare tokenization and word representation methods
    • Analyze RNN, attention-based, and transformer architectures
    • Explain the principles behind pretrained language and translation models
    • Describe self-supervised learning approaches for speech
    • Compare modern ASR and TTS systems
  • Learning outcomes
    • Interpret the architecture and behavior of modern NLP and speech models
    • Select appropriate pretrained models for language and speech tasks
    • Explain the trade-offs between different neural architectures
    • Analyze the strengths and limitations of large language models
    • Understand end-to-end pipelines for speech recognition and synthesis
    • Reason about multimodal AI systems and their use cases

Agenda

UNIT 1 — Foundations of Neural Language Models (Weeks 1–3)

Week 1 — Tokenization and Word Representations

  • word2vec: CBOW and Skip-gram
  • Subword tokenization methods
  • Byte Pair Encoding (BPE) and modern tokenizers
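
As an illustration of the BPE topic above, the merge-learning loop can be sketched in a few lines of Python. This is a minimal sketch and not course material: the toy corpus and the function names (`count_pairs`, `merge_pair`, `learn_bpe`) are hypothetical, words are split into characters with no end-of-word marker, and ties between equally frequent pairs are broken by first-seen order, which is an arbitrary choice.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Greedily learn `num_merges` merges from a word-frequency table."""
    # Start from character-level symbols (assumption: no end-of-word marker).
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(vocab, best)
        merges.append(best)
    return merges, vocab


# Hypothetical toy corpus: word -> frequency.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_corpus, num_merges=3)
print(merges)
```

On this toy corpus the first two merges build the suffix `est` out of `e`, `s`, and `t`, which shows how frequent character sequences become single subword units.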

Week 2 — Recurrent Neural Networks for Language Modeling

  • Language modeling as a fundamental NLP task
  • RNNs, LSTMs, and GRUs
  • Training and evaluation of neural language models

Week 3 — Attention and Neural Machine Translation

  • Machine Translation
  • Encoder–decoder architectures
  • Motivation for attention mechanisms
  • Attention-based neural machine translation

UNIT 2 — Transformers and Large Language Models (Weeks 4–5)

Week 4 — Transformer Architectures

  • Self-attention and multi-head attention
  • Transformer encoder and decoder blocks
  • Efficiency, scalability, and parallelization
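
To make the self-attention topic above concrete, scaled dot-product attention for a single head can be sketched with only the standard library. This is a minimal sketch under stated assumptions: vectors are plain Python lists, there is no masking, batching, or learned projection, and the function names and toy inputs are hypothetical rather than taken from the course.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Each query is compared with every key; the softmax of the scaled
    scores gives the weights used to average the value vectors.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical toy inputs: two queries attending over three key/value pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = attention(queries, keys, values)
print(weights)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors. Multi-head attention repeats this computation with several learned projections of the same inputs.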

Week 5 — Pretrained Models and Large Language Models

  • From transformers to large language models
  • Encoder-based vs. autoregressive models
  • BERT, GPT, T5, and mBART
  • Prompting, in-context learning, and limitations of LLMs

UNIT 3 — Speech Models and Self-Supervised Learning (Weeks 6–7)

Week 6 — Self-Supervised Speech Representations

  • The speech signal
  • Self-supervised learning paradigms
  • wav2vec and HuBERT

Week 7 — Neural Speech Recognition and Text-to-Speech

  • End-to-end ASR systems
  • Whisper and large-scale weak supervision
  • Conformer architectures
  • Neural TTS pipelines
  • WaveNet and GAN-based vocoders (HiFi-GAN)

UNIT 4 — Multimodal Models and Integration (Weeks 8–9)

Week 8 — Multimodal Learning and Multimodal LLMs

  • Multimodal data: text, speech, vision
  • Joint embedding spaces
  • Multimodal transformers
  • Architecture patterns for multimodal LLMs
  • Capabilities, risks, and limitations

Week 9 — Final Project and Presentation

  • Project development and mentoring
  • Application of language and speech models
  • Final presentation and discussion

Instructor name(s)

Javier Hernando

Instructor's biography

Javier Hernando received the M.S. and Ph.D. degrees in telecommunication engineering from the Technical University of Catalonia (UPC), Barcelona, Spain, in 1988 and 1993, respectively. He is currently a Full Professor and the Director of the UPC Research Center for Language and Speech. He is also Head of Research on Speech and Speaker Technologies at the Barcelona Supercomputing Center (BSC). During the 2002-03 academic year, he worked as a consultant at the Panasonic Speech Technology Laboratory in Santa Barbara, California. His research focuses on digital signal processing and machine learning techniques and their application to automatic speech and speaker recognition. He has published more than two hundred journal articles and conference papers.

Course Description

Neural models for language and speech have undergone a rapid evolution, culminating in large-scale pretrained and multimodal systems that underpin many modern AI applications. This course is designed to provide participants with a structured understanding of this evolution, focusing on the architectural and conceptual breakthroughs that enabled current state-of-the-art models.

Starting from classic word representations and recurrent neural networks, the course introduces attention mechanisms and transformer architectures, which form the backbone of modern NLP and speech systems. Participants will explore large language models, multilingual and pretrained approaches, and recent advances in speech representation learning, automatic speech recognition, and text-to-speech synthesis.

The final part of the course addresses multimodal models that integrate language, speech, and other modalities, highlighting current research directions and practical considerations. By the end of the course, participants will be equipped to understand, analyze, and responsibly apply modern neural language and speech models in research and industry contexts.

Prerequisites

  • Comfortable programming in Python (functions, basic data structures, and use of standard libraries)
  • Basic knowledge of linear algebra (vectors, matrices, basic operations)
  • Introductory understanding of probability and statistics
  • Basic familiarity with machine learning concepts (training/validation, supervised vs. unsupervised learning)

Certificate/badge details

Certificate of Achievement

Required readings or materials