CS 5470

Course information provided by the 2025-2026 Catalog.

Systems for Large-scale ML is a new advanced course at Cornell University designed to give students hands-on expertise in designing and operating scalable machine learning systems. With the rise of large ML models such as GPT, LLaMA, and DeepSeek, distributing training and inference workloads across multi-accelerator hardware while ensuring fault tolerance has become a crucial skill for both graduate and undergraduate students in computer science. The course teaches students to tackle systems challenges in both training large-scale ML models and running inference with them, combining theory and hands-on practice through lectures, assignments, and projects.


Last 3 terms offered (None)

Learning Outcomes REF-FA25

  • Distribute the training and inference of large ML models across multiple GPUs.
  • Build efficient strategies for sharding ML models.
  • Debug communication overheads of distributed ML.
  • Develop fault-tolerant and elastic ML pipelines.

Syllabi: none
  • Regular Academic Session.

  • 3 Credits GradeNoAud

  • 20549 CS 5470   LEC 001

    • MW
    • Aug 25 - Dec 8, 2025
    • Singh, R

  • Instruction Mode: In Person

    For Bowers Computer and Information Science (CIS) Course Enrollment Help, please see: https://tdx.cornell.edu/TDClient/193/Portal/Home/