Artwork

Το περιεχόμενο παρέχεται από το Demetrios Brinkmann. Όλο το περιεχόμενο podcast, συμπεριλαμβανομένων των επεισοδίων, των γραφικών και των περιγραφών podcast, μεταφορτώνεται και παρέχεται απευθείας από τον Demetrios Brinkmann ή τον συνεργάτη της πλατφόρμας podcast. Εάν πιστεύετε ότι κάποιος χρησιμοποιεί το έργο σας που προστατεύεται από πνευματικά δικαιώματα χωρίς την άδειά σας, μπορείτε να ακολουθήσετε τη διαδικασία που περιγράφεται εδώ https://el.player.fm/legal.
Player FM - Εφαρμογή podcast
Πηγαίνετε εκτός σύνδεσης με την εφαρμογή Player FM !

Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // Bernie Wu // #270

55:18
 
Μοίρασέ το
 

Manage episode 446382145 series 3241972
Το περιεχόμενο παρέχεται από το Demetrios Brinkmann. Όλο το περιεχόμενο podcast, συμπεριλαμβανομένων των επεισοδίων, των γραφικών και των περιγραφών podcast, μεταφορτώνεται και παρέχεται απευθείας από τον Demetrios Brinkmann ή τον συνεργάτη της πλατφόρμας podcast. Εάν πιστεύετε ότι κάποιος χρησιμοποιεί το έργο σας που προστατεύεται από πνευματικά δικαιώματα χωρίς την άδειά σας, μπορείτε να ακολουθήσετε τη διαδικασία που περιγράφεται εδώ https://el.player.fm/legal.

Bernie Wu is VP of Business Development for MemVerge. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // MLOps Podcast #270 with Bernie Wu, VP Strategic Partnerships/Business Development of MemVerge. // Abstract Limited memory capacity hinders the performance and potential of research and production environments utilizing Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. This discussion explores how leveraging industry-standard CXL memory can be configured as a secondary, composable memory tier to alleviate this constraint. We will highlight some recent work we’ve done in integrating of this novel class of memory into LLM/RAG/vector database frameworks and workflows. Disaggregated shared memory is envisioned to offer high performance, low latency caches for model/pipeline checkpoints of LLM models, KV caches during distributed inferencing, LORA adaptors, and in-process data for heterogeneous CPU/GPU workflows. We expect to showcase these types of use cases in the coming months. // Bio Bernie is VP of Strategic Partnerships/Business Development for MemVerge. His focus has been building partnerships in the AI/ML, Kubernetes, and CXL memory ecosystems. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. He is also on the Board of Directors for Cirrus Data Solutions. Bernie has a BS/MS in Engineering from UC Berkeley and an MBA from UCLA. // MLOps Swag/Merch https://mlops-community.myshopify.com/ // Related Links Website: www.memverge.com Accelerating Data Retrieval in Retrieval Augmentation Generation (RAG) Pipelines using CXL: https://memverge.com/accelerating-data-retrieval-in-rag-pipelines-using-cxl/ Do Re MI for Training Metrics: Start at the Beginning // Todd Underwood // AIQCON: https://youtu.be/DxyOlRdCofo Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // MLOps Podcast #228: https://youtu.be/6MY-IgqiTpg

Compute Express Link (CXL) FPGA IP: https://www.intel.com/content/www/us/en/products/details/fpga/intellectual-property/interface-protocols/cxl-ip.htmlUltra Ethernet Consortium: https://ultraethernet.org/

Unified Acceleration (UXL) Foundation: https://www.intel.com/content/www/us/en/developer/articles/news/unified-acceleration-uxl-foundation.html

RoCE networks for distributed AI training at scale: https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/ --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Bernie on LinkedIn: https://www.linkedin.com/in/berniewu/

Timestamps: [00:00] Bernie's preferred coffee [00:11] Takeaways [01:37] First principles thinking focus [05:02] Memory Abundance Concept Discussion [06:45] Managing load spikes [09:38] GPU checkpointing challenges [16:29] Distributed memory problem solving [18:27] Composable and Virtual Memory [21:49] Interactive chat annotation [23:46] Memory elasticity in AI [27:33] GPU networking tests [29:12] GPU Scheduling workflow optimization [32:18] Kubernetes Extensions and Tools [37:14] GPU bottleneck analysis [42:04] Economical memory strategies [45:14] Elastic memory management strategies [47:57] Problem solving approach [50:15] AI infrastructure elasticity evolution [52:33] RDMA and RoCE explained [54:14] Wrap up

  continue reading

396 επεισόδια

Artwork
iconΜοίρασέ το
 
Manage episode 446382145 series 3241972
Το περιεχόμενο παρέχεται από το Demetrios Brinkmann. Όλο το περιεχόμενο podcast, συμπεριλαμβανομένων των επεισοδίων, των γραφικών και των περιγραφών podcast, μεταφορτώνεται και παρέχεται απευθείας από τον Demetrios Brinkmann ή τον συνεργάτη της πλατφόρμας podcast. Εάν πιστεύετε ότι κάποιος χρησιμοποιεί το έργο σας που προστατεύεται από πνευματικά δικαιώματα χωρίς την άδειά σας, μπορείτε να ακολουθήσετε τη διαδικασία που περιγράφεται εδώ https://el.player.fm/legal.

Bernie Wu is VP of Business Development for MemVerge. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. Boosting LLM/RAG Workflows & Scheduling w/ Composable Memory and Checkpointing // MLOps Podcast #270 with Bernie Wu, VP Strategic Partnerships/Business Development of MemVerge. // Abstract Limited memory capacity hinders the performance and potential of research and production environments utilizing Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) techniques. This discussion explores how leveraging industry-standard CXL memory can be configured as a secondary, composable memory tier to alleviate this constraint. We will highlight some recent work we’ve done in integrating of this novel class of memory into LLM/RAG/vector database frameworks and workflows. Disaggregated shared memory is envisioned to offer high performance, low latency caches for model/pipeline checkpoints of LLM models, KV caches during distributed inferencing, LORA adaptors, and in-process data for heterogeneous CPU/GPU workflows. We expect to showcase these types of use cases in the coming months. // Bio Bernie is VP of Strategic Partnerships/Business Development for MemVerge. His focus has been building partnerships in the AI/ML, Kubernetes, and CXL memory ecosystems. He has 25+ years of experience as a senior executive for data center hardware and software infrastructure companies including companies such as Conner/Seagate, Cheyenne Software, Trend Micro, FalconStor, Levyx, and MetalSoft. He is also on the Board of Directors for Cirrus Data Solutions. Bernie has a BS/MS in Engineering from UC Berkeley and an MBA from UCLA. // MLOps Swag/Merch https://mlops-community.myshopify.com/ // Related Links Website: www.memverge.com Accelerating Data Retrieval in Retrieval Augmentation Generation (RAG) Pipelines using CXL: https://memverge.com/accelerating-data-retrieval-in-rag-pipelines-using-cxl/ Do Re MI for Training Metrics: Start at the Beginning // Todd Underwood // AIQCON: https://youtu.be/DxyOlRdCofo Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // MLOps Podcast #228: https://youtu.be/6MY-IgqiTpg

Compute Express Link (CXL) FPGA IP: https://www.intel.com/content/www/us/en/products/details/fpga/intellectual-property/interface-protocols/cxl-ip.htmlUltra Ethernet Consortium: https://ultraethernet.org/

Unified Acceleration (UXL) Foundation: https://www.intel.com/content/www/us/en/developer/articles/news/unified-acceleration-uxl-foundation.html

RoCE networks for distributed AI training at scale: https://engineering.fb.com/2024/08/05/data-center-engineering/roce-network-distributed-ai-training-at-scale/ --------------- ✌️Connect With Us ✌️ ------------- Join our slack community: https://go.mlops.community/slack Follow us on Twitter: @mlopscommunity Sign up for the next meetup: https://go.mlops.community/register Catch all episodes, blogs, newsletters, and more: https://mlops.community/ Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/ Connect with Bernie on LinkedIn: https://www.linkedin.com/in/berniewu/

Timestamps: [00:00] Bernie's preferred coffee [00:11] Takeaways [01:37] First principles thinking focus [05:02] Memory Abundance Concept Discussion [06:45] Managing load spikes [09:38] GPU checkpointing challenges [16:29] Distributed memory problem solving [18:27] Composable and Virtual Memory [21:49] Interactive chat annotation [23:46] Memory elasticity in AI [27:33] GPU networking tests [29:12] GPU Scheduling workflow optimization [32:18] Kubernetes Extensions and Tools [37:14] GPU bottleneck analysis [42:04] Economical memory strategies [45:14] Elastic memory management strategies [47:57] Problem solving approach [50:15] AI infrastructure elasticity evolution [52:33] RDMA and RoCE explained [54:14] Wrap up

  continue reading

396 επεισόδια

Tüm bölümler

×
 
Loading …

Καλώς ήλθατε στο Player FM!

Το FM Player σαρώνει τον ιστό για podcasts υψηλής ποιότητας για να απολαύσετε αυτή τη στιγμή. Είναι η καλύτερη εφαρμογή podcast και λειτουργεί σε Android, iPhone και στον ιστό. Εγγραφή για συγχρονισμό συνδρομών σε όλες τις συσκευές.

 

Οδηγός γρήγορης αναφοράς

Ακούστε αυτήν την εκπομπή ενώ εξερευνάτε
Αναπαραγωγή