Course Outline
EXO Infrastructure as Code
- Overview of EXO deployment patterns: single-node, multi-node, and RDMA clusters
- Automating dependency installation (Xcode, uv, Node.js, Rust) with configuration management
- Using Nix flakes for reproducible EXO builds and developer environments
- Writing Ansible playbooks or shell scripts for unattended cluster provisioning
Reproducible Builds and CI Integration
- Pinning dependencies and building the dashboard in CI pipelines
- Running EXO smoke tests in GitHub Actions or GitLab CI runners
- Creating golden images and snapshot-based rollback workflows for macOS and Linux VMs
- Versioning custom model cards alongside application code
Cluster Discovery and Networking Automation
- Configuring mDNS and static DNS for reliable libp2p node discovery
- Automating network profile creation and Thunderbolt bridge management on macOS
- Using custom namespaces (EXO_LIBP2P_NAMESPACE) to separate dev, staging, and prod clusters
- Firewall rules and network segmentation for multi-tenant environments
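The namespace separation listed above can be sketched in Python as follows. This is a minimal illustration rather than EXO's own tooling: the helper `namespaced_env`, the stage names, and the `exo-<environment>` naming scheme are assumptions made for the example; only the `EXO_LIBP2P_NAMESPACE` variable comes from the outline.

```python
import os

# Hypothetical stage names; adjust to your own environments.
ENVIRONMENTS = ("dev", "staging", "prod")


def namespaced_env(environment, base=None):
    """Return a copy of the environment with EXO_LIBP2P_NAMESPACE set.

    The "exo-<environment>" value format is an assumed naming convention,
    not something prescribed by EXO.
    """
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    # Copy so the caller's mapping (or os.environ) is never mutated.
    env = dict(os.environ if base is None else base)
    env["EXO_LIBP2P_NAMESPACE"] = f"exo-{environment}"
    return env


if __name__ == "__main__":
    for name in ENVIRONMENTS:
        print(name, namespaced_env(name, base={})["EXO_LIBP2P_NAMESPACE"])
```

The returned mapping could be passed as the `env=` argument of `subprocess.run` when a provisioning script launches a node, keeping the namespace choice in one place per stage.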
Storage and Model Lifecycle Management
- Designing EXO_MODELS_DIRS and EXO_MODELS_READ_ONLY_DIRS strategies
- Mounting NFS or SAN shares as read-only model repositories for fast provisioning
- Garbage collection of stale caches and versioned weight retention policies
- Automating model pre-downloads and health checks before rolling updates
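A retention policy for cached weights, as mentioned above, might be sketched like this. The example is an assumption-laden illustration: the `CachedVersion` record, the `select_stale` helper, and the "keep the newest N, then expire by age" rule are hypothetical and not part of EXO. The function only selects candidates; it deletes nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedVersion:
    name: str     # hypothetical label, e.g. a cached model directory name
    mtime: float  # last-modified time, in seconds since the epoch


def select_stale(versions, keep_latest, max_age_seconds, now):
    """Return versions that are outside the newest `keep_latest` entries
    AND older than `max_age_seconds`, ordered newest first."""
    newest_first = sorted(versions, key=lambda v: v.mtime, reverse=True)
    # The newest `keep_latest` entries are always retained.
    candidates = newest_first[keep_latest:]
    return [v for v in candidates if now - v.mtime > max_age_seconds]
```

Separating selection from deletion lets a dry run log the candidate list before anything is removed from disk.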
Monitoring and Alerting
- Shipping EXO logs to centralized logging (ELK, Loki, or Splunk)
- Building Grafana dashboards from EXO_TRACING_ENABLED output
- Alerting on cluster membership changes, OOM events, and inference latency spikes
- Correlating macmon hardware telemetry with model performance regressions
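Alerting on inference latency spikes, listed above, can be illustrated with a small rolling-median detector. This is a sketch under stated assumptions: the `LatencySpikeDetector` class, the window size, and the threshold factor are hypothetical choices, and the latency samples are assumed to come from whatever metrics pipeline is already in place.

```python
from collections import deque
from statistics import median


class LatencySpikeDetector:
    """Flag a sample that exceeds `factor` times the median of the
    previous `window` samples."""

    def __init__(self, window=20, factor=3.0):
        self.samples = deque(maxlen=window)
        self.factor = factor

    def observe(self, latency_ms):
        # Only judge once the window is full, to avoid noisy start-up alerts.
        is_spike = (
            len(self.samples) == self.samples.maxlen
            and latency_ms > self.factor * median(self.samples)
        )
        # The new sample joins the baseline after it has been judged.
        self.samples.append(latency_ms)
        return is_spike
```

A median baseline is used here instead of a mean so that a single earlier outlier does not raise the threshold for later samples.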
Update, Rollback, and Disaster Recovery
- Staging EXO binary updates in a canary node before fleet-wide rollout
- Model-level rollback: switching between quantized versions without re-downloading
- Backing up and restoring cluster state, custom namespaces, and cached weights
- Documenting recovery runbooks for total cluster rebuild scenarios
Security Hardening and Compliance
- Applying TLS at the reverse proxy layer (nginx, traefik) for the dashboard and API
- Implementing API rate limiting and IP whitelisting for EXO endpoints
- Isolating clusters with VLANs and zero-trust network policies
- Auditing access and maintaining an inventory of deployed models and versions
Prerequisites
- Experience with DevOps practices (CI/CD, IaC, container orchestration)
- Familiarity with macOS or Linux system administration and package management
- Understanding of networking, DNS, and storage concepts
Audience
- DevOps engineers
- Infrastructure architects
- SREs responsible for on-premise AI workloads
21 Hours
Testimonials (2)
Craig was extremely engaged during the training and always made sure we were paying attention. He adapted the examples to our daily activities and always gave an answer when asked, even when the information was not in the presentation material.
Ecaterina Ioana Nicoale - BOOKING HOLDINGS ROMANIA SRL
Course - DevOps Foundation®
Machine translated
The trainer's strong commitment and expertise
Jacek - Softsystem
Course - DevOps Engineering Foundation (DOEF)®
Machine translated