The ML Engineer
Move to buttondown, ML careers and engineering
This week's newsletter might look a bit different: blaze's email sending service has moved to https://buttondown.com/. This brings several benefits for subscribers, including:
- a searchable newsletter archive at https://ml.blaze.email
- full-text newsletter RSS feeds at https://ml.blaze.email/rss
- nicer-looking email!
Everything else about your newsletter stays the same, and you can change your subscription any time you like 😊
🔧 Company Engineering Blogs
Beyond Founder Mode: Mission Mode (blog.palantir.com). Mission Mode organizes the entire company around customer missions, embedding engineers with customer teams and prioritizing outcomes over founder involvement
Disaggregated Scheduled Fabric: Scaling Meta’s AI Journey (engineering.fb.com). DSF disaggregates line/fabric cards to scale AI training networks with VOQ-based packet spraying and FBOSS; regional interconnects and input-balanced mode discussed
Agentforce’s Agent Graph: Toward Guided Determinism with Hybrid Reasoning (engineering.salesforce.com). How Agent Graph tackles drop-off and LLM unpredictability with hybrid reasoning, FSMs, and Agent Script for enterprise AI agents
Google Cloud C4 Brings a 70% TCO improvement on GPT OSS with Intel and Hugging Face (huggingface.co). Google Cloud C4 with Intel Xeon Granite Rapids improves GPT OSS MoE inference throughput and delivers 1.7x TCO savings over C3 via Intel-Hugging Face optimizations
Identify User Journeys at Pinterest (medium.com/pinterest-engineering). Explore dynamic extraction, hierarchical clustering, journey naming, ranking with diversification, stage prediction, and LLM-based evaluation for Pinterest's user journeys
🌱 Careers & Culture
Vantage on management: Engineering is science is engineering (tiendil.org). Engineering and science share model-building and experimentation, converging through experimentation, reassembly of tech stacks, and predictive practices like AI/ML
Future-Proofing Your AI Engineering Career in 2026 (machinelearningmastery.com). Future-proof AI engineering in 2026 via math foundations, system automation, cross-domain fluency, open source, and ethics
Review: Genius Makers (blog.piaw.net). Biographical look at Geoff Hinton and students, tracing neural networks history and AI industry moves through Google and China
From Columns to Rewards: Automating the Two Pillars That Drive Modern AI (tomtunguz.com). Reinforcement learning basics, feature engineering history, AutoML vs AutoRL, and the shift toward automated reward design
Engineering in the Age of Agents with Yechezkel Rabinovich (softwareengineeringdaily.com). eBPF-powered observability with groundcover; BYOC model, kernel sensors, AI’s impact on code review and root-cause analysis
Stop Feeling Lost: How to Master ML System Design (towardsdatascience.com). Practical framework for ML system design: business problem, data acquisition, feature engineering, model selection, deployment, and monitoring
Data and AI culture: How Nu’s philosophy became a competitive advantage (building.nubank.com). Nu's data and AI culture, Data Mesh, autonomy, and 100+ ML models powering customer-centric decisions
🤖 LLM Engineering Experiments
modded-nanogpt medium world record: Re-using intermediate activations in the output latents (snimu.github.io). Modded-nanogpt medium record reuses layer-11 activations in output latents with learned weights, exploring backout hypothesis and multi-layer skip experiments
Writing an LLM from scratch, part 22 -- finally training our LLM! (gilesthomas.com). Training an LLM from scratch, comparing GPT-2 weights, using AdamW, temperature and top-k sampling, and cost considerations
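The temperature and top-k sampling mentioned in that entry can be sketched in a few lines of NumPy (a minimal illustration of the standard technique, not the article's code):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
    """Sample a token index from logits with temperature and top-k filtering."""
    rng = rng or np.random.default_rng()
    # Temperature scaling: lower temperature sharpens the distribution.
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Top-k filtering: mask everything below the k-th highest logit.
    if top_k < logits.size:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    # Numerically stable softmax, then sample from the resulting distribution.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))
```

With `top_k=1` this reduces to greedy decoding; raising the temperature flattens the distribution and increases sampling diversity.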
Scaling LLM Inference: Innovations in Tensor Parallelism, Context Parallelism, and Expert Parallelism (engineering.fb.com). Meta shares tensor, context, and expert parallelism innovations for scalable LLM inference and long-context handling
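The core idea behind tensor parallelism can be simulated on one machine: shard a weight matrix column-wise across "devices", compute partial results, and concatenate (the all-gather step). A toy NumPy sketch, not Meta's implementation:

```python
import numpy as np

def column_parallel_matmul(x, W, n_shards=2):
    """Tensor-parallel matmul sketch: split W's columns across n_shards
    simulated devices; each computes its slice, then results are gathered."""
    shards = np.array_split(W, n_shards, axis=1)   # one slice per "device"
    partials = [x @ w for w in shards]             # local matmuls
    return np.concatenate(partials, axis=-1)       # all-gather
```

In a real system each shard lives on a different GPU and the concatenation is a collective communication op; the arithmetic is identical.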
supplement to 0.430 (aarnphm.xyz). DeepSeek MLA: multi-head latent attention, KV compression, and on-device serving with vLLM and RoPE-enhanced queries
NVIDIA DGX Spark + Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0 (simonwillison.net). EXO 1.0: NVIDIA DGX Spark vs M3 Ultra Mac Studio for LLM prefill and decode; 4x faster inference on Llama-3.1 8B via streaming KV cache over 10Gb Ethernet
The case for the return of fine-tuning (welovesota.com). Fine-tuning resurges with LoRA, Tinker, PEFT, and open-weight ecosystems enabling modular, controlled AI with personal hardware touches
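The LoRA idea referenced there is compact enough to show directly: keep the pretrained weight frozen and learn a low-rank additive update. A minimal forward-pass sketch (illustrative shapes, not any particular library's API):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """LoRA forward pass: y = x W^T + (alpha/r) * x A^T B^T.
    W is the frozen pretrained weight (out, in); only A (r, in) and
    B (out, r) are trained. B starts at zero, so training begins at
    the pretrained behavior."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```

Because only A and B are updated, the trainable parameter count drops from `out*in` to `r*(out+in)`, which is what makes fine-tuning on personal hardware practical.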
⚙️ ML Infrastructure
Modernising Grab’s model serving platform with NVIDIA Triton Inference Server (engineering.grab.com). Grab adopts NVIDIA Triton Inference Server with a Triton manager to consolidate engines, achieving 50% online deployments migrated and tail-latency reductions
VERA-X: Introducing the First Native Vectorized Apache Flink® Engine (ververica.com). VERA-X is a native vectorized Apache Flink engine improving throughput, latency, and resource usage with full Flink API compatibility
Scaling Large MoE Models with Wide Expert Parallelism on NVL72 Rack Scale Systems (developer.nvidia.com). Wide Expert Parallelism with NVL72 GB200, NVLink, and TensorRT-LLM boosts MoE inference throughput and lowers TCO
Fast PEFT Serving at Scale (databricks.com). Databricks builds a proprietary inference runtime to scale PEFT and LoRA serving with FP8 quantization, hybrid attention, kernel fusion, and multi-stream GPU scheduling
Configure and verify a distributed training cluster with AWS Deep Learning Containers on Amazon EKS (aws.amazon.com). Configure and verify a distributed training cluster on Amazon EKS using AWS DLCs with PyTorch, NCCL, EFA, FSx Lustre, and etcd/Kubeflow tools
Accelerate large-scale AI training with Amazon SageMaker HyperPod training operator (aws.amazon.com). Deploy and manage large-scale AI training with SageMaker HyperPod, EKS add-on, fault resilience, log monitoring, and FSDP-based multi-node PyTorch
Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system (tomshardware.com). Alibaba Cloud’s Aegaeon pooling cuts Nvidia GPU use by 82%, enabling up to 9x output with 213 GPUs for multiple LLMs
🧰 Practical ML Engineering
Tracking Down Mysterious ML Training Stalls (medium.com/@Pinterest_Engineering). Pinterest's ML training stalls traced to torch.compile interactions and a Ray monitoring task, resolved by removing a psutil memory_full_info call
Order Book Imbalance Analysis with QuestDB Arrays (questdb.com). Learn to analyze order book imbalance data using QuestDB arrays with Grafana visualizations
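The order-book imbalance metric that article analyzes is a simple ratio over the top levels of the book; a plain-Python sketch of the common definition (the QuestDB post computes it in SQL over arrays):

```python
def order_book_imbalance(bid_sizes, ask_sizes, depth=5):
    """Imbalance in [-1, 1] over the top `depth` levels:
    positive values indicate buy-side pressure, negative sell-side."""
    b = sum(bid_sizes[:depth])
    a = sum(ask_sizes[:depth])
    return (b - a) / (b + a) if (b + a) else 0.0
```
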
How We Stress Test Credit Models (barnesanalytics.com). Five-pillar framework for stress-testing credit models (data checks, backtesting, macro scenarios, sensitivity tests, and explainability/governance) with SHAP, calibration, and TPRM relevance
Growing a 454-page ML reference manual in 5 days: permacomputer harvest (russell.ballestrini.net). Harvesting a 454-page ML reference manual in 5 days via permacomputer automation and multi-language seed implementations
ML quacks: Combining duckdb and mlpack (dirk.eddelbuettel.com). Combining duckdb with mlpack to run adaboost on rectangular datasets via a duckdb extension
🔬 Applied ML Research
High-throughput optical neuromorphic graphic processing at millions of images per second (elight.springeropen.com). OGPU uses a vertical VCSEL array and MI-DNNs for high-speed, low-light optical computation delivering 25 MS/s and up to 98.6% accuracy on MNIST subsets
Bringing AI to the next generation of fusion energy (deepmind.google). Google DeepMind and Commonwealth Fusion Systems use TORAX and reinforcement learning to simulate and optimize SPARC tokamak plasma for fusion energy
Johns Hopkins researchers demonstrate the advantages of using machine learning in medicine (cs.jhu.edu). Johns Hopkins researchers use reinforcement learning to optimize septic shock vasopressor timing and multi-version ML trial designs
Enabling Scalable AI-Driven Molecular Dynamics Simulations (developer.nvidia.com). Integrating PyTorch-based MLIPs with LAMMPS via ML-IAP-Kokkos for scalable, multi-GPU MD simulations and Python-based model loading
Joint Learning of Depth and Appearance for Portrait Images (studios.disneyresearch.com). Joint learning of depth and appearance for portrait images using a diffusion-based generator with a reference identity network and channel-expanded backbone
📊 Modeling & Explainability
Compositional modeling of plant communities with Dirichlet regression (ecogambler.netlify.app). Dirichlet regression with Gaussian process smooths for plant-community composition across elevation and temperature using brms and Hilbert-space GP approximations
Introducing LightSHAP (lorentzen.ch). LightSHAP: a lightweight, framework-agnostic SHAP implementation for tabular data with explain_tree and explain_any examples using CatBoost and linear models
Rfuzzycoco released on CRAN (kforner.netlify.app). Rfuzzycoco released on CRAN demonstrates integration of C++ with R and cooperative-coevolutionary fuzzy modeling for explainable ML
GAN-like Synthetic Data Generation Examples with DistroSimulator (thierrymoudiki.github.io). GAN-like synthetic data using DistroSimulator across univariate/multivariate distributions, digits, Fashion-MNIST, and Olivetti faces
Saving a Trained Kernel Ridge Regression Model Using C# (jamesmccaffreyblog.com). Kernel ridge regression with RBF kernel, Cholesky inverse training, and saving/loading model weights in C#
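The training recipe named in that entry (RBF kernel plus a Cholesky-based solve) is short enough to sketch; the article uses C#, but the math is the same in NumPy:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - z_j||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, gamma=1.0, lam=1e-3):
    """Solve (K + lam*I) alpha = y via Cholesky; alpha is the 'model'."""
    K = rbf_kernel(X, X, gamma) + lam * np.eye(len(X))
    L = np.linalg.cholesky(K)
    return np.linalg.solve(L.T, np.linalg.solve(L, y))

def krr_predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```

Saving the model then amounts to persisting `X_train`, `alpha`, and `gamma`, e.g. with `np.savez`, which mirrors the save/load-weights step the article walks through in C#.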
🧮 Mathematical Foundations
Spherical Ensemble (djalil.chafai.net). Spherical Ensemble: Coulomb gas on S^2, determinantal structure, Möbius invariance, Kostlan observation, and spectral radius analysis
“We can obtain less rigorous but more convincing results by other means” (new paper) (noncommutativeanalysis.wordpress.com). Experimental bounds for dilations of free unitaries; universal commuting dilation constant; semidefinite programming; Copilot; random unitary pairs
Linkage (11011110.github.io). Overview of geometric constructions, puzzles, and algorithms, including Herschel enneahedron truncation, origami links, cuckoo hashing, and Hermite interpolation in curves
Deep neural networks provably solve Bellman equations for Markov decision processes without the curse of dimensionality (ecmiindmath.org). High-dimensional MDPs solved without curse of dimensionality using Monte Carlo, MLP methods, Q-learning, and deep neural networks
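For context on what that paper's networks approximate: in low dimensions the Bellman optimality equation can be solved exactly by value iteration over a tabular MDP. A minimal sketch of that classical baseline (the paper's point is that deep networks extend this to dimensions where tables are infeasible):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=500):
    """Iterate the Bellman optimality operator to a fixed point:
    V(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ].
    P has shape (n_states, n_actions, n_states); R has (n_states, n_actions)."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = (R + gamma * P @ V).max(axis=1)
    return V
```
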
Distribution of coordinates on a sphere (johndcook.com). Uniformly distributed sphere coordinates x, y, z show zero linear correlation but non-independence; uses normal samples, normalization, and distance correlation
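The sphere-sampling construction that post describes is easy to reproduce: normalize 3D Gaussian samples to get uniform points on the unit sphere, then check that the coordinates are uncorrelated yet clearly dependent (they satisfy x² + y² + z² = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
# Uniform points on the unit sphere: normalize standard normal samples.
v = rng.standard_normal((100_000, 3))
pts = v / np.linalg.norm(v, axis=1, keepdims=True)
x, y, z = pts.T

# Linear correlation between coordinates is ~0 ...
corr_xy = np.corrcoef(x, y)[0, 1]
# ... yet the coordinates are not independent: knowing x and y fixes |z|.
```

The post goes further and uses distance correlation, which does detect this nonlinear dependence where Pearson correlation cannot.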
Lecture - Theories of Deep Learning MT25, IV, Data classes for which DNNs can overcome the curse of dimensionality (ollybritton.com). Overview of data classes where DNNs can beat the curse of dimensionality, including hidden manifold model, ReLU expressivity, and trajectory-based complexity
Lecture - Theories of Deep Learning MT25, II, Why deep learning (ollybritton.com). Overview of theories behind deep learning, architectures, and key papers shaping modern DL practice
📚 Academic Research
GRank: Towards Target-Aware and Streamlined Industrial Retrieval with a Generate-Rank Framework (arxiv:cs). GRank proposes a generator→rank retrieval pipeline that removes item-centric indices, boosting Recall@500 and P99 QPS. Production-validated; simplifies maintenance and raises large-scale retrieval throughput
MTmixAtt: Integrating Mixture-of-Experts with Multi-Mix Attention for Large-Scale Recommendation (arxiv:cs). MTmixAtt combines AutoToken clustering with MoE and Multi-Mix Attention to model heterogeneous features, improving CTR/CTCVR and scaling to 1B parameters. Production A/B tests at Meituan show tangible commercial gains
PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training (arxiv:cs). PRISM models operator-level runtime variability with probabilistic latencies and Monte Carlo simulation to predict full training-time distributions. Useful for engineers optimizing parallelization, prioritizing kernels, and reducing fleet variability
Unbiased Gradient Low-Rank Projection (arxiv:cs). GUM (GaLore Unbiased with Muon) gives an unbiased, memory-efficient low-rank optimizer with convergence guarantees matching Muon, improving empirical fine-tuning and pretraining performance while saving memory
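The memory saving behind GaLore-style optimizers comes from keeping optimizer state in a low-rank subspace of the gradient. A conceptual NumPy sketch of that projection step (not the paper's GUM algorithm, which additionally corrects the projection's bias):

```python
import numpy as np

def low_rank_project(grad, r):
    """Project an (m, n) gradient onto its top-r left singular subspace,
    so optimizer state is stored at shape (r, n) instead of (m, n)."""
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    P = U[:, :r]          # (m, r) projection basis
    g_low = P.T @ grad    # (r, n) compressed gradient
    return P, g_low

def low_rank_restore(P, g_low):
    """Map the low-rank update back to the full parameter shape."""
    return P @ g_low
```
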
Secure Sparse Matrix Multiplications and their Applications to Privacy-Preserving Machine Learning (arxiv:cs). Introduces MPC algorithms for secret sparse matrix multiplication that avoid dense blowup and cut communication (sometimes ×1000) for realistic sparse ML workloads, enabling practical privacy-preserving recommender/genomics pipelines