The ML Engineer 11-11-2025
invisible watermarking, AI-driven data pipelines, geospatial embeddings
🔧 Company Engineering Blogs
Video Invisible Watermarking at Scale (engineering.fb.com). Meta discusses scalable invisible watermarking for video using CPU-first pipelines, FFmpeg filters, and frame selection to balance quality, detection accuracy, and BD-rate
Inside Kaiju (blog.character.ai). Kaiju: fast, efficient in-house LLMs (13B–110B) with MQA, sliding window attention, int8 training, and safety-focused alignment
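A minimal sketch of the multi-query attention (MQA) idea mentioned above (shapes and weights are illustrative assumptions, not Kaiju's actual config): all query heads share a single key/value head, which shrinks the KV cache that dominates serving memory.

```python
# Sketch of multi-query attention: H query heads, one shared K/V head.
import torch
import torch.nn.functional as F

def mqa(x, Wq, Wk, Wv, n_heads):
    B, T, D = x.shape
    d = D // n_heads
    q = (x @ Wq).view(B, T, n_heads, d).transpose(1, 2)   # (B, H, T, d)
    k = (x @ Wk).unsqueeze(1)                             # (B, 1, T, d): shared
    v = (x @ Wv).unsqueeze(1)                             # across all heads
    att = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (att @ v).transpose(1, 2).reshape(B, T, D)

x = torch.randn(2, 8, 64)
out = mqa(x, torch.randn(64, 64), torch.randn(64, 8), torch.randn(64, 8), n_heads=8)
print(out.shape)   # torch.Size([2, 8, 64]); KV cache is 1/8 the multi-head size
```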
What 986 million code pushes say about the developer workflow in 2025 (github.blog). 986 million code pushes reveal faster, smaller, more continuous shipping and evolving developer workflows in 2025
A Decade of AI Platform at Pinterest (medium.com/pinterest-engineering). A decade-long look at Pinterest’s unified ML platform, covering Linchpin, Scorpion, EzFlow, Galaxy, UFR, MLEnv, and TabularML, plus GPU-centric innovations across Ray and large embedding models
Introducing Nested Learning: A new ML paradigm for continual learning (research.google). Nested Learning proposes multi-level optimization to tackle catastrophic forgetting in ML models, introducing CMS memory and the Hope architecture
Build better software to build software better (slack.engineering). Slack engineers describe speeding up build pipelines using Bazel, caching, layering, and Starlark to cut 60-minute builds down to tens of minutes
🧭 Practitioner Perspectives
2 Years of ML vs. 1 Month of Prompting (levs.fyi). Comparing two years of ML pipeline work (labeling, feature engineering, XGBoost) with one month of LLM prompting for warranty claim classification
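For context, the "two years of ML" side of that comparison is roughly this shape: a classic feature-pipeline classifier. A hedged sketch (data, labels, and hyperparameters are hypothetical, not the author's pipeline):

```python
# TF-IDF features + XGBoost: the classic-ML route for text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

claims = ["compressor failed after two months",
          "screen cracked on arrival",
          "battery drains overnight",
          "arrived with a cracked screen"]
labels = [0, 1, 0, 1]   # e.g. 0 = functional defect, 1 = shipping damage

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1),
)
clf.fit(claims, labels)
print(clf.predict(["device stopped charging"]))
```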
Types of AI Models (cognitiveinheritance.com). Barry S. Stahl explores AI model types (logical, probabilistic/learning, optimization, and hybrids), along with explainability, use cases, and examples
An Approval Model That Finally Got Approved (tiago.rio.br). Will Bank's experiments to boost non-credit user activation, using SageMaker, FastAPI, and data storytelling to drive onboarding improvements
Fifth anniversary of Evidence-based Software Engineering book (shape-of-code.com). Fifth anniversary reflection on Evidence-based Software Engineering, LLM impact, statistical methods, and future model analysis
unifiedml in R: A Unified Machine Learning Interface (thierrymoudiki.github.io). A unified ML interface for R with sklearn-like API, automatic task detection, cross-validation, and model interpretability across glmnet, randomForest, and e1071
🔁 Data Pipelines
Building an FX Liquidity Stress Analysis Workflow with QuestDB (questdb.com). FX liquidity stress analysis workflow using QuestDB, Python, and XGBoost for second-resolution data, feature engineering, and model training
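A sketch of the pattern described (table and column names are hypothetical): QuestDB exposes a PostgreSQL wire endpoint on port 8812, so second-resolution quotes can be pulled with standard Python drivers and fed to XGBoost.

```python
import pandas as pd
import psycopg2
from xgboost import XGBRegressor

conn = psycopg2.connect(host="localhost", port=8812,
                        user="admin", password="quest", dbname="qdb")
df = pd.read_sql("SELECT timestamp, bid, ask, bid_size, ask_size "
                 "FROM fx_quotes WHERE symbol = 'EURUSD'", conn)

# Simple liquidity-stress features: quoted spread and top-of-book imbalance
df["spread"] = df["ask"] - df["bid"]
df["imbalance"] = (df["bid_size"] - df["ask_size"]) / (df["bid_size"] + df["ask_size"])

X = df[["spread", "imbalance"]].iloc[:-1]
y = df["spread"].shift(-1).dropna()      # next-step spread as a stress proxy
model = XGBRegressor(n_estimators=200).fit(X, y)
```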
The Day We Broke the Memory Bank (timeplus.com). Hybrid Hash Join cuts memory to 1.1–1.6 GB vs 3.2–6.2 GB by keeping hot data in memory and spilling cold data to disk
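The underlying idea is a grace-style hybrid hash join. An illustrative sketch (not Timeplus's implementation): partition both sides by hash, keep hot partitions' hash tables in memory, and spill cold build/probe rows to disk for a second pass.

```python
import pickle
import tempfile
from collections import defaultdict

NUM_PARTITIONS = 8
IN_MEMORY = {0, 1}   # partitions whose hash tables stay resident

def hybrid_hash_join(build_rows, probe_rows, key):
    tables = defaultdict(lambda: defaultdict(list))
    cold = [p for p in range(NUM_PARTITIONS) if p not in IN_MEMORY]
    b_spill = {p: tempfile.TemporaryFile() for p in cold}
    p_spill = {p: tempfile.TemporaryFile() for p in cold}

    def read_all(f):
        f.seek(0)
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

    for row in build_rows:                       # build phase
        p = hash(row[key]) % NUM_PARTITIONS
        if p in IN_MEMORY:
            tables[p][row[key]].append(row)
        else:
            pickle.dump(row, b_spill[p])         # cold build rows go to disk

    for row in probe_rows:                       # probe phase, pass 1
        p = hash(row[key]) % NUM_PARTITIONS
        if p in IN_MEMORY:
            yield from ((m, row) for m in tables[p][row[key]])
        else:
            pickle.dump(row, p_spill[p])         # defer cold probes

    for p in cold:                               # pass 2: one partition at a time
        table = defaultdict(list)
        for row in read_all(b_spill[p]):
            table[row[key]].append(row)
        for row in read_all(p_spill[p]):
            yield from ((m, row) for m in table[row[key]])
```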
Lessons learned - 5x throughput on data pipelines with adaptive batching (cocoindex.io). Adaptive batching boosts GPU throughput in CocoIndex pipelines using Python, sentence-transformers, and custom functions
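The core pattern is simple: instead of encoding one text per GPU call, drain whatever requests accumulated while the GPU was busy into a single batch. A minimal sketch (queue plumbing and model choice are our assumptions, not CocoIndex's code):

```python
import queue
import threading
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
requests = queue.Queue()                          # items: (text, reply_queue)
MAX_BATCH = 64

def worker():
    while True:
        batch = [requests.get()]                  # block for the first request
        while len(batch) < MAX_BATCH:             # then drain whatever queued up
            try:
                batch.append(requests.get_nowait())
            except queue.Empty:
                break
        embeddings = model.encode([text for text, _ in batch])
        for (_, reply), emb in zip(batch, embeddings):
            reply.put(emb)

threading.Thread(target=worker, daemon=True).start()

# Caller side: enqueue a text and wait for its embedding
reply = queue.Queue()
requests.put(("hello world", reply))
print(reply.get().shape)   # (384,)
```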
Vector search benchmarking: Setting up embeddings, insertion, and retrieval with PostgreSQL® (instaclustr.com). Benchmarking vector search with PostgreSQL (pgvector) using embeddings, ingestion, and retrieval in a real-world RAG pipeline
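A hedged sketch of the pgvector ingest-and-retrieval loop being benchmarked (database name, table, and embedding model are assumptions, not Instaclustr's setup):

```python
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim embeddings
conn = psycopg2.connect("dbname=vectors")
conn.autocommit = True
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("CREATE TABLE IF NOT EXISTS docs ("
            "id bigserial PRIMARY KEY, body text, embedding vector(384))")

for body in ["postgres can store embeddings", "benchmarks need baselines"]:
    emb = str(model.encode(body).tolist())        # pgvector parses '[x, y, ...]'
    cur.execute("INSERT INTO docs (body, embedding) VALUES (%s, %s::vector)",
                (body, emb))

q = str(model.encode("vector database benchmark").tolist())
cur.execute("SELECT body FROM docs ORDER BY embedding <-> %s::vector LIMIT 3", (q,))
print(cur.fetchall())   # '<->' is L2 distance; '<=>' gives cosine distance
```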
Data Engineering in the Age of AI (oreilly.com). AI-driven change in data engineering: real-time pipelines, RAG, governance, and the role of engineers with GenAI adoption
🌍 Geospatial & Environmental
DeepForest 2.0! (jabberwocky.weecology.org). DeepForest 2.0 release updates ML workflows, HuggingFace model sharing, Hydra configs, and visualization improvements for biodiversity imaging
Day 6: Dimensions (dewey.dunnington.ca). Explores M values in XYM/XYZM geometries, with R and Python workflows using argodata, sedonaDB, and Parquet, plus Argo ocean data
A deep learning-based model for endorsing predictive accuracies of landslide prediction: insights into soil moisture dynamics (geoenvironmental-disasters.springeropen.com). DL-based framework predicts soil moisture VWC for shallow landslides using LSTM, hyperparameter tuning, SHAP, and interval predictions in Hong Kong case studies
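For orientation, the modeling core of such studies is a sequence regressor. A toy sketch (shapes and features are assumed, not the paper's model):

```python
# An LSTM mapping a window of rainfall/soil sensor readings to next-step
# volumetric water content (VWC).
import torch
import torch.nn as nn

class VWCPredictor(nn.Module):
    def __init__(self, n_features=4, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])      # VWC prediction at the final step

model = VWCPredictor()
x = torch.randn(8, 24, 4)                 # 8 slopes, 24 hourly steps, 4 sensors
print(model(x).shape)                     # torch.Size([8, 1])
```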
Reversed (r.iresmi.net). Reversing Mediterranean bathymetry with R (terra, sf) to reveal inverted coastlines and basins
A discussion of geospatial embeddings (spatialists.ch). Geospatial embeddings, AI-driven EO, transparency, and interpretability discussed by Ed Parsons and Ralph Straumann, with AlphaEarth Foundations noted
🔧 Modeling & Training
RL Learning with LoRA: A Diverse Deep Dive (kalomaze.bearblog.dev). LoRA-based SFT and RL fine-tuning in prime-rl, using rsLoRA scaling and multi-environment experiments
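The rsLoRA tweak is small but concrete: scale the adapter output by alpha / sqrt(r) instead of the classic alpha / r, so the adapter's effective magnitude doesn't vanish at high ranks. A sketch (layer plumbing is ours, not prime-rl's code):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=16, alpha=32, rank_stabilized=True):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / math.sqrt(r) if rank_stabilized else alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512), r=64)
print(layer(torch.randn(2, 512)).shape)   # torch.Size([2, 512])
```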
Building a Large Language Model from scratch: A learning journey (velvetshark.com). Learning how to build an LLM from scratch: PyTorch basics, tokenization, embeddings, attention, GPT architecture, and parameter breakdowns, following Sebastian Raschka's approach
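The centerpiece of any such walkthrough is scaled dot-product attention with a causal mask. A compact single-head sketch in plain PyTorch:

```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    # x: (batch, seq, d_model); Wq/Wk/Wv: (d_model, d_head)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
    T = x.shape[1]
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))   # GPT-style masking
    return F.softmax(scores, dim=-1) @ v

x = torch.randn(1, 5, 32)
Wq, Wk, Wv = (torch.randn(32, 16) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)   # torch.Size([1, 5, 16])
```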
Validating Models with Limited Data (barnesanalytics.com). Strategies for validating models with limited data using bootstrap, nested CV, SMOTE, Bayesian Methods, and scenario testing
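One of the listed techniques, bootstrapped confidence intervals on a small holdout, looks roughly like this (synthetic data; the split and model are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

X, y = make_classification(n_samples=120, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X[:80], y[:80])
X_val, y_val = X[80:], y[80:]

rng = np.random.RandomState(0)
scores = []
for _ in range(1000):
    Xb, yb = resample(X_val, y_val, random_state=rng)
    if len(np.unique(yb)) < 2:                 # AUC needs both classes present
        continue
    scores.append(roc_auc_score(yb, model.predict_proba(Xb)[:, 1]))

print(np.percentile(scores, [2.5, 97.5]))      # 95% bootstrap CI for AUC
```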
Navigating Metric Trade-offs in Machine Learning (probableodyssey.blog). Balancing precision and recall in ML classifiers, threshold tuning, multi-metric trade-offs, and strategic compromises
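Threshold tuning is the most mechanical of those trade-offs: sweep the decision threshold over predicted probabilities and pick the precision/recall point you can live with. A sketch with scikit-learn (synthetic, imbalanced data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=500, weights=[0.9], random_state=1)
probs = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

prec, rec, thresholds = precision_recall_curve(y, probs)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
best = np.argmax(f1[:-1])                      # last PR point has no threshold
print(f"threshold={thresholds[best]:.2f} "
      f"precision={prec[best]:.2f} recall={rec[best]:.2f}")
```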
Nested Learning: How Your Neural Network Already Learns at Multiple Timescales (rewire.it). Nested Learning reframes optimization as multi-timescale memory, with HOPE architecture and frequency-aware updates for transformers
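A toy illustration of the multi-timescale framing (ours, not the HOPE architecture itself): "fast" parameters update every step, while "slow" parameters accumulate gradients and commit every K steps, acting as longer-term memory.

```python
import torch

fast = torch.nn.Linear(16, 16)
slow = torch.nn.Linear(16, 16)
opt_fast = torch.optim.SGD(fast.parameters(), lr=1e-2)
opt_slow = torch.optim.SGD(slow.parameters(), lr=1e-3)

K = 8                                   # slow pathway's update period
for step in range(100):
    x = torch.randn(4, 16)
    loss = slow(fast(x)).pow(2).mean()  # toy objective
    loss.backward()
    opt_fast.step()
    opt_fast.zero_grad()                # slow grads keep accumulating
    if step % K == K - 1:               # lower frequency = longer memory
        opt_slow.step()
        opt_slow.zero_grad()
```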
How Neural Machine Translation Actually Works: A Developer's Guide (mostlylucid.net). Deep dive into neural machine translation using embeddings, attention, and transformer encoders/decoders with C#-style examples
🔬 Math & Theory
Mathematical exploration and discovery at scale (terrytao.wordpress.com). Terence Tao and collaborators use AlphaEvolve (Google DeepMind) to evolve code for solving high-dimensional math problems in analysis, combinatorics, and geometry, with Python, Lean proofs, and insights on conjectures
Massively parallel and universal approximation of nonlinear functions using diffractive processors (elight.springeropen.com). Massively parallel universal nonlinear function approximation with linear-diffractive processors using phase-only wavefront encoding and SLMs
Condensation (lesswrong.com). Condensation theory of concepts, latent variables, information theory, and interpretability for AI, with notes by Sam Eisenstat, John Wentworth, and various LW commentators
Toward Statistical Mechanics Of Interfaces Under Selection Pressure (lesswrong.com). ML-inspired training of coupled interfaces, API constraints, and a stat mech view on parameter entropy across components
The Spacetime of Large Language Models (medium.com/data-science-collective). Geometric view of Transformers: curvature, parallel transport, and QKV attention in language models
The shadows lurking in the equations (gods.art). FuzzyGraph visualizes equations in non-binary mode, revealing shadows, black holes, and near-solutions ignored by conventional graphs
📚 Academic Research
DANIEL: A Distributed and Scalable Approach for Global Representation Learning with EHR Applications (arxiv:cs). DANIEL develops a distributed, privacy-preserving Ising-model framework using bi-factored gradient descent for large, heterogeneous EHR datasets. Python ML engineers should care because it offers scalable federated representation learning and practical non-convex optimization techniques for multi-institution clinical applications
NOWS: Neural Operator Warm Starts for Accelerating Iterative Solvers (arxiv:cs). NOWS uses learned neural operators to generate high-quality warm starts for Krylov solvers, reducing PDE iteration counts and end-to-end runtime by up to 90% while preserving solver guarantees. This is directly relevant to engineers integrating ML surrogates with established numerical solvers to speed scientific workflows reliably
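The integration pattern is worth seeing: the learned operator only supplies x0, so the Krylov solver's convergence guarantees are untouched. A sketch with SciPy, mocking the neural operator as a perturbed exact solution (matrix and tolerances are our assumptions):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

n = 1000
A = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n)).tocsr()   # 1-D Poisson matrix
b = np.ones(n)

x_exact = np.linalg.solve(A.toarray(), b)
x0 = x_exact + 0.01 * np.random.randn(n)   # stand-in for a learned prediction

iters = {"cold": 0, "warm": 0}
def counter(key):
    def cb(_xk):
        iters[key] += 1
    return cb

cg(A, b, callback=counter("cold"))
cg(A, b, x0=x0, callback=counter("warm"))
print(iters)   # the warm start should converge in far fewer iterations
```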
High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes (arxiv:stat). Provides rigorous high-dimensional scaling limits for SGD with Polyak momentum and adaptive step-sizes, characterizing dynamics and stability trade-offs in modern regimes. Practical impact: theory-guided insight for optimizer design, hyperparameter choices, and understanding failure modes in high-dimensional training
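For reference, the heavy-ball (Polyak momentum) SGD update the analysis covers, in standard notation (our transcription, not the paper's exact parameterization):

```latex
% \beta = momentum, \eta = step size, f_{i_t} = minibatch loss at step t
v_{t+1} = \beta\, v_t - \eta\, \nabla f_{i_t}(\theta_t), \qquad
\theta_{t+1} = \theta_t + v_{t+1}
```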
The Principles of Diffusion Models (arxiv:cs). Diffusion models fundamentals, training, sampling, and theoretical insights using ML concepts and diffusion processes
Parametric Hierarchical Matrix Approximations to Kernel Matrices (arxiv:math). Introduces parametric hierarchical (H / H^2) matrix constructions enabling offline/online kernel approximations across hyperparameter spaces, yielding 100x+ online speedups. Engineers working with Gaussian processes or kernel models gain dramatic efficiency for hyperparameter sweeps and many-query tasks
PerfDojo: Automated ML Library Generation for Heterogeneous Architectures (arxiv:cs). PerfLLM and PerfDojo cast performance tuning as an RL optimization problem over human-readable, mathematically-inspired code transformations to achieve cross-architecture gains. This matters for ML engineers seeking automated, portable optimization across CPUs, GPUs, and accelerators without deep hardware expertise