# Mathematics of Machine Learning and Data Science Seminar

### PAST MML RESEARCH GROUP SEMINAR (Nov -- Dec 2023)

The research group of Philipp Petersen organizes a research seminar, which takes place weekly on Wednesdays.

Date | Presenter | Title |
---|---|---|
Nov 8 2023 (online -- 15:00) | A. Martina Neuman (UWien) | Transferability of GNNs using sampling and graphon theories |
Dec 6 2023 (in person -- 15:00) SR9, Kolingasse 14 | Philipp Petersen (UWien) | Optimal learning of piecewise smooth functions |
Dec 13 2023 (in person -- 15:00) SR9, Kolingasse 14 | Adeyemi D. Adeoye (Scuola IMT and UWien) | Self-concordant regularization in machine learning |

# JOINT RESEARCH SEMINAR: MATH OF MACHINE LEARNING AND DATA SCIENCE

**From Jan 2024 onward, the research groups of Philipp Petersen and Philipp Grohs organize a joint research group seminar, the "Math of Machine Learning and Data Science" seminar.**

The joint seminar takes place weekly on Wednesdays. Talks are not held on Austrian national or public holidays or during semester breaks. In the event that a talk is held online or in hybrid form, the Zoom details will be given on this page.

Organizers: Martina Neuman and Sebastian Schmutzhard-Höfler

The talks are approximately one hour long. Everybody is welcome! If you want to be kept up to date about the weekly seminar talks, you can subscribe to the seminar's mailing list here. The schedule is as follows.

## ***Anthony Gruber's upcoming talk***

Zoom ID: univienna.zoom.us/j/8482968091

Passcode: 517762

## SCHEDULE

Date | Presenter | Title |
---|---|---|
Jan 10 2024 (in person -- 11:00) 7.21, Kolingasse 14 | Andrés Felipe Lerma Pineda (UWien) | Deep neural networks can stably solve high-dimensional, noisy, non-linear inverse problems |
Jan 17 2024 (in person -- 11:00) 7.21, Kolingasse 14 | Bernhard Einberger (AVL) | Machine learning model to predict fuel cell degradation effects |
Jan 24 2024 (in person -- 11:00) 7.21, Kolingasse 14 | Yuanyuan Li (Fudan U) | Two-layer networks with the ReLU^k activation function: Barron spaces and derivative approximation |
**Special Session** Jan 31 2024 (online -- 16:00) | Anastasis Kratsios (McMaster U and Vector Institute) | An approximation theory for metric-space-valued functions with a view towards transformers |
March 06 2024 (in person -- 10:00) SR 12, Kolingasse 14 | [Organization meeting] | [Organization meeting] |
**Special Session** March 13 2024 (online -- 10:00) | Antoine Maillard (ETH Zurich) | From disordered systems to the mathematics of data science |
March 20 2024 (in person -- 10:00) SR 12, Kolingasse 14 | Lukas Liehr (UWien) | Declipping and saturation recovery |
April 10 2024 (in person -- 10:00) SR 12, Kolingasse 14 | Martin Rathmair (UWien) | Gabor phase retrieval via semidefinite programming |
**Special Session** April 17 2024 (online -- 16:00) | Wonjun Lee (UoMinnesota) | Monotone generative modeling via a geometry-preserving mapping |
April 24 2024 (in person -- 10:00) SR 12, Kolingasse 14 | Thomas Dittrich (RICAM) | Operator Learning via Approximation of Symbols |
May 15 2024 (in person -- 10:00) SR 12, Kolingasse 14 | Martina Neuman (UWien) | Efficient Learning Using Spiking Neural Networks Equipped With Affine Encoders and Decoders |
May 22 2024 (in person -- 10:00) SR 12, Kolingasse 14 | Manjot Singh (LMU) | Expressivity of Spiking Neural Networks |
May 29 2024 (in person -- 10:00) SR 12, Kolingasse 14 | Adám Gosztolai (EPFL) | Interpretable geometric description of dynamical systems |
June 05 2024 (in person -- 10:00) SR 12, Kolingasse 14 | Daniel Bartl (UWien) | Statistical behaviour of the max-sliced Wasserstein distance |
June 12 2024 (in person -- 10:00) SR 12, Kolingasse 14 | Martina Neuman (UWien) | Tighter Generalization Bounds on Digital Computers via Discrete Optimal Transport |
**Special Session** June 19 2024 (online -- 16:00) | Anthony David Gruber (Sandia National Lab) | Learning metriplectic systems and other bracket-based dynamics |

## ABSTRACTS

### Transferability of Graph Neural Networks using Graphon and Sampling Theories

##### Martina Neuman

Nov 08 2023

Graph neural networks (GNNs) have become powerful tools for processing graph-based information in various domains. A desirable property of GNNs is transferability, where a trained network can swap in information from a different graph without retraining and retain its accuracy. A recent method of capturing transferability of GNNs is through the use of graphons, which are symmetric, measurable functions representing the limit of large dense graphs. In this talk, I will present an explicit two-layer graphon neural network (WNN) architecture for approximating bandlimited signals, and explain how a related GNN guarantees transferability between both deterministic weighted graphs and simple random graphs. The proposed WNN and GNN architectures overcome issues related to the curse of dimensionality and offer practical solutions for handling graph data of varying sizes.
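The transferability claim can be illustrated numerically with a toy experiment (not the WNN construction from the talk): sample weighted graphs of two different sizes from one fixed graphon, apply the same parameter-free aggregation step to the same latent signal, and check that the outputs nearly agree at (approximately) matching latent positions. The graphon `exp(-|x-y|)` and the signal `sin(2*pi*x)` are illustrative choices.

```python
import numpy as np

W = lambda x, y: np.exp(-np.abs(x - y))   # a smooth graphon on [0, 1]^2
f = lambda x: np.sin(2 * np.pi * x)       # a smooth signal on [0, 1]

def aggregate(n):
    """Sample a weighted graph of size n from W on a regular latent grid and
    apply one normalized aggregation step (a single parameter-free GNN layer)."""
    x = (np.arange(n) + 0.5) / n          # latent node positions
    A = W(x[:, None], x[None, :])         # weighted adjacency induced by W
    return (A @ f(x)) / n                 # Riemann-sum normalization

small, large = aggregate(50), aggregate(2000)
# Both outputs approximate the graphon integral operator applied to f,
# so they nearly coincide at (approximately) matching latent positions.
gap = np.max(np.abs(small - large[np.arange(50) * 40 + 20]))
```

The gap shrinks as both graphs grow, which is the sense in which a filter trained on one graph "transfers" to another sampled from the same graphon.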

### Optimal learning of piecewise smooth functions

##### Philipp Petersen

Dec 06 2023

Deep learning has established itself as, by far, the most successful machine learning approach in sufficiently complex tasks. Nowadays, it is used in a wide range of highly complex applications such as natural language processing or even scientific applications. Its first major breakthrough, however, was achieved by shattering the state-of-the-art in image classification. We revisit the problem of classification or more general learning of functions with jumps by deep neural networks and attempt to find an answer to why deep networks are remarkably effective in this regime. We will interpret the learning of classifiers as finding piecewise constant functions from labelled samples. Piecewise constant/smooth functions also appear in many applications associated with physical processes such as in transport problems where shock fronts develop.

We then precisely link the hardness of the learning problem to the complexity of the regions where the function is smooth. Concretely, we will establish fundamental lower bounds on the learnability of certain regions. Finally, we will show that in many cases, these optimal bounds can be achieved by deep-neural-network-based learning.

In quite realistic settings, we will observe that deep neural networks can learn high-dimensional classifiers without a strong dependence of the learning rates on the dimension.

### Self-concordant regularization in Machine Learning

##### Adeyemi D. Adeoye

Dec 13 2023

Regularization techniques have proved to be key in reducing overfitting issues in machine learning, ultimately improving the machine learning model's ability to generalize well. Such techniques, leading to a so-called "structural risk minimization" problem, can help to customize efficient methods for solving the resulting problem, particularly when one seeks (approximate) second-order methods that account for curvature information in the solution steps for faster convergence. Precisely, when the regularization function is self-concordant, Newton-type methods can select learning rates or step-sizes that exploit the self-concordant property of the regularization function for faster convergence and globalization of the method when the problem is convex.

The generalized Gauss-Newton method exploiting this property will be presented in this talk, showing a promising approximation scheme that can lead to cheap iteration steps, especially when the problem is overparameterized or in the mini-batch setting. I will also present a smoothing framework for constructing such algorithmic self-concordant regularization functions from penalty functions that are integral parts of the optimization problem, such as those added to promote specific structures in the solution estimates. Numerical examples that demonstrate the efficiency of this approach and its superiority over existing approaches will be shown. While our framework currently covers the convex setting, a toy example of a neural network training problem shows that our approach is promising for non-convex optimization. Finally, I will present a Julia package associated with our framework.
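The role self-concordance plays in step-size selection can be sketched with the classical damped Newton scheme (this is a textbook log-barrier example, not the talk's generalized Gauss-Newton method): the step size 1/(1+λ(x)), with λ the Newton decrement, converges globally without any line search.

```python
import numpy as np

# Minimize c.x - sum_i log(b_i - a_i.x): a linear term plus a
# self-concordant log-barrier, over the polytope {x : A x < b}.
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([1.0, 1.0, 1.0])
c = np.array([0.3, 0.5])

def grad_hess(x):
    s = b - A @ x                        # slacks; remain positive along the run
    return c + A.T @ (1.0 / s), A.T @ np.diag(1.0 / s**2) @ A

x = np.zeros(2)                          # strictly feasible start
for _ in range(50):
    g, H = grad_hess(x)
    d = np.linalg.solve(H, g)
    lam = np.sqrt(max(g @ d, 0.0))       # Newton decrement
    x = x - d / (1.0 + lam)              # damped step: no line search needed

g, H = grad_hess(x)
lam_final = np.sqrt(max(g @ np.linalg.solve(H, g), 0.0))
```

The damped step keeps every iterate strictly feasible and switches automatically to full Newton steps (quadratic convergence) once λ is small.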

### Deep neural networks can stably solve high-dimensional, noisy, non-linear inverse problems

##### Andrés Felipe Lerma Pineda

Jan 10 2024

We present a neural-network-based method for the solution of inverse problems when only noisy measurements are available. This method addresses problems modeled via an infinite-dimensional continuous operator with a discontinuous inverse. Our proposed method restricts this forward operator to finite-dimensional spaces in such a way that the inverse is Lipschitz continuous. First, we restrict the operator's domain to a finite-dimensional vector space, and then we evaluate the output only at a finite set of sampling points. We prove that this restricted operator has a Lipschitz continuous inverse when a sufficient number of samples is chosen. For the class of Lipschitz continuous functions, we construct a neural network which dampens the noise when tested with perturbed data. Moreover, this neural network can be trained with noisy data and performs well when tested with additional noisy data. We provide numerical examples from real-life applications to illustrate the feasibility of our method.

### Machine Learning Model to Predict Fuel Cell Degradation Effects

##### Bernhard Einberger

Jan 17 2024

Fuel Cell technology could play a key role in the industry's transformation towards decarbonization. The component must therefore be well designed and long-lasting. Several simulation tools are developed by AVL in order to increase the engineering capability of that technology. This thesis introduces a workflow that describes how to generate Fuel Cell degradation data with a commercial 3D-CFD solver (FireM©) and process the data by means of Machine Learning tools. A detailed simulation model is utilized to calibrate a simplified simulation model. The Electrocatalytic Surface Area suggested by the simplified simulation model differs from measurements by roughly 30%. Two ML models were set up to predict degradation effects: a Kernel Ridge Regression and a Dense Neural Network model. The following degradation effects are predicted: current density reduction, equivalent weight (membrane), platinum particle number (CAT-CL), specific platinum surface (CAT-CL), membrane thickness, CAT-CL thickness and ANO-CL thickness. The ML models' prediction error is less than 4% for each degradation effect. A way to de-feature time-series data (e.g., a load profile) utilizing Principal Component Analysis is introduced and proven to be applicable to this task. The purpose of the ML model is to better understand the effect of operating conditions on Fuel Cell degradation and thereby increase the engineering capability.
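The PCA-based de-featuring of time-series inputs mentioned above can be sketched as follows; the synthetic "load profiles" and the component count are illustrative stand-ins, not AVL's actual data pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic load profiles: 200 time series of 500 steps each, built from
# a few smooth modes plus noise (stand-ins for recorded operating profiles).
t = np.linspace(0.0, 1.0, 500)
modes = np.stack([np.sin(2 * np.pi * k * t) for k in (1, 2, 3)])
coeffs = rng.normal(size=(200, 3))
profiles = coeffs @ modes + 0.05 * rng.normal(size=(200, 500))

# PCA via SVD of the centered data matrix.
mean = profiles.mean(axis=0)
U, S, Vt = np.linalg.svd(profiles - mean, full_matrices=False)
k = 3
features = U[:, :k] * S[:k]          # low-dimensional features per profile

# Fraction of total variance captured by the k retained components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
```

Each 500-step profile is compressed to a handful of scores, which can then serve as inputs to a regression model such as kernel ridge regression.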

### Two-layer networks with the ReLU^k activation function: Barron spaces and derivative approximation

##### Yuanyuan Li

Jan 24 2024

We investigate the use of two-layer networks with the rectified power unit, which is called the ReLU^k activation function, for function and derivative approximation. By extending and calibrating the corresponding Barron space, we show that two-layer networks with the ReLU^k activation function are well-designed to approximate an unknown function and its derivatives simultaneously. When the measurement is noisy, we propose a Tikhonov-type regularization method and provide error bounds when the regularization parameter is chosen appropriately. Several numerical examples support the efficiency of the proposed approach.
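As a rough illustration of the setting (not the paper's estimator), one can fit a two-layer ReLU^k model with random inner weights and Tikhonov-regularized outer coefficients, then read off a derivative approximation from the same expansion. The target function, widths, and regularization parameter below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m, n = 2, 300, 500                  # ReLU^k power, random features, samples

# Two-layer random features phi_j(x) = max(0, w_j x + b_j)^k on [-1, 1];
# only the outer coefficients are trained (a random-feature simplification).
w = rng.normal(size=m)
b = rng.uniform(-1.0, 1.0, size=m)
phi = lambda x: np.maximum(0.0, np.outer(x, w) + b) ** k
dphi = lambda x: k * w * np.maximum(0.0, np.outer(x, w) + b) ** (k - 1)

# Noisy samples of the target, fitted with Tikhonov (ridge) regularization.
x_tr = rng.uniform(-1.0, 1.0, n)
y_tr = np.sin(np.pi * x_tr) + 0.01 * rng.normal(size=n)
P = phi(x_tr)
lam = 1e-3
a = np.linalg.solve(P.T @ P + lam * np.eye(m), P.T @ y_tr)

# The same coefficient vector approximates the target's derivative too,
# since for k >= 2 the fitted function is continuously differentiable.
x_te = np.linspace(-0.9, 0.9, 101)
err_f = np.max(np.abs(phi(x_te) @ a - np.sin(np.pi * x_te)))
err_df = np.max(np.abs(dphi(x_te) @ a - np.pi * np.cos(np.pi * x_te)))
```

The point of using k ≥ 2 is visible here: differentiating the expansion term by term is legitimate, so one fit yields simultaneous function and derivative approximation.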

### An Approximation Theory for Metric Space-Valued Functions With A View Towards Deep Learning

##### Anastasis Kratsios

Jan 31 2024

We build universal approximators of continuous maps between arbitrary Polish metric spaces X and Y using universal approximators between Euclidean spaces as building blocks. Earlier results assume that the output space Y is a topological vector space. We overcome this limitation by "randomization": our approximators output discrete probability measures over Y. When X and Y are Polish without additional structure, we prove very general qualitative guarantees; when they have suitable combinatorial structure, we prove quantitative guarantees for Hölder-like maps, including maps between finite graphs, solution operators to rough differential equations between certain Carnot groups, and continuous non-linear operators between Banach spaces arising in inverse problems. In particular, we show that the required number of Dirac measures is determined by the combinatorial structure of X and Y. For barycentric Y, including Banach spaces, R-trees, Hadamard manifolds, or Wasserstein spaces on Polish metric spaces, our approximators reduce to Y-valued functions. When the Euclidean approximators are neural networks, our constructions generalize transformer networks, providing a new probabilistic viewpoint of geometric deep learning.

As an application, we show that the solution operator to an RDE can be approximated within our framework.

### From disordered systems to the mathematics of data science

##### Antoine Maillard

Mar 13 2024

I will introduce a connection between statistical learning or high-dimensional inference, and models of statistical mechanics in which a large number of elementary units interact with each other in a disordered manner. This relationship is the basis of a vast mathematical program leveraging tools and insights from the physics of disordered systems to improve our understanding of high-dimensional probabilistic models, with applications e.g. in machine learning and constraint satisfaction problems. I will then detail concrete examples of this program by presenting recent advances concerning the phase retrieval problem, as well as a challenging open problem of high-dimensional probability known as the ellipsoid fitting conjecture.

### Declipping and saturation recovery

##### Lukas Liehr

Mar 20 2024

A fundamental problem in signal analysis is the recovery of an object from clipped or saturated measurements. The latter refers to the effect that measurement devices can only measure quantities within a certain range. A classical example is the recording of audio, where sound can only be captured within a specific decibel range. The problem of recovering an object from clipped or saturated measurements is known as declipping or saturation recovery. The authors of the recent paper [W. Alharbi, D. Freeman, D. Ghoreishi, B. Johnson, N. L. Randrianarivony, Declipping and the recovery of vectors from saturated measurements, arXiv:2402.03237 (2024)] establish a rigorous mathematical framework for this problem in terms of frame theory. The purpose of this talk is to provide a self-contained introduction to saturation recovery and to discuss the main results of the aforementioned article.
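A toy version of the recovery problem (far simpler than the frame-theoretic setting of the paper) clips a signal with a known low-dimensional model and recovers it by least squares on the unsaturated samples alone; the Fourier-sparse model and clipping level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# A mean-zero signal with a few Fourier modes, sampled at n points.
n, K = 256, 4
t = np.arange(n) / n
F = np.column_stack([np.cos(2 * np.pi * k * t) for k in range(1, K + 1)]
                    + [np.sin(2 * np.pi * k * t) for k in range(1, K + 1)])
c = rng.normal(size=F.shape[1])
x = F @ c

# Hard clipping (saturation): values beyond the level tau are lost.
tau = 0.5 * np.max(np.abs(x))
y = np.clip(x, -tau, tau)
unclipped = np.abs(y) < tau            # reliable samples, detectable from y

# Naive declipping: least squares on the reliable samples alone, using the
# known low-dimensional model; exact when enough samples survive clipping.
c_hat, *_ = np.linalg.lstsq(F[unclipped], y[unclipped], rcond=None)
err = np.max(np.abs(F @ c_hat - x))
```

The frame-theoretic results discussed in the talk characterize when such recovery is possible without this kind of strong parametric model.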

### Gabor phase retrieval via semidefinite programming

##### Martin Rathmair

April 10 2024

Given a function space X consisting of complex-valued functions, the corresponding phase retrieval problem is to reconstruct F ∈ X from phase-less observations, that is from (samples of) its modulus |F|. Among other difficulties, the lack of convexity makes such a problem a challenging task. We discuss the case where X is the image of L2 under the short-time Fourier transform (STFT) with Gaussian window, i.e., reconstruction of the STFT from the spectrogram. We propose a reconstruction scheme which is based on solving two convex problems (both of which can be reformulated as semidefinite programs) and argue that the method accurately reconstructs F (up to a constant phase factor) provided that |F| is connected in a certain graph-theoretic sense (note that the problem at hand requires some sort of connectivity to be well-posed). This is joint work with P. Jaming (U. Bordeaux).

### Monotone generative modeling via a geometry-preserving mapping

##### Wonjun Lee

April 17 2024

Generative Adversarial Networks (GANs) are powerful tools for creating new content, but they face challenges such as sensitivity to starting conditions and mode collapse. To address these issues, we propose a deep generative model that utilizes the Gromov-Monge embedding (GME). It helps identify the low-dimensional structure of the underlying measure of the data and then map it, while preserving its geometry, into a measure in a low-dimensional latent space, which is then optimally transported to the reference measure. We guarantee the preservation of the underlying geometry by the GME and c-cyclical monotonicity of the generative map, where c is an intrinsic embedding cost employed by the GME. The latter property is a first step in guaranteeing better robustness to initialization of parameters and mode collapse. Numerical experiments demonstrate the effectiveness of our approach in generating high-quality images, avoiding mode collapse, and exhibiting robustness to different starting conditions.

### Operator Learning via Approximation of Symbols

##### Thomas Dittrich

April 24 2024

Operator learning is a recent development in the simulation of partial differential equations (PDEs) by means of neural networks. The idea behind this approach is to learn the behavior of an operator, such that the resulting neural network is an (approximate) mapping in infinite-dimensional spaces that is capable of simulating the PDE. In our work we study some general approximation capabilities for linear differential operators by approximating the corresponding symbol in the Fourier domain. Analogous to the structure of the class of Hörmander symbols, we consider the approximation with respect to a topology that is induced by a sequence of semi-norms. In that sense, we measure the approximation error in terms of a Fréchet metric and our main result identifies sufficient conditions for achieving a predefined approximation error. Secondly, we then focus on a natural extension of our main theorem, in which we manage to reduce the assumptions on the sequence of semi-norms. Based on some existing approximation result for the exponential spectral Barron space, we then present a concrete example of symbols that can be approximated well and we also show the analogy of this approximation to the design of digital filters in Signal Processing.

### Efficient Learning Using Spiking Neural Networks Equipped With Affine Encoders and Decoders

##### Martina Neuman

May 15 2024

We study the learning problem associated with spiking neural networks. Specifically, we consider hypothesis sets of spiking neural networks with affine temporal encoders and decoders and simple spiking neurons having only positive synaptic weights. We demonstrate that the positivity of the weights continues to enable a wide range of expressivity results, including a striking sorting property, a rate-optimal approximation of smooth functions or approximation without the curse of dimensionality. Moreover, positive-weight spiking neural networks are shown to depend continuously on their parameters which facilitates classical covering number-based generalization statements. Finally, we observe that from a generalization perspective, contrary to feedforward neural networks or previous results for general spiking neural networks, the depth has little to no adverse effect on the generalization capabilities.

### Expressivity of Spiking Neural Networks

##### Manjot Singh

May 22 2024

The synergy between spiking neural networks and neuromorphic hardware holds promise for the development of energy-efficient AI applications. Inspired by this potential, we revisit the foundational aspects to study the capabilities of spiking neural networks where information is encoded in the firing time of neurons. Under the Spike Response Model as a mathematical model of a spiking neuron with a linear response function, we compare the expressive power of artificial and spiking neural networks, where we initially show that they realize piecewise linear mappings. In contrast to ReLU networks, we prove that spiking neural networks can realize both continuous and discontinuous functions. Moreover, we provide complexity bounds on the size of spiking neural networks to emulate multi-layer (ReLU) neural networks. Restricting to the continuous setting, we also establish complexity bounds in the reverse direction for one-layer spiking neural networks.
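The piecewise-linear claim can be made concrete in a simplified setting (a toy linear-response neuron, not the talk's full Spike Response Model): with positive weights, the first threshold crossing of V(t) = Σ_i w_i (t − t_i)_+ is a piecewise-linear function of the input firing times.

```python
import numpy as np

def firing_time(times, weights, theta):
    """First time V(t) = sum_i w_i * max(t - t_i, 0) reaches theta,
    for positive weights w_i; returns inf if the threshold is never hit."""
    order = np.argsort(times)
    t = np.asarray(times, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    for j in range(len(t)):
        slope = w[: j + 1].sum()           # slope while t_0..t_j are active
        t_star = (theta + (w[: j + 1] * t[: j + 1]).sum()) / slope
        upper = t[j + 1] if j + 1 < len(t) else np.inf
        if t[j] <= t_star <= upper:        # crossing lies in this segment
            return t_star
    return np.inf

# On each region where the active set is fixed, the output time is an
# affine function of the input spike times: a piecewise-linear map.
out = firing_time([0.1, 0.3], [1.0, 2.0], theta=1.0)   # (1 + 0.1 + 0.6) / 3
```

Shifting all input times by a constant shifts the output by the same constant, one instance of the affine behaviour on each linear region.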

### Interpretable geometric description of dynamical systems

##### Adám Gosztolai

May 29 2024

It is increasingly recognised that computations in the brain and artificial neural networks can be understood as outputs of a high-dimensional dynamical system formed by the activity of large neural populations. Yet revealing the underlying latent dynamical processes from data and interpreting their relevance in computational tasks remains challenging. A prominent line of research has observed that neural activity is often confined to low-dimensional smooth manifolds. However, there is a lack of theoretical frameworks for the unsupervised representation of neural dynamics that are interpretable based on behavioural variables, comparable across systems, and decodable to behaviour with high accuracy. In this talk, I will introduce Manifold Representation Basis Learning (MARBLE), an unsupervised representation-learning framework for non-linear dynamical systems. Our approach combines empirical dynamical modelling and geometric deep learning to decompose state space dynamics into a statistical distribution of local flow fields. I will show that this decomposition allows the development of unsupervised learning algorithms and obtaining latent representations that are preserved under different manifold embeddings. I will show that MARBLE representations give rise to a well-defined similarity metric between dynamical systems, e.g., different instances of recurrent neural networks and animals. Moreover, they are expressive enough to compare computations and detect continuous and discontinuous changes due to external variables. In neuroscience, I will show that these properties allow the discovery of interpretable representations of neural activity in motor, navigation and cognitive tasks and lead to significantly higher decoding performance than state-of-the-art methods. Our results suggest that using the manifold structure yields a new class of algorithms with higher performance and the ability to assimilate data across experiments.

### Statistical behaviour of the max-sliced Wasserstein distance

##### Daniel Bartl

June 05 2024

I will begin by defining the Wasserstein distance and discussing the challenges posed by the curse of dimensionality when estimating a high-dimensional distribution from i.i.d. samples thereof. I will then explain how the max-sliced Wasserstein distance overcomes these challenges. Finally, I will elaborate on the proof, which relies on a uniform version of the Dvoretzky-Kiefer-Wolfowitz inequality. (Joint work with S. Mendelson.)
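A back-of-the-envelope version of the distance is easy to compute: in one dimension, W1 between equal-size empirical measures is just the mean absolute difference of sorted samples, and a random-direction search (a crude surrogate for the true maximization over the sphere) gives a lower bound on the max-sliced distance.

```python
import numpy as np

rng = np.random.default_rng(3)

def w1_1d(a, b):
    # W1 between equal-size 1D empirical measures: mean |sorted differences|.
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def max_sliced_w1(X, Y, n_dirs=500):
    # Crude surrogate: maximize the projected W1 over random unit directions.
    dirs = rng.normal(size=(n_dirs, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return max(w1_1d(X @ u, Y @ u) for u in dirs)

d, n_samp = 20, 2000
X = rng.normal(size=(n_samp, d))
Y = rng.normal(size=(n_samp, d))
Y[:, 0] += 1.0                         # shift along the first coordinate
base = max_sliced_w1(X, rng.normal(size=(n_samp, d)))   # same-law baseline
est = max_sliced_w1(X, Y)              # detects the one-dimensional shift
```

Because each slice is one-dimensional, the per-direction estimate avoids the curse of dimensionality that plagues the full Wasserstein distance; the statistical behaviour of the maximized quantity is the subject of the talk.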

### Tighter Generalization Bounds on Digital Computers via Discrete Optimal Transport

##### Martina Neuman

June 12 2024

Machine learning models with inputs in a d-dimensional Euclidean space, when implemented on digital computers, generalize, and their generalization gap converges to 0 at a rate of c/N^{1/2} in the sample size N. However, the constant c>0 obtained through classical methods can be large in terms of the ambient dimension d and the machine precision, posing a challenge for sample sizes N ranging from small to realistically large. In this paper, we derive a family of generalization bounds c_m/N^{1/(2\vee m)} tailored for learning models on digital computers, which adapt to both the sample size N and the so-called geometric representation dimension of the discrete learning problem. Adjusting the parameter m according to N results in significantly tighter generalization bounds for practical sample sizes N, while setting m small maintains the optimal dimension-free worst-case rate of O(1/N^{1/2}).

Furthermore, our adaptive generalization bounds are formulated based on our new non-asymptotic result for concentration of measure in discrete optimal transport, established via leveraging metric embedding arguments.

### Learning metriplectic systems and other bracket-based dynamics

##### Anthony Gruber

June 19 2024

The metriplectic formalism is a useful framework for constructing and explaining phenomenological models of physical phenomena that conserve energy and generate entropy. However, general metriplectic equations of motion are highly complicated, relying on delicate compatibility conditions involving the kernels of algebraic brackets. This talk discusses a recent method for machine-learning provably metriplectic dynamics from data in a way that is (1) universally approximating, (2) admits an error estimate, and (3) scales optimally with respect to the number of learnable parameters. Through finite-dimensional benchmark examples, it is shown that the proposed method is fully expressive and capable of reliably learning metriplectic dynamics, even in cases where only partial state data is observed.