My research focuses on efficient machine learning techniques. This page outlines ongoing projects as well as broader research ideas I am exploring. If you would like to learn more, discuss these topics, or collaborate, please feel free to reach out. I am also open to brainstorming about related directions.

🔄 Ongoing projects

These are the projects in which I am currently involved in some capacity.

Activation sparsity in LLMs

I am conducting additional research on extending the ideas from “Universal Properties of Activation Sparsity in Modern Large Language Models”, mostly towards reasoning models and the robustness induced by activation sparsity.
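
A minimal sketch of how such sparsity could be measured, assuming GPT-2's module layout and the Hugging Face Transformers API; the model choice, the hook target, and the zero threshold are placeholders to adapt to the models of interest:

```python
# Minimal sketch: measure post-activation sparsity in GPT-2's MLP blocks via
# forward hooks. The model choice, the hook target (block.mlp.act), and the
# zero threshold are placeholders; other architectures need different targets.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

THRESHOLD = 1e-2  # treat activations below this magnitude as "inactive"

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

sparsity_per_layer = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        frac_inactive = (output.abs() < THRESHOLD).float().mean().item()
        sparsity_per_layer.setdefault(layer_idx, []).append(frac_inactive)
    return hook

for i, block in enumerate(model.transformer.h):
    block.mlp.act.register_forward_hook(make_hook(i))

with torch.no_grad():
    inputs = tok("Activation sparsity is a universal property.", return_tensors="pt")
    model(**inputs)

for i, fracs in sorted(sparsity_per_layer.items()):
    print(f"layer {i:2d}: {sum(fracs) / len(fracs):.1%} of MLP activations below threshold")
```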

Efficient multilingual drafters for speculative decoding

Most current speculative decoding approaches are developed using purely English training sources, so we are trying to come up with an efficient way of obtaining multilingual drafters that match the increasingly multilingual capabilities of current frontier LLMs.
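
For context, here is a minimal sketch of the draft-and-verify loop with greedy acceptance (not the full rejection-sampling scheme); `draft_step` and `target_logits` are hypothetical callables standing in for the drafter and the target model. The drafter's quality on a given language directly gates how many drafted tokens get accepted, which is exactly where multilingual coverage matters:

```python
# Minimal sketch of a draft-and-verify loop with greedy acceptance (not the full
# rejection-sampling scheme). `draft_step` and `target_logits` are hypothetical
# callables standing in for the small drafter and the large target model.
from typing import Callable, List

def speculative_decode(
    prefix: List[int],                                        # non-empty prompt token ids
    draft_step: Callable[[List[int]], int],                   # drafter: tokens -> next token id
    target_logits: Callable[[List[int]], List[List[float]]],  # target: tokens -> per-position logits
    gamma: int = 4,                                           # drafted tokens per round
    max_new_tokens: int = 32,
) -> List[int]:
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:
        n = len(tokens)
        # 1) Draft gamma tokens cheaply with the small drafter.
        drafted = []
        for _ in range(gamma):
            drafted.append(draft_step(tokens + drafted))
        # 2) Score the whole drafted continuation with the target in a single pass;
        #    logits[j] predicts the token at position j + 1.
        logits = target_logits(tokens + drafted)
        # 3) Accept drafted tokens while they match the target's argmax, then take
        #    the target's own token at the first mismatch.
        for i, tok in enumerate(drafted):
            row = logits[n + i - 1]
            target_tok = max(range(len(row)), key=row.__getitem__)
            if tok != target_tok:
                tokens.append(target_tok)
                break
            tokens.append(tok)
    return tokens[: len(prefix) + max_new_tokens]
```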

Calibration and failure prediction in early-exit models

The project is centered on the interesting properties of failure prediction outlined in the “Failure Prediction Is a Better Performance Proxy for Early-Exit Networks Than Calibration” paper.
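
For reference, a minimal sketch of the two proxies computed from an exit's maximum-softmax confidence; the confidence and correctness arrays below are synthetic placeholders for a real exit's held-out predictions:

```python
# Minimal sketch of the two proxies, computed from an exit's maximum-softmax
# confidence. The confidence/correctness arrays are synthetic placeholders for
# a real exit's predictions on held-out data.
import numpy as np

def expected_calibration_error(conf, correct, n_bins=15):
    """Standard equal-width-binned ECE."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

def failure_prediction_auroc(conf, correct):
    """AUROC of using confidence to separate correct from incorrect predictions."""
    pos, neg = conf[correct == 1], conf[correct == 0]
    # Probability that a random correct sample is ranked above a random error.
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

rng = np.random.default_rng(0)
conf = rng.uniform(0.3, 1.0, size=2000)
correct = (rng.uniform(size=2000) < 0.9 * conf).astype(float)  # toy confidence-accuracy link

print("ECE:  ", expected_calibration_error(conf, correct))
print("AUROC:", failure_prediction_auroc(conf, correct))
```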

Accelerating diffusion LLMs via activation sparsity

As briefly touched upon in “Universal Properties of Activation Sparsity in Modern Large Language Models”, current diffusion LLMs exhibit highly sparse activation patterns, so we are investigating activation sparsity as a means of accelerating them.
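
A toy sketch of the basic idea, assuming the sparsity lives in the post-activation values of an MLP block: only the active rows of the down-projection need to be touched. The Python loop is purely illustrative and will not be faster than a dense matmul; real gains require sparse or fused kernels:

```python
# Toy sketch of exploiting near-zero post-activation values in an MLP block:
# only the active rows of the down-projection matrix are touched. The shapes
# and the threshold are illustrative placeholders, and the Python loop is for
# clarity only; actual speedups require sparse or fused kernels.
import torch

def sparse_down_projection(h, w_down, threshold=1e-2):
    """h: (seq, d_ff) post-activation values; w_down: (d_ff, d_model)."""
    out = torch.zeros(h.shape[0], w_down.shape[1], dtype=h.dtype)
    for t in range(h.shape[0]):                  # per-token sparsity pattern
        active = h[t].abs() > threshold          # indices of active neurons
        out[t] = h[t, active] @ w_down[active]   # skip inactive rows of w_down
    return out

seq, d_ff, d_model = 4, 2048, 512
h = torch.relu(torch.randn(seq, d_ff)) * (torch.rand(seq, d_ff) > 0.9).float()  # ~95% zeros
w_down = torch.randn(d_ff, d_model)

approx = sparse_down_projection(h, w_down)
exact = h @ w_down
print("max abs error:", (approx - exact).abs().max().item())
```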

✨ Loose research topics and ideas

Modular and adaptive computation in continual learning models

My “Improving Continual Learning Performance and Efficiency with Auxiliary Classifiers” paper shows that CL techniques benefit substantially from an early-exit network architecture, and this was achieved without any CL-specific tricks; perhaps it can be extended further to unlock even better CL performance?
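
As a rough illustration of the architecture, a minimal sketch with linear auxiliary classifiers attached after each backbone stage and a simple averaged loss; the backbone split and classifier design are placeholders rather than the exact setup from the paper:

```python
# Minimal sketch of a backbone with auxiliary classifiers after each stage and
# a simple averaged loss. The backbone split and the classifier design are
# illustrative placeholders, not the exact setup from the paper.
import torch
import torch.nn as nn

class BackboneWithAuxClassifiers(nn.Module):
    def __init__(self, stages, feature_dims, num_classes):
        super().__init__()
        self.stages = nn.ModuleList(stages)
        # One lightweight classifier per stage; the last one is the final exit.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(d, num_classes))
            for d in feature_dims
        )

    def forward(self, x):
        logits = []
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            logits.append(head(x))  # auxiliary prediction after each stage
        return logits

def early_exit_loss(all_logits, targets, criterion=nn.CrossEntropyLoss()):
    # Average the loss over all exits so intermediate features stay discriminative.
    return sum(criterion(l, targets) for l in all_logits) / len(all_logits)

# Toy usage with three conv stages on 32x32 inputs.
stages = [
    nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
    nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU()),
]
model = BackboneWithAuxClassifiers(stages, feature_dims=[32, 64, 128], num_classes=10)
all_logits = model(torch.randn(8, 3, 32, 32))
loss = early_exit_loss(all_logits, torch.randint(0, 10, (8,)))
```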

Efficient tokenization for small language models

Large models are increasingly multilingual and benefit from larger and larger vocabularies, as this enables fewer inference steps; however, for small budgets, such large vocabularies are inefficient and consume most of the model capacity (a back-of-the-envelope sketch of this follows the list below). Some ideas to take this further:

  1. In some applications (e.g., speculative decoding, knowledge distillation) we need matching vocabularies between models; can this somehow be bridged?
  2. How can we most efficiently determine vocabulary size for a given compute budget, balancing model capabilities and inference efficiency?
  3. Can we somehow distill large vocabularies into smaller ones, e.g., to prepare larger models for KD into smaller ones?
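
As promised above, a back-of-the-envelope sketch of the capacity argument: the fraction of parameters consumed by the embedding table and untied LM head for a few illustrative vocabulary sizes and model budgets; none of the numbers correspond to specific models:

```python
# Back-of-the-envelope sketch: share of a small model's parameters consumed by
# the embedding table and (untied) LM head for different vocabulary sizes.
# All numbers are illustrative placeholders, not measurements of real models.
def embedding_share(vocab_size, d_model, non_embedding_params, tied=True):
    table = vocab_size * d_model * (1 if tied else 2)  # input embeddings (+ LM head if untied)
    return table / (table + non_embedding_params)

for vocab in (32_000, 128_000, 256_000):
    for non_emb, d in ((100e6, 768), (1e9, 2048)):
        share = embedding_share(vocab, d, non_emb, tied=False)
        print(f"V={vocab:>7,} d={d:>4} non-embedding={non_emb / 1e6:>6.0f}M -> "
              f"{share:.0%} of parameters in embeddings/LM head")
```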

Early-exits for large-vocab models

Early-exits are cool, but not super compatible with current transformers due to the following issues:

  1. Most models nowadays, especially LLMs, have very large vocabs, while for early-exiting we ideally use many densely spaced classifiers. With a large output size, using many classifiers requires a lot of additional parameters and also costs a lot of compute, so we need some smarter classifier design (a rough cost sketch follows this list).
  2. If we exit on a per-token basis, the tokens for which we exit early do not have the “ground truth” attention scores in the following layers, which means that such scores have to be somehow recomputed or propagated at low cost.
  3. (somewhat related to the previous point) Efficient autoregressive inference is heavily reliant on KV caching, and we do not compute the KV for the layers that follow the early exit. Therefore, can early exits be explored as a (lossy) KV-cache reduction technique? We can either just re-use the cache from the exited layers (and save memory on part of the cache), or come up with some fast way (e.g., a linear transform) to project the cache from early layers to better, deeper representations.
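
A rough cost sketch for the classifier-overhead issue in point 1, comparing K independent d x V heads with a hypothetical shared low-rank (bottleneck) design; all sizes are illustrative placeholders:

```python
# Rough cost sketch for point 1: parameter overhead of K independent early-exit
# classifiers with hidden size d and vocabulary V, versus a hypothetical shared
# low-rank (bottleneck) design. All sizes are illustrative placeholders.
def independent_heads(k, d, vocab):
    return k * d * vocab                   # K separate d x V classifiers

def bottleneck_heads(k, d, vocab, r):
    return k * d * r + r * vocab           # K small d x r adapters + one shared r x V output

d, vocab, k, r = 4096, 128_000, 8, 256
print(f"independent heads: {independent_heads(k, d, vocab) / 1e9:.2f}B extra parameters")
print(f"bottleneck heads:  {bottleneck_heads(k, d, vocab, r) / 1e9:.2f}B extra parameters")
```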

Continual training for foundation models

How can we efficiently add new knowledge (e.g., capabilities in new languages) to pretrained models, without destroying existing capacity? This is very broad but also very interesting and potentially high-impact research.

Quantifying knowledge transfer capabilities in out-of-distribution knowledge distillation scenarios

In some cases we have a trained teacher model, but do not have any access to its training data. This leads to some interesting research questions:

  1. How can we determine which data to use for the most efficient knowledge transfer to student models?
  2. How can we maximize the gains from KD in scenarios where we do not necessarily care about performance on the teacher’s training distribution?

This was broadly inspired by arXiv:2112.00725, which explores distilling image models from augmented views of a single high-resolution image.
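
As a concrete starting point for these questions, a minimal sketch of the distillation objective one could apply to whatever unlabeled transfer data is available when the teacher's training set is not; the temperature and the toy batch are placeholders:

```python
# Minimal sketch of the distillation objective one could apply to arbitrary
# unlabeled transfer data when the teacher's training set is unavailable.
# The temperature and the toy batch are illustrative placeholders.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t ** 2)

# Toy batch; in practice these would be teacher(x) and student(x) evaluated on
# whatever transfer data is chosen.
teacher_logits = torch.randn(16, 100)
student_logits = torch.randn(16, 100, requires_grad=True)
loss = kd_loss(student_logits, teacher_logits)
loss.backward()
```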