Filip Szatkowski

(he/him)

PhD Student

Mistral AI

Warsaw University of Technology

About Me

I am an AI Scientist at Mistral AI and a PhD student at the Warsaw University of Technology, supervised by Professor Tomasz Trzciński. My research focuses on efficiency in deep learning, spanning adaptive computation, early exits, activation sparsity, speculative decoding, and continual learning.

During my PhD, I have published at top conferences such as NeurIPS, ICML and ICLR, and collaborated with European institutions including the Computer Vision Center in Barcelona and Sapienza University of Rome. I also have industry experience, currently working as an AI Scientist at Mistral AI in Warsaw, and previously as an Applied Scientist Intern at Amazon AWS AI and as an NLP Intern at Samsung R&D in Warsaw. I am also active in the Polish ML community, organizing major events such as the ML in PL conferences and summer schools, as well as the ELLIS Doctoral Symposium 2025 in Warsaw.

The downloadable CV is not always kept up to date; this website is the best place for current information.

Featured Publications

Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert (2026). Efficient LLM Moderation with Multi-Layer Latent Prototypes. In ICML 2026.

We develop an efficient approach to LLM input safety moderation using latent prototypes and demonstrate that safe and unsafe inputs are separable in the model’s latent space.

#Large Language Models #LLM Safety #Moderation #Deep Learning

Piotr Kubaty, Filip Szatkowski, Grzegorz Choczyński, Eric Nalisnick, Bartosz Wójcik (2026). Rethinking Calibration for Early-Exit Neural Networks. In ICML 2026.

We challenge the use of calibration metrics in early-exit models and show cases where calibration fails to accurately reflect network performance. We argue for failure prediction as a more reliable performance proxy that better correlates with efficiency gains in early-exit networks.

#Early-Exits #Adaptive Computation #Calibration #Failure Prediction #Efficiency #Deep Learning

Filip Szatkowski, Patryk Będkowski, Alessio Devoto, Jan Dubiński, Pasquale Minervini, Mikołaj Piórczyński, Simone Scardapane, Bartosz Wójcik (2026). Universal Properties of Activation Sparsity in Modern Large Language Models. In ICLR 2026.

We propose a general framework for assessing sparsity robustness in modern LLMs and conduct a systematic study of activation sparsity in such models. Our study reveals universal patterns of sparsity in LLMs and provides practical guidelines for model acceleration and design.

#Large Language Models #Activation Sparsity #Efficiency #Deep Learning

Filip Szatkowski, Yaoyue Zheng, Fei Yang, Bartłomiej Twardowski, Tomasz Trzciński, Joost van de Weijer (2025). Improving Continual Learning Performance and Efficiency with Auxiliary Classifiers. In ICML 2025.

We investigate intermediate representations in neural networks during class-incremental learning and propose leveraging them via auxiliary early-exit classifiers. Interestingly, we find that in continual learning scenarios, networks enhanced with such classifiers are not only more efficient but also show improved performance and reduced forgetting across task sequences.

#Continual Learning #Adaptive Computation #Early Exits #Efficiency #Deep Learning

Filip Szatkowski, Bartosz Wójcik, Mikołaj Piórczyński, Simone Scardapane (2024). Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion. In NeurIPS 2024.

We propose a method for converting dense transformers into dynamic Mixture-of-Experts models that leverages the natural activation sparsity of neural networks. Crucially, we propose enforcing activation sparsity during a short (continual) training process via additional sparsity regularization, and argue for the use of dynamic-k expert routing in MoEfied models. Finally, we show that, with an efficient implementation, our method achieves computational savings while maintaining performance.

#Mixture of Experts #Adaptive Computation #Activation Sparsity #Efficiency #Deep Learning
