🔬 Research projects

My academic roots lie in {theoretical computer science} ∩ {mathematics}, where I was trained to think about structure, complexity, and rigorous analysis. Since 2014, I have shifted toward data science and machine learning, entering a highly competitive and fast-moving field. My non-standard background allows me to approach ML problems with different intuitions—often focusing on structure, interpretability, and principled reasoning. Not having grown up within the standard ML pipeline has also freed me from committing to a narrow niche, and I enjoy working across diverse problem domains. Yet, as varied as these directions may seem, they are unified by a common thread: a search for underlying low-dimensional structure in complex systems, and the development of methods that uncover and exploit it.

1. Modeling human behavior & social systems

We develop computational frameworks to understand structured human behavior in social, digital, and organizational systems. A key concept is semantic dimension analysis, which extends classical dimensionality reduction to produce interpretable behavioral axes. Applications include social networks, administrative data, and gaming and e-learning platforms.

Recently, this line of work has expanded to explore the use of LLM-based personas as digital twins—modeling individuals or populations for applications such as survey simulation and mental health analysis, with the goal of enabling scalable, controlled experimentation on human-centered processes.

Selected publications
  • Beyond the echo chamber: Modelling open-mindedness in citizens’ assemblies (AAMAS, 2025)
  • Predicting churn in online games by quantifying diversity of engagement (Big Data, 2023)
  • Opinion spam detection: A new approach using machine learning and network-based algorithms (ICWSM, 2022)
  • Modeling engagement in self-directed learning systems using principal component analysis (IEEE Transactions on Learning Technologies, 2019)
  • Simple statistics are sometimes too simple: A case study in social media data (IEEE Transactions on Knowledge and Data Engineering, 2019)
  • The million tweets fallacy: Activity and feedback are uncorrelated (ICWSM, 2018)

2. NLP and affective computing

This research explores how structure, semantics, and affect interact in language, with an emphasis on interpretable models. The work combines graph-based representations with linguistic signals to model stance, sentiment, and discourse without relying solely on large black-box models.

Selected publications
  • Acquired taste: Multimodal stance detection with textual and structural embeddings (COLING, 2025)
  • STEM: Unsupervised structural embedding for stance detection (AAAI, 2022)
  • Harald: Augmenting hate speech data sets with real data (EMNLP, 2022)

3. High-dimensional learning & feature selection

We study how to extract meaningful structure from high-dimensional data, focusing on sparse representations and feature selection. A key contribution is a dataset hardness taxonomy showing that many benchmarks are trivial. The same work suggests a new way to generate meaningful benchmarks with a ground truth, enabling principled evaluation of feature selection methods on truly challenging cases.

Selected publications
  • Choosing the right dataset: Hardness criteria for feature selection benchmarking (Knowledge-Based Systems, 2025)
  • A greedy anytime algorithm for sparse PCA (COLT, 2020)

4. Theory meets machine learning (algorithms & graphs)

This project bridges theoretical computer science and modern machine learning, focusing on how learning systems solve combinatorial optimization problems. The goal is to uncover the structural principles that enable neural networks to succeed on tasks such as SAT, Max-Clique, and graph coloring.

Selected publications
  • Concept learning for algorithmic reasoning: Insights from SAT-solving GNNs (Information Sciences, 2026)
  • Learning to rank: How GNNs solve Max-Clique and Sparse PCA (AAAI, 2026, oral)
  • From Black Box to Algorithmic Insight: Explainable AI in Graph Neural Networks for Graph Coloring (AAAI NeurMAD Workshop, 2025)

5. Voice-based mental health detection

We study whether the human voice can serve as a robust, non-invasive biomarker for mental health, focusing on depression and related conditions. The work combines clinical data collection with machine learning pipelines designed for real-world deployment. A central challenge is distinguishing true biomarkers from superficial correlations while ensuring robustness to domain shifts and manipulation.

Selected publications
  • Method matters: Enhancing voice-based depression detection with a new data collection framework (Depression & Anxiety, 2025)