SENSOR: Graph-Based Revision History Analysis for Code Evolution Introspection

15 Sep, 2022

During the summer of 2022, I worked as a Research Intern at the Computer Science Laboratory at SRI International. My research focused on securing open-source software against malicious actors and influence operations within developer communities.

Research Objective

How can we protect the integrity of open-source software projects from malicious actors and influence operations?

Open-source software underpins much of today's digital infrastructure, making its security a critical issue. Unfortunately, several high-profile attacks on open-source projects have exposed supply chain vulnerabilities and triggered downstream security incidents.

To make progress on this broader challenge, our team focused on a more concrete problem:

Can we detect malicious patches in the Linux kernel repository?

Approach

The team had previously developed a graph-based AI model to analyze the social dynamics of the Linux kernel community. This model was able to flag incidents like Hypocrite Commits and other influence operations with encouragingly low false positive rates.

My role was to enrich the model by integrating signals from actual code changes. Here’s how I approached it:

1. Repository Mining. Extracting relevant patch data was straightforward once we identified the right abstractions and tools.
2. Code Representation via Change Graphs. We experimented with Change Graphs to represent code evolution. This turned out to be more challenging than expected, largely because of the Linux kernel’s constantly evolving build dependencies. We partially addressed this with TuxMake, a fantastic build tool for kernel compilation.
3. Feature Engineering for Patch Classification. This was the most engaging part of my work. We designed meaningful features from code changes and implemented analysis passes to extract them. These features became the foundation for improving classification accuracy.
4. Patch Classification with Graph Neural Networks (GNNs). Building on the existing graph structure, we trained a GNN model to classify patches as malicious or benign. This stage taught me a lot about combining social and code-level signals for effective detection.
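
To give a flavor of what an analysis pass over a code change can look like, here is a minimal Python sketch that counts a few size features from a unified diff. It is an illustration only, not the project's actual feature set: the PatchFeatures fields, the extract_patch_features function, and the drivers/foo.c sample patch are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class PatchFeatures:
    """Hypothetical size features for one patch (not the project's real features)."""
    files_touched: int = 0
    hunks: int = 0
    lines_added: int = 0
    lines_removed: int = 0


def extract_patch_features(diff_text: str) -> PatchFeatures:
    """Count files, hunks, and added/removed lines in a unified diff."""
    feats = PatchFeatures()
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            feats.files_touched += 1
            # The "---" / "+++" file headers follow; they are not code changes.
            in_hunk = False
        elif line.startswith("@@"):
            feats.hunks += 1
            in_hunk = True
        elif in_hunk and line.startswith("+"):
            feats.lines_added += 1
        elif in_hunk and line.startswith("-"):
            feats.lines_removed += 1
    return feats


# A made-up patch used only to exercise the function.
SAMPLE_DIFF = """\
diff --git a/drivers/foo.c b/drivers/foo.c
--- a/drivers/foo.c
+++ b/drivers/foo.c
@@ -10,3 +10,4 @@ static int foo_probe(void)
     ret = foo_init();
-    if (ret)
+    if (ret < 0)
+        goto err;
     return 0;
"""

features = extract_patch_features(SAMPLE_DIFF)
print(features)
```

Counts like these are deliberately shallow. In a setup like the one described above, per-patch features of this kind would be attached to the patch's node in the graph, so the classifier sees them alongside the social signals rather than in isolation.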

Reflections

This project gave me hands-on experience at the intersection of software security, machine learning, and developer community dynamics. I especially enjoyed tackling the challenges of code representation and feature engineering, areas where small design choices had a big impact on results.

Overall, working at SRI reinforced my interest in securing the open-source ecosystem and highlighted the importance of interdisciplinary approaches to modern security problems.

#internship #kernel #machine-learning #research #security #supply-chain