Completed STANDARD GRANT National Science Foundation (US)

SaTC: CORE: Small: SOFIA: Finding and profiling malware source-code in public archives at scale

$5M USD

Funder National Science Foundation (US)
Recipient Organization University of California-Riverside
Country United States
Start Date Oct 01, 2021
End Date Sep 30, 2025
Duration 1,460 days
Number of Grantees 1
Roles Principal Investigator
Data Source National Science Foundation (US)
Grant ID 2132642
Grant Description

This project develops methods and tools for identifying and profiling malware source code in public software archives. It is motivated by the following insight: software archives such as GitHub host a surprisingly large number of publicly accessible malware repositories, where a "malware repository" is a repository that provides the source code for compiling a working malware binary (effectively the malware author's software project).

This constitutes a huge missed opportunity: GitHub alone has more than 32 million public repositories, and there are many similar software platforms. The project makes two contributions: (a) novel methods to extract and profile malware source code effectively and at scale, and (b) the largest annotated malware source code database. Ultimately, the project provides a key building block for fighting malware: security research could greatly benefit from an extensive database of malware source code, which is currently unavailable.

The project's broader significance is that it can help build a preventive system against malware. This capability enables early detection of malware and of its creation within the hacker ecosystem, which can provide critical insights into hackers' activities in advance, possibly before an infection or attack takes place.

From a technical point of view, the project develops methods to: (a) systematically mine these repositories, and (b) study and profile the malware and the related hacking ecosystem. Preliminary research has identified more than 7000 GitHub malware source code repositories and many highly-collaborative communities with hundreds of malware authors. A key novelty is that repositories are described with a comprehensive set of features along three dimensions: (a) metadata, such as title and description, (b) the source code and its structure, and (c) the social context, which captures the interactions among authors and repositories.
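The three feature dimensions described above can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the project's actual schema or method: the `RepoProfile` record, its field names, the sample data, and the `coauthor_edges` helper are hypothetical. It only shows one simple way the social context could be derived, by counting how many repositories each pair of authors shares.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class RepoProfile:
    """Hypothetical record grouping features along the three dimensions."""
    # (a) metadata
    title: str
    description: str
    # (b) source code and its structure (here: file counts per extension)
    file_extensions: Counter = field(default_factory=Counter)
    # (c) social context (here: the set of contributing authors)
    authors: frozenset = frozenset()


def coauthor_edges(profiles):
    """Count how many repositories each pair of authors shares."""
    edges = Counter()
    for profile in profiles:
        # Sort so each unordered pair always maps to the same key.
        for a, b in combinations(sorted(profile.authors), 2):
            edges[(a, b)] += 1
    return edges


# Placeholder data, not drawn from any real repository.
profiles = [
    RepoProfile("demo-one", "placeholder", Counter({".c": 3}),
                frozenset({"alice", "bob"})),
    RepoProfile("demo-two", "placeholder", Counter({".py": 1}),
                frozenset({"alice", "bob", "carol"})),
]
edges = coauthor_edges(profiles)
```

Pairs with high counts in `edges` would correspond to authors who collaborate repeatedly, which is the kind of interaction the social-context dimension is meant to capture.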

The second key novelty is algorithmic: the project develops new approaches and also evaluates and adapts (a) state-of-the-art data-mining techniques, such as word embedding, (b) code-specific profiling techniques, and (c) techniques for describing the interactions of authors, such as a novel recursive tensor decomposition.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

All Grantees

University of California-Riverside
