Automating Reverse Engineering with Machine Learning, Binary Analysis, and Natural Language Processing


Instructors: Malachi Jones
Dates: June 5 to 8, 2023
Capacity: 25


Reverse engineering (RE) applications (e.g., malware detection, firmware/vulnerability analysis, and software bill of materials [SBOM] generation) have historically been manual and time-intensive endeavors that often require highly skilled practitioners. The prevalence of malware and the growth and ubiquity of IoT devices have created a need to automate reverse engineering in a manner that scales well and meets performance and business requirements.

 

In this course, we will introduce, discuss, and demonstrate (via labs) how Binary Analysis, Machine Learning (ML), and Natural Language Processing (NLP) can be leveraged to address automation, scaling, and performance challenges in RE. In particular, we will show how Binary Analysis and NLP techniques can capture relevant information/features (e.g., strings, symbols, control flow graphs, and instructions) and represent them in a mathematical form (i.e., vectors) that is suitable for ingestion by ML algorithms.
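
As a rough illustration of representing binary features as vectors, the sketch below counts instruction-mnemonic bigrams, a bag-of-n-grams approach borrowed from NLP. It is not taken from the course material: the mnemonic sequences, the function names, and the choice of bigrams are all assumptions made for this example, and a real pipeline would obtain the mnemonics from a disassembler.

```python
from collections import Counter

def mnemonic_ngrams(mnemonics, n=2):
    """Return the n-grams (as tuples) over a sequence of instruction mnemonics."""
    return [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]

def build_vocabulary(functions, n=2):
    """Assign a fixed vector index to every n-gram seen in the corpus."""
    vocab = {}
    for mnemonics in functions:
        for gram in mnemonic_ngrams(mnemonics, n):
            vocab.setdefault(gram, len(vocab))
    return vocab

def vectorize(mnemonics, vocab, n=2):
    """Build a count vector; n-grams outside the vocabulary are ignored."""
    counts = Counter(mnemonic_ngrams(mnemonics, n))
    vec = [0] * len(vocab)
    for gram, count in counts.items():
        if gram in vocab:
            vec[vocab[gram]] = count
    return vec

# Hypothetical mnemonic sequences standing in for two disassembled functions.
FUNC_A = ["push", "mov", "sub", "call", "mov", "sub", "ret"]
FUNC_B = ["push", "mov", "xor", "ret"]

VOCAB = build_vocabulary([FUNC_A, FUNC_B])
VEC_A = vectorize(FUNC_A, VOCAB)
VEC_B = vectorize(FUNC_B, VOCAB)
```

Each function becomes a fixed-length list of counts over the shared vocabulary, which is the kind of input most classical ML algorithms expect.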

 

A common thread across RE applications in analysis and detection is the need to identify which features of a target binary/function are relevant and to quantify the importance of each relevant feature. Feature importance scores allow us to optimize performance for downstream RE tasks, including the following (each will be demonstrated via labs): 1) automatically generating a YARA rule for malware identification; 2) function/binary matching, in combination with Locality Sensitive Hashing (LSH), for SBOM generation and vulnerability detection; and 3) classification of binaries for malware detection. We will discuss and demonstrate how ML models (e.g., SVM, Logistic Regression, and Random Forest) can be used to derive and apply feature importance scores to automate downstream RE tasks in a scalable and performant manner.
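
To make the LSH-based matching idea more concrete, here is a minimal sketch of MinHash signatures with banding, one common LSH scheme for set similarity. The course description does not say which LSH variant is used, so this choice is an assumption, as are the feature sets (hypothetical imported symbol names) and all function names below.

```python
import hashlib

def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def minhash_signature(tokens, num_hashes=64):
    """MinHash signature: the minimum hash of the (non-empty) token set per seed."""
    token_set = set(tokens)
    return [min(_hash(seed, t) for t in token_set) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions; estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

def lsh_buckets(signature, bands=16):
    """Split a signature into bands; items sharing any band are match candidates."""
    rows = len(signature) // bands
    return {(i, tuple(signature[i * rows:(i + 1) * rows])) for i in range(bands)}

# Hypothetical feature sets (e.g., imported symbols) for three functions.
F1 = ["strcpy", "malloc", "free", "memcpy", "printf"]
F2 = ["printf", "free", "memcpy", "malloc", "strcpy"]  # same set, reordered
F3 = ["socket", "connect", "send", "recv"]              # unrelated set

SIG1, SIG2, SIG3 = (minhash_signature(f) for f in (F1, F2, F3))
SAME = estimated_jaccard(SIG1, SIG2)
DIFF = estimated_jaccard(SIG1, SIG3)
CANDIDATE = bool(lsh_buckets(SIG1) & lsh_buckets(SIG2))
```

Banding trades precision for speed: only items that collide in at least one band need a full comparison, which is what lets matching scale to large corpora of functions.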

 

We will conclude the course with a brief introduction to neural networks (NN) and the Keras/TensorFlow framework. We will discuss applications of NN to automated reverse engineering and demonstrate applications that include malware detection (via dense layers) and function name prediction for stripped binaries (via recurrent neural networks [RNNs]).

 


 


PREREQUISITES


We will assume that each student has a basic understanding of core computer engineering concepts and is COMFORTABLE WITH PYTHON OBJECT-ORIENTED PROGRAMMING CONCEPTS AND TECHNIQUES (e.g., classes, inheritance, method overriding, polymorphism) that are needed for the course labs. KNOWLEDGE OF REVERSE ENGINEERING CONCEPTS (e.g., assembly languages, calling conventions, file formats, and CFGs) is critical for this course, and it will be assumed that each student has at least one year of hands-on RE experience. Exposure to machine learning and/or NLP concepts can be helpful but is not necessary.

 

OBJECTIVES

 

Leverage NLP and Binary Analysis techniques to capture relevant features (e.g., strings, symbols, control flow graphs, and instructions) and to represent them in a mathematical form (i.e., vectors) that is suitable for ingestion by ML algorithms.


Utilize prominent ML and NN/DL frameworks (e.g., scikit-learn and Keras/TensorFlow) to perform binary/function classification and optimize performance of downstream automated RE tasks.


Incorporate Binary Analysis, Natural Language Processing, and ML/NN/DL techniques to design and implement automated RE applications (e.g., firmware analysis, malware detection, vulnerability analysis, and software bill of materials generation) that scale well and are performant.



WHAT STUDENTS SHOULD BRING


A laptop with at least 16 GB of RAM (preferably 32 GB), 50 GB of free disk space, and either VMware Workstation (>=15.0) or VirtualBox (>=6.1) preinstalled.



BIO



Dr. Malachi Jones is a Principal Applied Scientist at Microsoft in the Cloud Security Lab for AI/ML, where one focus is applying large language models (LLMs) to cyber security. He has over 12 years of combined experience performing security research in academia and industry. Malachi has taught a course titled "Machine Learning Techniques Applied to Cyber Security" at the University of Maryland, College Park as an Adjunct Professor.


 

Prior to joining Microsoft, Malachi worked at MITRE, where he specialized in automating reverse engineering by leveraging intermediate representation languages and machine learning techniques. As an independent effort, Malachi developed a 3-day course titled "Automating Reverse Engineering with Machine Learning and Binary Analysis" that he taught internally at MITRE as part of the MITRE Institute.


 

Prior to MITRE, Malachi worked as an embedded security researcher at Booz Allen Dark Labs, where he developed tools that leveraged LLVM IR to perform automated vulnerability assessment; he is a co-author of a US Patent (US 10,133,871, issued Nov 20, 2018) as a result of this work. He was also an instructor in Booz Allen's internal reverse engineering training program. Prior to Dark Labs, he worked as a vulnerability researcher at a defense contractor in Melbourne, FL for over two years.


 

Dr. Jones holds a B.S. in Computer Engineering from the University of Florida and an M.S. and Ph.D. from Georgia Tech. His graduate work at Georgia Tech focused on modeling cyber security problems in a game-theoretic framework to perform actionable cyber-attack forecasting.


 

In addition to his work at Microsoft, he conducts independent research on automating reverse engineering with Machine Learning, Binary Analysis, and Natural Language Processing. He has taught a course titled "Automating Reverse Engineering with Machine Learning and Binary Analysis" at Black Hat and has previously taught at Black Hat USA 2019 and 2021.


 


