4 Recon Training - Automating Reverse Engineering with Machine Learning, Binary Analysis, and Natural Language Processing by Malachi Jones & Joe Mansour

Automating Reverse Engineering Processes with AI/ML, NLP, and LLMs


Instructors: Malachi Jones & Joe Mansour
Dates: June 23 to 26, 2025
Capacity: 25


This course enhances reverse engineering (RE) processes through automation, focusing on efficiency and scalability in malware and firmware analysis by integrating Neural Networks (NN), Natural Language Processing (NLP), and Large Language Models (LLMs). Participants will be introduced to Blackfyre, an open-source platform developed for the course, featuring a Ghidra plugin for initial data extraction of a binary into a Binary Context Container (BCC), and a Python library for advanced data parsing and lifting to VEX Intermediate Representation (IR). These features of Blackfyre are pivotal in facilitating the analysis and extraction of data and metadata, which are critical for effectively applying NN, NLP, and LLM techniques in RE tasks. The course's primary objective is to impart foundational knowledge and practical methodologies in RE automation, equipping learners with essential tools and skills. It emphasizes the application of these foundational elements, enabling learners to effectively utilize them in their independent projects, while not necessarily aiming to provide a complete end-to-end automation solution.

As the course progresses, students delve into Neural Networks (NN) and Natural Language Processing (NLP), which address significant challenges in reverse engineering. NLP converts textual elements into numerical 'embeddings' that NN models can consume. In malware analysis, these technologies support tasks such as binary classification and anomaly detection; in firmware analysis, they help predict function and binary names and detect similarities between binaries and functions. The course introduces the BinaryRank algorithm, inspired by PageRank but tailored for static analysis in reverse engineering. BinaryRank assigns a global ranking to basic blocks by analyzing a binary's call graph and function control flow graphs. This ranking captures the relative importance of textual elements in binaries, such as strings, import function names, and function names, which is crucial in NLP applications. Compared with PageRank, BinaryRank has a linear computational cost, making it well suited to binary analysis. Combined with Contrastive Learning, this approach improves the accuracy and relevance of NLP techniques for analyzing and representing binaries. The course reinforces these concepts through hands-on labs.
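The course description does not spell out how BinaryRank works, so the sketch below shows only the PageRank baseline that inspired it: power iteration over a small, hypothetical call graph. The graph, the function names, the damping factor, and the iteration count are all illustrative assumptions and are not taken from Blackfyre or BinaryRank.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    graph maps each node to the list of nodes it points to; every
    successor must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Each node passes an equal share of its rank to its successors.
        for node in nodes:
            for succ in graph[node]:
                new_rank[succ] += damping * rank[node] / len(graph[node])
        rank = new_rank
    return rank


# Hypothetical call graph: caller -> callees.
call_graph = {
    "main": ["parse_config", "connect"],
    "parse_config": ["read_file"],
    "connect": ["read_file", "send_beacon"],
    "read_file": [],
    "send_beacon": [],
}
scores = pagerank(call_graph)
```

In this toy graph, read_file is reached from two callers and ends up ranked above send_beacon and main. A ranking of this kind could then serve as a weight on the strings and names found in each function.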

In the advanced module of the course, LLMs are used to systematically distill and condense vital reverse engineering artifacts and data generated from static and dynamic analysis (e.g., sandbox detonation), making them clearer and easier to interpret. The module begins by highlighting the capabilities of LLMs in function and whole-binary summarization, keeping in mind the token constraints of models like GPT-4o, GPT-3.5, and Llama 2. To work within these limits, the module explores subsampling as a primary tool, allowing for targeted analysis while maintaining data relevance. Discussions also cover LLM-assisted malware analysis, aiming to generate malware signatures and comprehensive malware analysis reports that encapsulate vital details, streamlining information for reverse engineers. To ensure a practical grasp of these concepts, multiple labs are integrated throughout this module, focusing on subsampling methods, function and whole-binary summarization techniques, and LLM-assisted malware analysis for signature and report generation.
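The course's subsampling methods are not described here. As one possible illustration of the idea, the sketch below greedily keeps the highest-scored functions whose text still fits a token budget. The function names, scores, texts, and the whitespace-based token estimate are all assumptions; a real tokenizer would count differently.

```python
def estimate_tokens(text):
    # Crude stand-in for a tokenizer: count whitespace-separated words.
    return len(text.split())


def subsample_by_budget(items, token_budget):
    """Keep the highest-scored items that fit within token_budget.

    items is a list of (name, score, text) tuples. Returns the selected
    names in descending score order and the estimated tokens used.
    """
    selected = []
    used = 0
    for name, _score, text in sorted(items, key=lambda it: it[1], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= token_budget:
            selected.append(name)
            used += cost
    return selected, used


# Hypothetical per-function importance scores and summaries.
functions = [
    ("decrypt_payload", 0.42, "xor loop over buffer with rolling key"),
    ("send_beacon", 0.31, "build http request and post host info"),
    ("log_debug", 0.05, "format string and write to file"),
    ("parse_config", 0.22, "read registry key and split fields"),
]
picked, tokens_used = subsample_by_budget(functions, token_budget=14)
```

With a budget of 14 estimated tokens, only decrypt_payload and send_beacon are kept; raising the budget to 20 would also admit parse_config.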

This course is designed for reverse engineering practitioners with a strong foundation in reverse engineering concepts and proficiency in Python object-oriented programming for hands-on labs. Participants should also have a basic understanding of mathematical concepts (e.g., vectors, weighted averages, and Euclidean distance) and foundational Machine Learning (ML) knowledge, including supervised learning, feature extraction, and evaluation metrics (e.g., precision and recall). This background is essential for engaging with advanced ML topics such as transformers and fine-tuning Large Language Models (LLMs). By meeting these prerequisites, participants will gain practical skills in automating reverse engineering tasks using AI/ML, NLP, and LLMs.


PREREQUISITES


A strong foundation in reverse engineering, including assembly languages, calling conventions, file formats, and Control Flow Graphs (CFGs), is essential for this course. Participants should have at least two years of hands-on experience and proficiency in Python object-oriented programming (e.g., classes, inheritance, and polymorphism) for lab work.

The course explores advanced Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs) for malware and firmware analysis. Students should be familiar with key ML concepts such as supervised and unsupervised learning, feature extraction, data preprocessing, overfitting, underfitting, and evaluation metrics like precision, recall, and F1-score, as these are foundational to advanced topics like transformers and LLM fine-tuning.
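As a quick self-check on the evaluation metrics named above, the sketch below computes precision, recall, and F1-score for a binary classifier from scratch. The labels are made up for illustration (1 = malicious, 0 = benign) and do not come from the course material.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical ground truth and predictions for eight samples.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here there are three true positives, one false positive, and one false negative, so precision, recall, and F1 all come out to 0.75.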

Basic knowledge of mathematical concepts such as vectors, weighted averages, probabilities, gradient descent, and distance metrics (e.g., Euclidean distance) is also critical for applying ML techniques to automate reverse engineering tasks.
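For readers gauging their comfort with these concepts, the sketch below works through two of them on toy vectors: Euclidean distance and a weighted average of vectors. The vectors and weights are illustrative assumptions only.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_average(vectors, weights):
    """Component-wise average of vectors, each scaled by its weight."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
# Two toy 2-D vectors; the first counts three times as much as the second.
blended = weighted_average([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
```

The distance is 5.0 (a 3-4-5 triangle), and the weighted average is [0.75, 0.25], pulled toward the more heavily weighted vector.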



OBJECTIVES

 



WHO SHOULD TAKE THIS COURSE


This course is ideal for reverse engineering practitioners aiming to delve into the automation of reverse engineering processes, chiefly in malware and firmware analysis, using Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs). It provides a blend of theoretical insights and practical skills to tackle modern challenges in malware and firmware analysis, and lays the groundwork for independent exploration (outside of the course) of other areas like vulnerability analysis and software bill of materials (SBOM) generation. With a structured curriculum, the course serves as a valuable stepping stone for those looking to advance their expertise in automating RE tasks within the cybersecurity domain.



WHO WOULD NOT BE A GOOD FIT FOR THIS COURSE


This course may not suit individuals who are not proficient in Python object-oriented programming, which is vital for navigating the course labs. Those who are uncomfortable with key mathematical concepts such as vectors, weighted averages, and distance metrics (e.g., Euclidean distance) may find the technical depth challenging. While the course explores Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs), it is principally designed for automating specific reverse engineering processes in malware and firmware analysis. Therefore, those seeking a broader or more generalized education on ML, NLP, or LLMs without a focus on reverse engineering may find that the course does not align with their learning objectives.





WHAT SHOULD STUDENTS BRING


Students should bring a laptop with at least 32 GB of RAM, 250 GB of free disk space, and a processor with at least 4 cores, equivalent to an Intel i7 or higher. The processor must use the x86_64 architecture to ensure compatibility with the course-provided virtual machine (VM) and to run VirtualBox version 7.1 or later. The processor must also support AVX (Advanced Vector Extensions), which is required for running machine learning frameworks such as TensorFlow and PyTorch. Network connectivity is also essential for accessing external services used in the Large Language Model (LLM) components of the course. VirtualBox should be pre-installed to enable participation in the hands-on labs and exercises.
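One way to confirm AVX support ahead of class on a Linux host is to look for the avx flag in /proc/cpuinfo. The helper below is a minimal sketch under that assumption; the function name and the sample text are illustrative, and Windows and macOS expose CPU features differently.

```python
def has_avx(cpuinfo_text):
    """Return True if a 'flags' line in Linux cpuinfo text lists avx."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Match the exact flag so that avx2 or avx512f alone do not count.
        if key.strip() == "flags" and "avx" in value.split():
            return True
    return False


# Abbreviated, made-up cpuinfo excerpts for demonstration.
sample_with_avx = "model name\t: Example CPU\nflags\t\t: fpu sse sse2 avx avx2\n"
sample_without_avx = "model name\t: Example CPU\nflags\t\t: fpu sse sse2\n"
supported = has_avx(sample_with_avx)
unsupported = has_avx(sample_without_avx)
```

On a real Linux machine, the same function can be called on the contents of /proc/cpuinfo, for example has_avx(open("/proc/cpuinfo").read()).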





BIO


Dr. Malachi Jones is a Principal Cyber Security AI/LLM Researcher at Microsoft, working within the Microsoft Security AI (MSECAI) team. His current work focuses on fine-tuning Large Language Models (LLMs) for security applications, optimizing their performance for tasks such as reverse engineering (RE) and malware analysis. Dr. Jones leads the development of reverse engineering capabilities for Security Copilot, integrating advanced static and dynamic analysis techniques, including sandbox detonation, with LLMs to enhance malware detection and analysis workflows.

With over 15 years of experience in security research, Dr. Jones has made significant contributions to both academic and industrial sectors. Prior to his work at Microsoft, he held a role at MITRE, where he applied machine learning (ML) and intermediate representation (IR) languages to automate reverse engineering tasks. During his tenure, he also developed and taught a course titled "Automating Reverse Engineering with Machine Learning and Binary Analysis." At Booz Allen Dark Labs, Dr. Jones specialized in embedded security research and developed LLVM IR-based tools for automated vulnerability assessment, co-authoring US Patent 10,133,871.

In addition to his work at Microsoft, Dr. Jones is the founder of Jones Cyber-AI, a company dedicated to his independent research and teaching initiatives. Through Jones Cyber-AI, he has taught his specialized course, "Automating Reverse Engineering Processes with AI/ML, NLP, and LLMs," at leading conferences such as Black Hat USA (2019, 2021, 2023, and 2024) and RECON Montreal (2023 and 2024). His independent research into AI, ML, NLP, and LLMs ensures his courses remain cutting-edge and relevant to the latest advancements in cybersecurity.

Dr. Jones has also demonstrated a commitment to education, having served as an Adjunct Professor at the University of Maryland, College Park, from 2019 to 2020, where he taught "Machine Learning Techniques Applied to Cybersecurity."

Dr. Jones holds a B.S. in Computer Engineering from the University of Florida and both an M.S. and Ph.D. from Georgia Tech, where his research focused on applying game theory to cybersecurity challenges. His expertise in AI, ML, LLMs, and cybersecurity continues to drive innovations in reverse engineering and broader cybersecurity solutions.






Joe Mansour is a Security Researcher at Microsoft. With a focus on reverse engineering malware, he develops detections to protect customers. His expertise is rooted in a background that spans red teaming, vulnerability assessment, and hardware hacking. Joe has contributed to projects involving automated reverse engineering, showcasing his aptitude for binary analysis and for developing tools that simplify the complexities of reverse engineering.

He holds an M.S. in Computer Science from Johns Hopkins University and a B.S. from the University of Illinois at Urbana-Champaign. He currently resides in Reston, Virginia.
