Duration: 4 days
Instructors: Malachi Jones & Joe Mansour
Dates: June 23 to 26, 2025
Capacity: 25
This course enhances reverse engineering (RE) processes through automation, focusing on efficiency and
scalability in malware and firmware analysis by integrating Neural Networks (NN), Natural Language
Processing (NLP), and Large Language Models (LLMs). Participants will be introduced to Blackfyre, an
open-source platform developed for the course, featuring a Ghidra plugin for initial data extraction of
a binary into a Binary Context Container (BCC), and a Python library for advanced data parsing and
lifting to VEX Intermediate Representation (IR). These features of Blackfyre are pivotal in facilitating
the analysis and extraction of data and metadata, which are critical for effectively applying NN, NLP,
and LLM techniques in RE tasks. The course's primary objective is to impart foundational knowledge and
practical methodologies in RE automation, equipping learners with essential tools and skills. It
emphasizes the application of these foundational elements, enabling learners to effectively utilize them
in their independent projects, while not necessarily aiming to provide a complete end-to-end automation
solution.
As the course progresses, students delve deeply into Neural Networks (NN) and Natural Language
Processing (NLP), crucial for addressing significant challenges in reverse engineering. NLP plays a
pivotal role by converting textual elements into numerical 'embeddings' that NN models can
analyze. In malware analysis, these technologies are instrumental for tasks like binary
classification and anomaly detection, while in firmware analysis, they assist in predicting function and
binary names and detecting similarities between binaries/functions. The course introduces the BinaryRank
algorithm, inspired by PageRank but tailored for static analysis in reverse engineering. BinaryRank
assigns a global ranking to basic blocks by analyzing a binary's call graph and function control flow
graphs. This ranking system emphasizes the relevance/importance of textual elements in binaries, such as
strings, import function names, and function names, which is crucial in NLP applications. Unlike
PageRank, whose computation is more expensive, BinaryRank runs at a linear computational cost,
making it well suited to binary analysis. This approach, combined with Contrastive Learning,
enhances the accuracy and relevance of NLP techniques in analyzing and representing binaries. The course
reinforces these concepts through hands-on labs.
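To make the idea of rank-weighted embeddings concrete, here is a minimal sketch in plain Python. It is not the course's implementation and does not use Blackfyre or BinaryRank itself: the token vectors, the weights standing in for a ranking score, and the function names are all hypothetical. It shows only how textual elements with numeric vectors can be combined by a weighted average and then compared with Euclidean distance.

```python
import math


def weighted_embedding(token_vectors, weights):
    """Combine per-token vectors into one vector using a weighted average.

    token_vectors: dict mapping a token (e.g. a string or import name) to a vector.
    weights: dict mapping the same tokens to a non-negative importance score
             (a stand-in for a ranking score; values here are made up).
    """
    total = sum(weights[tok] for tok in token_vectors)
    if total == 0:
        raise ValueError("weights must not all be zero")
    dim = len(next(iter(token_vectors.values())))
    result = [0.0] * dim
    for tok, vec in token_vectors.items():
        share = weights[tok] / total  # normalized weight of this token
        for i, value in enumerate(vec):
            result[i] += share * value
    return result


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy 2-dimensional vectors; real embeddings would come from a trained model.
vectors = {"CreateFileW": [1.0, 0.0], "error": [0.0, 1.0]}
scores = {"CreateFileW": 3.0, "error": 1.0}

binary_vec = weighted_embedding(vectors, scores)
distance_to_error = euclidean(binary_vec, vectors["error"])
```

Because "CreateFileW" carries three times the weight of "error" in this toy input, the combined vector sits closer to the former, which is the effect a relevance ranking is meant to have on the final representation.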
In the advanced module of the course, LLMs are presented to systematically distill and condense vital
reverse engineering artifacts and data generated from static and dynamic analysis (e.g. sandbox
detonation), enhancing their clarity and interpretability. The module begins by highlighting the
capabilities of LLMs in function and whole-binary summarization, keeping in mind the token constraints
of models like GPT-4o, GPT-3.5, and Llama 2. To overcome these limitations, the module explores
subsampling as a primary tool, allowing for targeted analysis while maintaining data relevance.
Discussions also cover LLM-assisted malware analysis, aiming to generate malware signatures and
comprehensive malware analysis reports that encapsulate vital details, streamlining information for
reverse engineers. To ensure a practical grasp of these concepts, multiple labs are integrated
throughout this module, specifically focusing on subsampling methods, function and whole-binary
summarization techniques, and LLM-assisted malware analysis for malware signature and comprehensive
report generation.
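One simple way to picture subsampling under a token limit is a greedy selection by importance score. The sketch below is an illustration only, not the method taught in the labs: the function names, the scores, and the whitespace-based token estimate are assumptions, and a real workflow would use the target model's own tokenizer.

```python
def approx_tokens(text):
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def subsample_by_budget(items, budget, count_tokens=approx_tokens):
    """Greedily keep the highest-scoring items that fit in a token budget.

    items: list of (name, score, text) tuples, e.g. decompiled functions
           with a hypothetical importance score.
    budget: maximum total number of tokens allowed.
    Returns (chosen_names, tokens_used).
    """
    chosen = []
    used = 0
    # Visit items from most to least important.
    for name, score, text in sorted(items, key=lambda it: it[1], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget:  # skip anything that would exceed the budget
            chosen.append(name)
            used += cost
    return chosen, used


# Toy input: function name, made-up score, placeholder body text.
functions = [
    ("main", 0.9, "a b c d"),
    ("helper", 0.2, "a b"),
    ("parse", 0.5, "a b c"),
]

selected, tokens_used = subsample_by_budget(functions, budget=6)
```

With a budget of 6, "main" (4 tokens) is taken first, "parse" (3 tokens) no longer fits, and "helper" (2 tokens) fills the remainder. A greedy pass like this is easy to reason about but is not guaranteed to find the best possible selection.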
This course is designed for reverse engineering practitioners with a strong foundation in reverse
engineering concepts and proficiency in Python object-oriented programming for hands-on labs.
Participants should also have a basic understanding of mathematical concepts (e.g., vectors, weighted
averages, and Euclidean distance) and foundational Machine Learning (ML) knowledge, including supervised
learning, feature extraction, and evaluation metrics (e.g., precision and recall). This background is
essential for engaging with advanced ML topics such as transformers and fine-tuning Large Language
Models (LLMs). By meeting these prerequisites, participants will gain practical skills in automating
reverse engineering tasks using AI/ML, NLP, and LLMs.
A strong foundation in reverse engineering—including assembly languages, calling conventions, file
formats, and Control Flow Graphs (CFGs)—is essential for this course. Participants should have at least
two years of hands-on experience and proficiency in Python object-oriented programming (e.g., classes,
inheritance, and polymorphism) for lab work.
The course explores advanced Machine Learning (ML), Natural Language Processing (NLP), and Large
Language Models (LLMs) for malware and firmware analysis. Students should be familiar with key ML
concepts such as supervised and unsupervised learning, feature extraction, data preprocessing,
overfitting, underfitting, and evaluation metrics like precision, recall, and F1-score, as these are
foundational to advanced topics like transformers and LLM fine-tuning.
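As a quick self-check on the evaluation metrics named above, the following sketch computes precision, recall, and F1-score for a binary classifier from scratch. The labels are invented (1 could stand for "malicious", 0 for "benign"), and the function name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up ground truth and predictions for six samples.
truth = [1, 1, 1, 0, 0, 0]
preds = [1, 1, 0, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(truth, preds)
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so precision, recall, and F1 all come out to two thirds.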
Basic knowledge of mathematical concepts such as vectors, weighted averages, probabilities, gradient
descent, and distance metrics (e.g., Euclidean distance) is also critical for applying ML techniques to
automate reverse engineering tasks.
This course is ideal for reverse engineering practitioners aiming to delve into the automation of reverse engineering processes, chiefly in malware and firmware analysis, using Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs). It provides a blend of theoretical insights and practical skills to tackle modern challenges in malware and firmware analysis, and lays groundwork for independent exploration (outside of the course) of other areas like vulnerability analysis and software bill of materials (SBOM) generation. With a structured curriculum, the course serves as a valuable stepping stone for those looking to advance their expertise in automating RE tasks within the cybersecurity domain.
This course may not suit individuals who are not proficient in Python object-oriented programming, which is vital for navigating the course labs. Those who are uncomfortable with key mathematical concepts such as vectors, weighted averages, and distance metrics (e.g., Euclidean distance) may find the technical depth challenging. While the course explores Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs), it is principally designed for automating specific reverse engineering processes in malware and firmware analysis. Therefore, those seeking a broader or more generalized education on ML, NLP, or LLMs without a focus on reverse engineering may find that the course does not align with their learning objectives.
Students should ensure they have a laptop with a minimum of 32 GB RAM, 250 GB of free disk space, and a processor with at least 4 cores, equivalent to an Intel i7 or higher. The processor must use the x86_64 architecture to ensure compatibility with the course-provided virtual machine (VM) and to run VirtualBox version 7.1 or later. Additionally, the processor must support AVX (Advanced Vector Extensions), which is required for running machine learning frameworks such as TensorFlow and PyTorch. Network connectivity is also essential for accessing external services used in the Large Language Model (LLM) components of the course. VirtualBox should be pre-installed to enable participation in the hands-on labs and exercises.
Dr. Malachi Jones is a Principal Cyber Security AI/LLM Researcher at Microsoft, working within the
Microsoft Security AI (MSECAI) team. His current work focuses on fine-tuning Large Language Models
(LLMs) for security applications, optimizing their performance for tasks such as reverse engineering
(RE) and malware analysis.
Dr. Jones leads the development of reverse engineering capabilities for Security Copilot, integrating
advanced static and dynamic analysis techniques, including sandbox detonation, with large language
models (LLMs) to enhance malware detection and analysis workflows.
With over 15 years of experience in security research, Dr. Jones has made significant contributions to
both academic and industrial sectors. Prior to his work at Microsoft, he held a role at MITRE, where he
applied machine learning (ML) and intermediate representation (IR) languages to automate reverse
engineering tasks. During his tenure, he also developed and taught a course titled "Automating Reverse
Engineering with Machine Learning and Binary Analysis." At Booz Allen Dark Labs, Dr. Jones specialized
in embedded security research and developed LLVM IR-based tools for automated vulnerability assessment,
co-authoring US Patent 10,133,871.
In addition to his work at Microsoft, Dr. Jones is the founder of Jones Cyber-AI, a company dedicated to
his independent research and teaching initiatives. Through Jones Cyber-AI, he has taught his specialized
course, "Automating Reverse Engineering Processes with AI/ML, NLP, and LLMs," at leading conferences
such as Black Hat USA (2019, 2021, 2023, and 2024) and RECON Montreal (2023 and 2024). His independent
research into AI, ML, NLP, and LLMs ensures his courses remain cutting-edge and relevant to the latest
advancements in cybersecurity.
Dr. Jones has also demonstrated a commitment to education, having served as an Adjunct Professor at the
University of Maryland, College Park, from 2019 to 2020, where he taught "Machine Learning Techniques
Applied to Cybersecurity."
Dr. Jones holds a B.S. in Computer Engineering from the University of Florida and both an M.S. and Ph.D.
from Georgia Tech, where his research focused on applying game theory to cybersecurity challenges. His
expertise in AI, ML, LLMs, and cybersecurity continues to drive innovations in reverse engineering and
broader cybersecurity solutions.
Joe Mansour is a Security Researcher at Microsoft. With a focus on reverse engineering malware, he
develops detections to protect customers. His expertise is rooted in a background that spans red
teaming, vulnerability assessment, and hardware hacking. Joe has contributed to projects involving
automated reverse engineering, showcasing his aptitude for binary analysis and tool development to
simplify the complexities of reverse engineering.
He holds an M.S. in Computer Science from Johns Hopkins University and a B.S. from the University of
Illinois at Urbana-Champaign. He currently resides in Reston, Virginia.