Automating Reverse Engineering Processes with AI/ML, NLP, and LLMs

Instructors: Malachi Jones & Joe Mansour
Dates: June 23 to 26, 2025
Capacity: 25

This course enhances reverse engineering (RE) processes through automation, focusing on efficiency and scalability in malware and firmware analysis by integrating Neural Networks (NN), Natural Language Processing (NLP), and Large Language Models (LLMs). Participants are introduced to Blackfyre, an open-source platform developed for the course. Blackfyre provides a Ghidra plugin that extracts initial data from a binary into a Binary Context Container (BCC), and a Python library for advanced data parsing and lifting to VEX Intermediate Representation (IR). These features supply the data and metadata needed to apply NN, NLP, and LLM techniques to RE tasks. The course's primary objective is to teach foundational knowledge and practical methods in RE automation that learners can apply in their own independent projects; it does not necessarily aim to provide a complete end-to-end automation solution.

As the course progresses, students explore Neural Networks (NN) and Natural Language Processing (NLP) techniques for tackling core reverse engineering challenges. NLP is used to convert textual features—such as strings, imports, and function names—into embeddings for downstream analysis by NN models. Applications include binary classification, anomaly detection, and function similarity analysis. The course introduces BinaryRank, a static analysis algorithm inspired by PageRank, which ranks basic blocks using call graphs and control flow. Unlike PageRank, BinaryRank is computationally efficient, operating in linear time. Combined with Contrastive Learning, this approach improves the precision of binary representations. Concepts are reinforced through targeted, hands-on labs.
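To make the idea of turning textual features into vectors for similarity analysis concrete, here is a minimal sketch. It is not the course's or Blackfyre's method: it swaps learned NLP embeddings for a plain bag-of-words count vector, and the function names and feature lists are hypothetical.

```python
import math


def build_vocabulary(feature_lists):
    """Assign each distinct textual feature (string, import, name) an index."""
    vocab = {}
    for features in feature_lists:
        for token in features:
            vocab.setdefault(token.lower(), len(vocab))
    return vocab


def to_vector(features, vocab):
    """Bag-of-words count vector; a simple stand-in for a learned embedding."""
    vec = [0.0] * len(vocab)
    for token in features:
        vec[vocab[token.lower()]] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical imports and strings referenced by three functions.
func_a = ["CreateFileW", "WriteFile", "CloseHandle", "config.dat"]
func_b = ["CreateFileW", "WriteFile", "CloseHandle", "settings.ini"]
func_c = ["socket", "connect", "send", "recv"]

vocab = build_vocabulary([func_a, func_b, func_c])
sim_ab = cosine_similarity(to_vector(func_a, vocab), to_vector(func_b, vocab))
sim_ac = cosine_similarity(to_vector(func_a, vocab), to_vector(func_c, vocab))
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```

The two file-handling functions share three of four features and score higher than the unrelated networking function. A learned embedding would additionally place related but non-identical tokens near each other, which count vectors cannot do.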

The advanced module introduces LLMs for condensing and interpreting reverse engineering artifacts produced through static and dynamic analysis. Key tasks include function and binary summarization and LLM-assisted malware analysis. Subsampling methods are used to address model token limits while maintaining data fidelity. Labs in this section focus on summarization and signature generation workflows to support downstream reverse engineering tasks.
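One simple way to picture subsampling under a token limit is evenly spaced selection. The sketch below is an assumption for illustration, not the course's method: it approximates token counts with whitespace-separated words (real tokenizers count differently), and the function names and artifact strings are hypothetical.

```python
def approx_token_count(text):
    """Assumption: whitespace-separated words stand in for model tokens."""
    return len(text.split())


def subsample_to_budget(items, token_budget):
    """Pick evenly spaced items, shrinking the sample until it fits the budget."""
    for sample_size in range(len(items), 0, -1):
        step = len(items) / sample_size
        picked = [items[int(i * step)] for i in range(sample_size)]
        if sum(approx_token_count(p) for p in picked) <= token_budget:
            return picked
    return []  # not even a single item fits


# Hypothetical per-function notes extracted from a binary (5 words each).
artifacts = [f"func_{i} calls CreateFile then WriteFile" for i in range(10)]
picked = subsample_to_budget(artifacts, token_budget=20)
print(len(picked), "of", len(artifacts), "artifacts kept")
```

Even spacing keeps coverage across the whole binary instead of truncating at the first items that fit; a ranking-based selection would be another reasonable choice.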

New in 2025, a dedicated fine-tuning lab has been added to provide hands-on experience training LLaMA 3.2 on real-world reverse engineering problems. Students will explore supervised fine-tuning techniques using A6000 GPUs, incorporating LoRA and quantization strategies via Hugging Face Transformers and the LLaMAFactory platform. Each participant is provisioned with an individual cloud GPU environment to conduct fine-tuning experiments. This lab is designed to bridge research and application, giving students direct exposure to the process of adapting open-source LLMs for security-relevant RE tasks.
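For a rough sense of why LoRA makes fine-tuning tractable on a single GPU, the arithmetic below compares trainable parameters for one weight matrix. LoRA freezes the original matrix and trains two low-rank factors in its place. The dimensions and rank are illustrative assumptions, not the configuration used in the lab.

```python
def full_params(d_out, d_in):
    """Parameters updated when fine-tuning a d_out x d_in weight matrix directly."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """LoRA trains two factors instead: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Illustrative values only; not taken from any specific model or lab setup.
d_out, d_in, rank = 2048, 2048, 8

full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
ratio = lora / full
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {ratio:.4%}")
```

Under these assumed sizes, the adapter trains under 1% of the parameters of the matrix it adapts. Quantizing the frozen base weights reduces memory further.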

This course is designed for reverse engineering practitioners with a strong foundation in reverse engineering concepts and proficiency in Python object-oriented programming for hands-on labs. Participants should also have a basic understanding of mathematical concepts (e.g., vectors, weighted averages, and Euclidean distance) and foundational Machine Learning (ML) knowledge, including supervised learning, feature extraction, and evaluation metrics (e.g., precision and recall). This background is essential for engaging with advanced ML topics such as transformers and fine-tuning Large Language Models (LLMs). Participants who meet these prerequisites will gain practical skills for applying advanced NN/NLP/LLM techniques, including fine-tuning, to RE automation.

PREREQUISITES

A strong foundation in reverse engineering—including assembly languages, calling conventions, file formats, and Control Flow Graphs (CFGs)—is essential for this course. Participants should have at least two years of hands-on experience and proficiency in Python object-oriented programming (e.g., classes, inheritance, and polymorphism) for lab work.

The course explores advanced Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs) for malware and firmware analysis. Students should be familiar with key ML concepts such as supervised and unsupervised learning, feature extraction, data preprocessing, overfitting, underfitting, and evaluation metrics like precision, recall, and F1-score, as these are foundational to advanced topics like transformers and LLM fine-tuning.
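As a quick self-check on the evaluation metrics named above, the following sketch computes precision, recall, and F1-score for a binary classifier. The labels are made up for illustration (1 = malicious, 0 = benign) and the function name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up ground truth and predictions for eight samples.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here there are three true positives, one false positive, and one false negative, so all three metrics come out to 0.75.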

Basic knowledge of mathematical concepts such as vectors, weighted averages, probabilities, gradient descent, and distance metrics (e.g., Euclidean distance) is also critical for applying ML techniques to automate reverse engineering tasks.
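The level of math assumed is roughly what this short sketch covers: a weighted average of vectors and the Euclidean distance between two vectors. The example values are arbitrary and the function names are hypothetical.

```python
import math


def weighted_average(vectors, weights):
    """Combine equal-length vectors into one, weighting each by its weight."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


def euclidean_distance(a, b):
    """Straight-line distance between two points in the same space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Arbitrary example: two 2-D vectors, the first weighted three times as heavily.
combined = weighted_average([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(combined, distance)
```

Anyone comfortable reading and modifying code like this should have enough background for the mathematical portions described here.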

OBJECTIVES

Acquire proficiency in leveraging Machine Learning (ML), Artificial Intelligence (AI), Natural Language Processing (NLP), and Large Language Models (LLMs) to automate distinct aspects of reverse engineering processes, primarily focusing on malware and firmware analysis, to enhance the scalability and efficiency of these analyses.

Utilize prominent ML, NLP, and LLM techniques to effectively capture, represent, and analyze relevant features from binaries, enabling the automation of crucial reverse engineering tasks such as malware detection, firmware analysis, function naming, whole-binary summarization, malware signature generation, and malware report generation.

Apply hands-on skills in various labs to tackle real-world challenges within malware and firmware analysis domains, fostering a deeper understanding of the potential and limitations of ML, AI, NLP, and LLMs in automating reverse engineering tasks.

WHO SHOULD TAKE THIS COURSE

This course is ideal for reverse engineering practitioners aiming to delve into the automation of reverse engineering processes, chiefly in malware and firmware analysis, using Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs). It provides a blend of theoretical insights and practical skills to tackle modern challenges in malware and firmware analysis, and lays the groundwork for independent exploration (outside of the course) of other areas like vulnerability analysis and software bill of materials (SBOM) generation. With a structured curriculum, the course serves as a valuable stepping stone for those looking to advance their expertise in automating RE tasks within the cybersecurity domain.

WHO WOULD NOT BE A GOOD FIT FOR THIS COURSE

This course may not suit individuals who are not proficient in Python object-oriented programming, which is vital for navigating the course labs. Those who are uncomfortable with key mathematical concepts such as vectors, weighted averages, and distance metrics (e.g., Euclidean distance) may find the technical depth challenging. While the course explores Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs), it is principally designed for automating specific reverse engineering processes in malware and firmware analysis. Therefore, those seeking a broader or more generalized education on ML, NLP, or LLMs without a focus on reverse engineering may find that the course does not align with their learning objectives.

WHAT SHOULD STUDENTS BRING

Students should bring a laptop with a minimum of 32 GB RAM, 250 GB of free disk space, and a processor with at least 4 cores, equivalent to an Intel i7 or higher. The processor must use the x86_64 architecture to ensure compatibility with the course-provided virtual machine (VM) and to run VirtualBox version 7.1 or later. Additionally, the processor must support AVX (Advanced Vector Extensions), which is required for running machine learning frameworks such as TensorFlow and PyTorch. Internet connectivity is also essential for accessing external services used in the Large Language Model (LLM) components of the course. VirtualBox should be pre-installed to enable participation in the hands-on labs and exercises.
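One way to confirm AVX support ahead of time on a Linux host is to look for the avx flag in /proc/cpuinfo. The sketch below is a hypothetical helper, not a course-provided tool, and it only parses the Linux x86 format; Windows and macOS users would need a different check.

```python
def cpu_flags_from_cpuinfo(cpuinfo_text):
    """Collect the CPU feature flags listed in Linux /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_avx(cpuinfo_text):
    """True if the exact 'avx' flag is present (avx2 alone does not count)."""
    return "avx" in cpu_flags_from_cpuinfo(cpuinfo_text)


# Sample text in the /proc/cpuinfo layout, used here instead of the real file.
sample = "processor\t: 0\nflags\t\t: fpu vme sse sse2 avx avx2\n"
print("AVX supported:", supports_avx(sample))
```

On a Linux laptop, passing the contents of /proc/cpuinfo (for example, read with `open("/proc/cpuinfo").read()`) in place of the sample string performs the actual check.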

BIO

Dr. Malachi Jones is a Principal Cyber Security AI/LLM Researcher at Microsoft, working within the Microsoft Security AI (MSECAI) team. His current work focuses on fine-tuning Large Language Models (LLMs) for security applications, optimizing their performance for tasks such as reverse engineering (RE) and malware analysis. Dr. Jones leads the development of reverse engineering capabilities for Security Copilot, integrating advanced static and dynamic analysis techniques, including sandbox detonation, with LLMs to enhance malware detection and analysis workflows.

With over 15 years of experience in security research, Dr. Jones has made significant contributions to both academic and industrial sectors. Prior to his work at Microsoft, he held a role at MITRE, where he applied machine learning (ML) and intermediate representation (IR) languages to automate reverse engineering tasks. During his tenure, he also developed and taught a course titled "Automating Reverse Engineering with Machine Learning and Binary Analysis." At Booz Allen Dark Labs, Dr. Jones specialized in embedded security research and developed LLVM IR-based tools for automated vulnerability assessment, co-authoring US Patent 10,133,871.

In addition to his work at Microsoft, Dr. Jones is the founder of Jones Cyber-AI, a company dedicated to his independent research and teaching initiatives. Through Jones Cyber-AI, he has taught his specialized course, "Automating Reverse Engineering Processes with AI/ML, NLP, and LLMs," at leading conferences such as Black Hat USA (2019, 2021, 2023, and 2024) and RECON Montreal (2023 and 2024). His independent research into AI, ML, NLP, and LLMs ensures his courses remain cutting-edge and relevant to the latest advancements in cybersecurity.

Dr. Jones has also demonstrated a commitment to education, having served as an Adjunct Professor at the University of Maryland, College Park, from 2019 to 2020, where he taught "Machine Learning Techniques Applied to Cybersecurity."

Dr. Jones holds a B.S. in Computer Engineering from the University of Florida and both an M.S. and Ph.D. from Georgia Tech, where his research focused on applying game theory to cybersecurity challenges. His expertise in AI, ML, LLMs, and cybersecurity continues to drive innovations in reverse engineering and broader cybersecurity solutions.

Joe Mansour is a Security Researcher at Microsoft. With a focus on reverse engineering malware, he develops detections to protect customers. His expertise is rooted in a background that spans red teaming, vulnerability assessment, and hardware hacking. Joe has contributed to projects involving automated reverse engineering, showcasing his aptitude for binary analysis and for developing tools that simplify the complexities of reverse engineering.

He holds an M.S. in Computer Science from Johns Hopkins University and a B.S. from the University of Illinois at Urbana-Champaign.
