Automating Reverse Engineering Processes with AI/ML, NLP, and LLMs

Instructors: Malachi Jones
Dates: June 24 to 27, 2024
Capacity: 25

This course enhances reverse engineering (RE) processes through automation, focusing on efficiency and scalability in malware and firmware analysis by integrating Neural Networks (NN), Natural Language Processing (NLP), and Large Language Models (LLMs). Participants will be introduced to Blackfyre, an open-source platform developed for the course, featuring a Ghidra plugin for initial data extraction of a binary into a Binary Context Container (BCC), and a Python library for advanced data parsing and lifting to VEX Intermediate Representation (IR). These features of Blackfyre are pivotal in facilitating the analysis and extraction of data and metadata, which are critical for effectively applying NN, NLP, and LLM techniques in RE tasks. The course's primary objective is to impart foundational knowledge and practical methodologies in RE automation, equipping learners with essential tools and skills. It emphasizes the application of these foundational elements, enabling learners to effectively utilize them in their independent projects, while not necessarily aiming to provide a complete end-to-end automation solution.

As the course progresses, students delve deeply into Neural Networks (NN) and Natural Language Processing (NLP), which are crucial for addressing significant challenges in reverse engineering. NLP plays a pivotal role by converting textual elements into numerical 'embeddings' that NN models can analyze. In malware analysis, these technologies are instrumental for tasks like binary classification and anomaly detection, while in firmware analysis they assist in predicting function and binary names and in detecting similarities between binaries and between functions. The course introduces the BinaryRank algorithm, inspired by PageRank but tailored for static analysis in reverse engineering. BinaryRank assigns a global ranking to basic blocks by analyzing a binary's call graph and function control flow graphs. This ranking emphasizes the relevance of textual elements in binaries, such as strings, import function names, and function names, which is crucial in NLP applications. Unlike PageRank, which is comparatively expensive to compute, BinaryRank has a linear computational cost, making it well suited to binary analysis. This approach, combined with Contrastive Learning, enhances the accuracy and relevance of NLP techniques in analyzing and representing binaries. The course reinforces these concepts through hands-on labs.
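To give a sense of the kind of ranking BinaryRank is inspired by, the following is a minimal sketch of classic PageRank (not BinaryRank itself, whose details are covered in the course) computed by power iteration over a toy call graph. The function names in the graph, the damping factor of 0.85, and the iteration count are illustrative assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Classic PageRank by power iteration over an adjacency dict.

    graph maps each node to the list of nodes it links to (here: callees).
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Split this node's rank evenly among its callees.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical call graph: caller -> callees.
call_graph = {
    "main": ["parse_config", "connect"],
    "parse_config": ["read_file"],
    "connect": ["read_file", "send"],
    "read_file": [],
    "send": [],
}

ranks = pagerank(call_graph)
for name, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Power iteration repeats a pass over every edge until the scores settle, which is the cost that the course description contrasts with BinaryRank's linear computational cost.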

In the advanced module of the course, LLMs are presented to systematically distill and condense vital reverse engineering artifacts and data generated from static and dynamic analysis (e.g., sandbox detonation), enhancing their clarity and interpretability. The module begins by highlighting the capabilities of LLMs in function and whole-binary summarization, keeping in mind the token constraints of models like GPT-4, GPT-3.5, and Llama 2. To overcome these limitations, the module explores subsampling as a primary tool, allowing for targeted analysis while maintaining data relevance. Discussions also cover LLM-assisted malware analysis, aiming to generate malware signatures and comprehensive malware analysis reports that encapsulate vital details, streamlining information for reverse engineers. To ensure a practical grasp of these concepts, multiple labs are integrated throughout this module, specifically focusing on subsampling methods, function and whole-binary summarization techniques, and LLM-assisted malware analysis for malware signature and comprehensive report generation.
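As a rough illustration of subsampling under a token limit, the sketch below greedily keeps the highest-scored artifacts that fit within a budget. It is not the course's method: the helper names, the importance scores, the sample artifacts, and the word-count token estimate are all assumptions made for the example.

```python
def estimate_tokens(text):
    """Crude token estimate: whitespace-separated words (an assumption;
    real LLM tokenizers count differently)."""
    return len(text.split())


def subsample_by_budget(items, budget):
    """Greedily keep the highest-scored items that fit within the budget.

    items is a list of (score, text) pairs; returns the kept texts in
    selection order (highest score first).
    """
    kept = []
    used = 0
    for score, text in sorted(items, key=lambda pair: pair[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            kept.append(text)
            used += cost
    return kept


# Hypothetical artifacts from a binary, each with an importance score.
artifacts = [
    (0.9, "import: CreateRemoteThread"),
    (0.2, "string: Copyright 2019 Example Corp all rights reserved"),
    (0.7, "string: http://example.invalid/gate.php"),
    (0.5, "function: decrypt_payload"),
]

selected = subsample_by_budget(artifacts, budget=6)
print(selected)
```

With a budget of 6 estimated tokens, the three short, high-scoring artifacts are kept and the long, low-scoring copyright string is dropped. In practice the scores could come from a ranking over the binary, and the estimate would be replaced by the target model's own tokenizer.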

This comprehensive course is designed for reverse engineering practitioners with a foundational understanding of relevant reverse engineering concepts. Proficiency in object-oriented programming in Python is required for the hands-on labs. Additionally, a basic grasp of mathematical concepts, such as vectors (e.g., an ordered list of numbers), weighted averages, and distance metrics (e.g., Euclidean distance), is strongly recommended. This mathematical background will prepare participants to engage with and comprehend the Machine Learning (ML), Natural Language Processing (NLP), and Large Language Model (LLM) material covered in the course. With it, participants can expect to gain valuable insights and practical skills in automating reverse engineering processes using AI/ML, NLP, and LLMs.
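For readers gauging their comfort with these mathematical concepts, here is a minimal sketch of a vector, a weighted average of vectors, and the Euclidean distance between two vectors. The two-dimensional "embeddings" and the weights are made-up values chosen only to keep the arithmetic easy to follow.

```python
import math


def weighted_average(vectors, weights):
    """Combine equal-length vectors into one, scaling each by its weight."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 2-dimensional embeddings for two strings in a binary.
vec_a = [1.0, 0.0]
vec_b = [0.0, 1.0]

# Weight the first vector three times as heavily as the second.
combined = weighted_average([vec_a, vec_b], [3.0, 1.0])
print(combined)                            # [0.75, 0.25]
print(euclidean_distance(vec_a, vec_b))    # 1.4142135623730951
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```

The combined vector sits closer to vec_a because vec_a carries the larger weight, which is the intuition behind weighting more important textual elements more heavily when building a representation of a binary.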

PREREQUISITES

An understanding of reverse engineering concepts, including assembly languages, calling conventions, file formats, and Control Flow Graphs (CFGs), is essential for this course. Each student is expected to have at least two years of hands-on reverse engineering experience. Participants should also be proficient in Python object-oriented programming, which encompasses classes, inheritance, method overriding, and polymorphism, as these concepts are pivotal for the course labs. The course delves into Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs) within the context of malware and firmware analysis. Foundational knowledge of basic mathematical concepts, such as vectors (e.g., an ordered list of numbers), weighted averages, and metrics like Euclidean distance, is crucial. These mathematical foundations underpin the practical application of ML, NLP, and LLM techniques in the automation of reverse engineering tasks.

OBJECTIVES

Acquire proficiency in leveraging Machine Learning (ML), Artificial Intelligence (AI), Natural Language Processing (NLP), and Large Language Models (LLMs) to automate distinct aspects of reverse engineering processes, primarily focusing on malware and firmware analysis, to enhance the scalability and efficiency of these analyses.

Utilize prominent ML, NLP, and LLM techniques to effectively capture, represent, and analyze relevant features from binaries, enabling the automation of crucial reverse engineering tasks such as malware detection, firmware analysis, function naming, whole-binary summarization, malware signature generation, and malware report generation.

Apply hands-on skills in various labs to tackle real-world challenges within malware and firmware analysis domains, fostering a deeper understanding of the potential and limitations of ML, AI, NLP, and LLMs in automating reverse engineering tasks.

WHO SHOULD TAKE THIS COURSE

This course is ideal for reverse engineering practitioners aiming to delve into the automation of reverse engineering processes, chiefly in malware and firmware analysis, using Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs). It provides a blend of theoretical insights and practical skills to tackle modern challenges in malware and firmware analysis, and lays the groundwork for independent exploration (outside of the course) of other areas like vulnerability analysis and software bill of materials (SBOM) generation. With a structured curriculum, the course serves as a valuable stepping stone for those looking to advance their expertise in automating RE tasks within the cybersecurity domain.

WHO WOULD NOT BE A GOOD FIT FOR THIS COURSE

Individuals may not find this course suitable if they are not proficient in Python object-oriented programming, which is vital for navigating the course labs. Those who are uncomfortable with key mathematical concepts such as vectors, weighted averages, and metrics (e.g., Euclidean distance) may find the technical depth challenging. While the course explores Machine Learning (ML), Natural Language Processing (NLP), and Large Language Models (LLMs), it is principally designed for automating specific reverse engineering processes in malware and firmware analysis. Therefore, those seeking a broader or more generalized education on ML, NLP, or LLMs without a focus on reverse engineering may not find the course aligning with their learning objectives.

WHAT SHOULD STUDENTS BRING

Students must bring a laptop with at least 32 GB RAM, 250 GB free disk space, and an x86_64 processor with at least 4 cores (equivalent to an Intel i7 or higher) for running an x64 VM. Ensure the laptop can connect to external services for course-related Large Language Model (LLM) activities. VirtualBox version 7.0 or newer must also be pre-installed for the hands-on labs and exercises.

CHANGES FROM THE PREVIOUS OFFERING OF THE COURSE

In the ongoing effort to enhance the learning experience and ensure the adequacy of prerequisite knowledge, several notable changes have been made to the course structure and content from the previous offering. These changes are outlined below:

Refined Prerequisite Articulation: Following the feedback received, the prerequisite knowledge required of students is now articulated more clearly. Emphasis has been placed on proficiency in Python object-oriented programming and a basic understanding of relevant mathematical concepts. These refinements aim to ensure that students have the necessary background to engage effectively with the course material.

Introduction of Large Language Model Applications: The course now incorporates Large Language Model (LLM) applications, offering hands-on labs to allow students a practical experience. This addition brings to light the evolving landscape of automating reverse engineering processes, particularly in malware and firmware analysis.

Re-structured Course Content to Incorporate LLMs: The course has been completely restructured to seamlessly incorporate Large Language Models, aligning with the objective of automating various reverse engineering processes. This restructure enriches the course content, making it a more comprehensive and up-to-date offering that mirrors the current advancements in the field.

Updated Computer Resource Requirements: Based on past experiences where some students faced challenges due to inadequate computing resources that hindered their progress in labs, especially with Neural Network and NLP software packages, an update on the computer resource requirements has been introduced. This update aims to ensure that students are well-prepared with the necessary computing resources to effectively participate in the hands-on labs and have a smoother learning experience. It's crucial that students have access to adequate computing resources to prevent any slowdowns in their labs, which could impact their overall experience and learning outcomes.

BIO

Dr. Malachi Jones is a Principal Applied Scientist at Microsoft, working in the Microsoft Security AI (MSECAI) team. His work is focused on the application of Artificial Intelligence (AI) and Large Language Models (LLMs) for enhancing cybersecurity solutions. At MSECAI, Dr. Jones leads the development of Reverse Engineering (RE) capabilities for Security Copilot, employing advanced static and dynamic analysis methods, including sandbox detonation, combined with LLMs to improve malware analysis.

With over 15 years of experience in security research, Dr. Jones has contributed significantly to both the academic and industrial sectors. Prior to his current position at Microsoft, he worked at MITRE, where his projects included the application of machine learning and intermediate representation languages to automate reverse engineering tasks. During his tenure at MITRE, he also created and taught a course titled "Automating Reverse Engineering with Machine Learning and Binary Analysis." His experience further includes a role at Booz Allen Dark Labs, focusing on embedded security research and the development of LLVM IR-based tools for automated vulnerability assessment, leading to his co-authorship of US Patent 10,133,871.

Throughout his career, Dr. Jones has been dedicated to education and training. He has consistently taught courses on Automating Reverse Engineering at all his employers, including Microsoft. His independent research endeavors in AI/ML, NLP, and LLMs have not only kept him at the forefront of technological advancements but also ensured that the courses he teaches are up-to-date with the latest developments in the field.

Dr. Jones held an Adjunct Professor position at the University of Maryland at College Park from 2019 to 2020, where he taught "Machine Learning Techniques Applied to Cyber Security." Additionally, he has taught his specialized course on Automating Reverse Engineering with Machine Learning, Binary Analysis, and Natural Language Processing at Black Hat USA in 2019, 2021, and 2023, as well as at RECON Montreal in 2023.

Dr. Jones holds a B.S. in Computer Engineering from the University of Florida and an M.S. and PhD from Georgia Tech, where he focused on Game Theory applications in cybersecurity. His extensive knowledge and practical experience in AI, ML, and cybersecurity are instrumental to his contributions in the field.

To Register

Click here to register.