Fine-tuning a LLM for DNA Sequence Classification

Abstract

Having prepared, trained, and used a language model for DNA sequences, we can now fine-tune a pre-trained Large Language Model (LLM) for specific DNA sequence classification tasks. Here, we use a pre-trained model from Hugging Face, Mistral-DNA-v1-17M-hg38, and adapt it to classify DNA sequences by biological function. The objective is to classify sequences according to whether they bind transcription factors.

About This Material

This is a hands-on tutorial from the GTN that can be used either for individual self-study or as teaching material in a classroom.

Questions this will address

How can a DNA sequence be classified according to whether or not it binds a protein (a transcription factor)?

Learning Objectives

Load a pre-trained model and modify its architecture to include a classification layer.
Prepare and preprocess labeled DNA sequences for fine-tuning.
Define and configure training parameters to optimize the model's performance on the classification task.
Evaluate the fine-tuned model's accuracy and robustness in distinguishing between different classes of DNA sequences.
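
The objectives above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tutorial's own code: the repository id `RaphaelMourad/Mistral-DNA-v1-17M-hg38` is an assumed location for the model named in the abstract, and the helper names `load_classifier`, `split_dataset`, and `binary_metrics` are hypothetical. The training configuration itself is left out.

```python
"""Sketch of a fine-tuning workflow for binary DNA sequence classification."""
import random

# Assumption: Hugging Face repository id for the model named in the abstract.
MODEL_ID = "RaphaelMourad/Mistral-DNA-v1-17M-hg38"


def load_classifier(model_id=MODEL_ID, num_labels=2):
    """Load a tokenizer and the pre-trained model with a classification head.

    transformers is imported lazily so the pure-Python helpers below
    can be used without it installed.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels, trust_remote_code=True
    )
    # Decoder-style classifiers use the padding id to find the last real token.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model


def split_dataset(records, valid_fraction=0.2, seed=0):
    """Shuffle (sequence, label) pairs and split into train/validation lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def binary_metrics(labels, predictions):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Made-up toy sequences; label 1 stands for "binds a transcription factor".
    toy = [("ACGTAC", 1), ("TTGACA", 0), ("GGCCAA", 1), ("ATATAT", 0), ("CGCGTA", 1)]
    train, valid = split_dataset(toy)
    print(len(train), "training and", len(valid), "validation sequences")
    print(binary_metrics([1, 0, 1, 0], [1, 0, 0, 0]))
```

In practice, the tokenizer and model returned by `load_classifier` would be passed to a training loop, and the predictions on the validation split would be fed to `binary_metrics`.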

Licence: Creative Commons Attribution 4.0 International

Keywords: Large Language Model, Statistics and machine learning, ai-ml, elixir, jupyter-notebook

Target audience: Students

Resource type: e-learning

Version: 4

Status: Active

Prerequisites:

Deep Learning (without Generative Artificial Intelligence) using Python
Foundational Aspects of Machine Learning using Python
Introduction to Python
Neural networks using Python
Pretraining a Large Language Model (LLM) from Scratch on DNA Sequences
Python - Warm-up for statistics and machine learning

Date modified: 2025-05-22

Date published: 2025-04-17

Authors: Bérénice Batut, Raphael Mourad

Contributors: Anup Kumar, Björn Grüning, Bérénice Batut, Wandrille Duchemin, olisand

Scientific topics: Statistics and probability

External resources: