Fine-tuning a LLM for DNA Sequence Classification

Abstract

Having prepared, trained, and used a language model for DNA sequences, we can now fine-tune a pre-trained Large Language Model (LLM) for a specific DNA sequence classification task. Here we use a pre-trained model from Hugging Face, Mistral-DNA-v1-17M-hg38, and adapt it to classify DNA sequences by biological function. Our objective is to classify sequences according to whether they bind transcription factors.

About This Material

This is a hands-on tutorial from the GTN, usable either for individual self-study or as teaching material in a classroom.

Questions this will address

  • How can a DNA sequence be classified according to whether or not it binds a protein (a transcription factor)?

Learning Objectives

  • Load a pre-trained model and modify its architecture to include a classification layer.
  • Prepare and preprocess labeled DNA sequences for fine-tuning.
  • Define and configure training parameters to optimize the model's performance on the classification task.
  • Evaluate the fine-tuned model's accuracy and robustness in distinguishing between different classes of DNA sequences.
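
As a rough illustration of the data preparation and evaluation objectives above, the following minimal sketch splits labeled sequences into training and test sets while preserving class balance, then computes accuracy, precision, and recall for a binary "binds / does not bind" task. It is not the tutorial's own code: the sequences are made-up toy data, and the names `stratified_split`, `binary_metrics`, and `placeholder_predict` are hypothetical. In particular, `placeholder_predict` is a simple GC-content threshold that only stands in for the predictions a fine-tuned model would produce.

```python
import random
from collections import Counter

# Toy labeled sequences (made up for illustration only):
# label 1 = bound by the transcription factor, label 0 = not bound.
LABELED = [
    ("GCGCGGCCGCGTACGC", 1),
    ("CCGGGCGCGCATGCGC", 1),
    ("GGCCGCGCTAGCGGCC", 1),
    ("ATATGCGCGCGCGGCA", 1),
    ("ATATATTTAAATGCAT", 0),
    ("TTTAAATATACGTATA", 0),
    ("AATTATATGCATTTAA", 0),
    ("GCGCATATATTTAAAT", 0),
]


def stratified_split(examples, test_fraction=0.25, seed=0):
    """Split (sequence, label) pairs so that each class is represented
    in both the training and the test set in similar proportions."""
    rng = random.Random(seed)
    by_label = {}
    for sequence, label in examples:
        by_label.setdefault(label, []).append((sequence, label))

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Keep at least one example of every class in the test set.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def placeholder_predict(sequence):
    """Hypothetical stand-in for a fine-tuned model: predicts class 1
    when more than half of the bases are G or C."""
    gc = (sequence.count("G") + sequence.count("C")) / len(sequence)
    return 1 if gc > 0.5 else 0


train_set, test_set = stratified_split(LABELED)
print("train class counts:", Counter(label for _, label in train_set))
print("test class counts:", Counter(label for _, label in test_set))

y_true = [label for _, label in test_set]
y_pred = [placeholder_predict(sequence) for sequence, _ in test_set]
print(binary_metrics(y_true, y_pred))
```

Reporting precision and recall alongside accuracy is a deliberate choice in this sketch: when bound and unbound sequences are unevenly represented, accuracy alone can look good even if one class is rarely predicted correctly.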

Licence: Creative Commons Attribution 4.0 International

Keywords: AI & ML, ELIXIR, Large Language Model, Statistics and machine learning, jupyter-notebook, work-in-progress

Target audience: Students

Resource type: e-learning

Version: 2

Status: Draft

Prerequisites:

  • Deep Learning (without Generative Artificial Intelligence) using Python
  • Foundational Aspects of Machine Learning using Python
  • Introduction to Python
  • Neural networks using Python
  • Pretraining a Large Language Model (LLM) from Scratch on DNA Sequences
  • Python - Warm-up for statistics and machine learning

Date modified: 2025-04-25

Date published: 2025-04-17

Authors: Bérénice Batut, Raphael Mourad

Contributors: Anup Kumar, Björn Grüning, Bérénice Batut, Wandrille Duchemin, olisand

Scientific topics: Statistics and probability

