Generating Artificial Yeast DNA Sequences using a DNA LLM

Abstract

Generating synthetic DNA sequences with pre-trained language models connects synthetic biology and artificial intelligence: a model trained on genomic data can produce novel DNA sequences that closely resemble natural genomes. Such biologically relevant sequences could support work in genetic engineering, drug discovery, and the study of genomic function.

About This Material

This is a hands-on tutorial from the GTN that can be used for individual self-study or as teaching material in a classroom.

Questions this will address

  • How do you set up a computational environment for generating synthetic DNA sequences using pre-trained language models?
  • What role does the temperature parameter play in controlling the variability of generated DNA sequences?
  • How can you compare generated synthetic DNA sequences with real genomic sequences using k-mer counts and PCA?
  • What is the significance of performing BLAST searches on generated DNA sequences, and how do you interpret the results?
  • How can you detect open reading frames (ORFs) in generated DNA sequences and translate them into amino acid sequences?
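
To make the temperature question above concrete, here is a minimal sketch of temperature-scaled sampling. It is not taken from the tutorial: the four-nucleotide vocabulary, the logit values, and the function names are illustrative assumptions (real DNA language models usually work on larger token vocabularies such as k-mers or subword units). It only shows the general idea that dividing logits by a temperature below 1 sharpens the distribution, while a temperature above 1 flattens it.

```python
import math
import random

# Assumption: a toy vocabulary of single nucleotides, for illustration only.
NUCLEOTIDES = ["A", "C", "G", "T"]


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(NUCLEOTIDES, weights=probs, k=1)[0]


# Made-up logits standing in for a model's next-token scores.
example_logits = [2.0, 1.0, 0.5, 0.1]

low_temperature_probs = softmax_with_temperature(example_logits, 0.5)
high_temperature_probs = softmax_with_temperature(example_logits, 2.0)

rng = random.Random(0)
sampled_sequence = "".join(sample_token(example_logits, 1.0, rng) for _ in range(20))
```

With these made-up logits, the most likely token receives a larger share of the probability at temperature 0.5 than at temperature 2.0, which is why lower temperatures tend to give less variable sequences.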

Learning Objectives

  • Describe the process of generating synthetic DNA sequences using pre-trained language models and explain the significance of temperature settings in controlling sequence variability.
  • Set up a computational environment (e.g., Google Colab) and configure a pre-trained language model to generate synthetic DNA sequences, ensuring all necessary libraries are installed and configured.
  • Use k-mer counts and Principal Component Analysis (PCA) to compare generated synthetic DNA sequences with real genomic sequences, identifying similarities and differences.
  • Perform BLAST searches to assess the novelty of generated DNA sequences and interpret the results to determine the biological relevance and uniqueness of the synthetic sequences.
  • Develop a pipeline to detect open reading frames (ORFs) within generated DNA sequences and translate them into amino acid sequences, demonstrating the potential for creating novel synthetic genes.
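
As an illustration of the last objective, the following sketch detects open reading frames and translates them. It is an assumption-laden toy, not the tutorial's pipeline: it scans only the three forward reading frames, uses the standard genetic code, requires an ATG start and an in-frame stop codon, and the names `find_orfs` and `min_codons` are hypothetical. A fuller pipeline would typically also scan the reverse complement and apply a realistic minimum length.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP = "*"


def find_orfs(sequence, min_codons=2):
    """Return (start, end, protein) tuples for forward-strand ORFs.

    start and end are 0-based, end-exclusive, and include the stop codon.
    min_codons counts translated codons, excluding the stop codon.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] != "ATG":
                i += 3
                continue
            protein = []
            j = i
            stop_found = False
            while j + 3 <= len(seq):
                # Codons with ambiguous bases (e.g. N) are translated as X.
                aa = CODON_TABLE.get(seq[j:j + 3], "X")
                if aa == STOP:
                    stop_found = True
                    break
                protein.append(aa)
                j += 3
            if not stop_found:
                break  # no in-frame stop codon: not a complete ORF
            if len(protein) >= min_codons:
                orfs.append((i, j + 3, "".join(protein)))
            i = j + 3  # resume scanning after the stop codon
    return orfs


# Hypothetical toy sequence, not model output.
toy_orfs = find_orfs("CCATGGCTTAAGG", min_codons=2)
```

For generated sequences, one would call `find_orfs` on each sequence and keep the longer ORFs for downstream checks such as BLAST searches.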

Licence: Creative Commons Attribution 4.0 International

Keywords: AI & ML, ELIXIR, Large Language Model, Statistics and machine learning, jupyter-notebook, work-in-progress

Target audience: Students

Resource type: e-learning

Version: 2

Status: Draft

Prerequisites:

  • Deep Learning (without Generative Artificial Intelligence) using Python
  • Fine-tuning a LLM for DNA Sequence Classification
  • Foundational Aspects of Machine Learning using Python
  • Introduction to Python
  • Neural networks using Python
  • Predicting Mutation Impact with Zero-shot Learning using a pretrained DNA LLM
  • Pretraining a Large Language Model (LLM) from Scratch on DNA Sequences
  • Python - Warm-up for statistics and machine learning

Date modified: 2025-04-25

Date published: 2025-04-17

Authors: Bérénice Batut, Raphael Mourad

Contributors: Anup Kumar, Björn Grüning, Bérénice Batut, Wandrille Duchemin, olisand

Scientific topics: Statistics and probability
