A Closer Look at Automating Human Genome Sequencing

An in-depth guide to how the secondary pipeline of genome sequencing works and the different aspects of it.

What is a Genome?

A genome is an organism’s complete set of DNA. Each DNA molecule is made of two twisting strands, and each strand is made up of four basic chemical units called nucleotide bases: adenine (A), thymine (T), guanine (G) and cytosine (C). Bases on opposite strands pair specifically: an A always pairs with a T, and a C always pairs with a G.
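The pairing rule can be illustrated with a few lines of Python. This is only a minimal sketch: the name `complement` is chosen for illustration, and it assumes the strand is written with the four letters A, T, G and C.

```python
# Pairing rule from the text: A pairs with T, and C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the bases found opposite `strand`, position by position."""
    return "".join(PAIRS[base] for base in strand.upper())


print(complement("ATGC"))  # TACG
```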

What is DNA sequencing?

DNA sequencing is the process of determining the order of nucleotide bases in a piece of DNA. For a single short piece, this is relatively simple. The difficult part is sequencing an entire organism’s DNA, which involves breaking all of the organism’s DNA molecules into smaller pieces and then putting them back in order.

Difficulties associated with Human Genome Sequencing

Human genome sequencing is a complicated and time-consuming process that has become manageable thanks to advancements in computing power. With the advent of cloud computing, we can take this a step further and automate parts of the process so that it finishes in less time.

Different Layers of human genome sequencing

  • Primary Analysis
  • Secondary Analysis
  • Tertiary Analysis

Primary Analysis:

Primary analysis is carried out by DNA sequencers such as the MiniSeq, which convert the physical raw sample into raw sequence data stored in FASTQ files.
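To make the FASTQ output more concrete, here is a small sketch of a reader for it. It assumes the common four-line FASTQ layout (a header line starting with "@", the bases, a separator line starting with "+", and one quality character per base). The names `FastqRecord` and `parse_fastq` are made up for this example, and the read shown is invented.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record for every four-line FASTQ entry."""
    stripped = (line.rstrip("\r\n") for line in lines)
    while True:
        header = next(stripped, None)
        if header is None:
            return  # end of input
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, "")
        separator = next(stripped, "")
        quality = next(stripped, "")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence and quality lengths differ in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


example = ["@read1\n", "GATTACA\n", "+\n", "IIIIIII\n"]
for record in parse_fastq(example):
    print(record.read_id, record.sequence, record.quality)  # read1 GATTACA IIIIIII
```

In practice the lines would come from an open file (or from `gzip.open(path, "rt")` for compressed files) rather than from a list.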

Secondary Analysis:

Secondary analysis takes the short sequences, or reads, from the FASTQ files and puts them in the right order.

Tertiary Analysis:

In tertiary analysis, we leverage big data and machine learning to get insights out of the data generated by the secondary analysis.

A closer look at automating the secondary pipeline of human genome sequencing

We focus on the secondary analysis because it is the part we need to automate to reduce the run time for each sample.

  • Each sample has one or more FASTQ files.
  • A sample sheet CSV is generated at the start of a run for a batch of samples. It lists the names of all the samples that are about to run.
  • Two XML files are generated. The run parameters XML is generated when the run starts, for the samples that are going to run. The run info XML (RunInfo.xml) is generated for each sample when the run ends for that sample.
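As an illustration of how a script might read the sample names from the sample sheet, here is a minimal sketch. The layout of the sheet is an assumption: a plain CSV table with a header row and a column called `Sample_ID`. Both that column name and the function name `read_sample_names` are hypothetical, and a real sample sheet may contain extra sections that would need to be skipped first.

```python
import csv
import io
from typing import List, TextIO


def read_sample_names(sheet: TextIO, column: str = "Sample_ID") -> List[str]:
    """Return the non-empty values of `column` from a CSV sample sheet."""
    reader = csv.DictReader(sheet)
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"sample sheet has no {column!r} column")
    names = []
    for row in reader:
        name = (row.get(column) or "").strip()
        if name:  # skip rows where the sample name is blank
            names.append(name)
    return names


# Invented sheet contents, for illustration only.
sheet = io.StringIO("Sample_ID,Description\nS1,first\nS2,second\n")
print(read_sample_names(sheet))  # ['S1', 'S2']
```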

For each sample, the FASTQ files should be ready for secondary analysis once the run info XML file has been generated for that sample.

The goal now is to upload the FASTQ files for the sample to a storage service such as Amazon S3, where the secondary analysis API (provided as a service) takes the samples and runs the secondary analysis. This can be achieved by writing a Python script that runs periodically in the background on the server where the files from the DNA sequencer are generated.
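One way to get the "runs periodically in the background" behaviour is a simple polling loop. The sketch below is an assumption about how such a script could be organised, not a description of an existing tool: `poll_samples`, `handle_sample` and the one-folder-per-sample layout under `root` are all hypothetical.

```python
import time
from pathlib import Path
from typing import Callable, Optional, Set


def poll_samples(
    root: Path,
    handle_sample: Callable[[Path], bool],
    interval_seconds: float = 60.0,
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[Path]:
    """Repeatedly visit each sample folder under `root`.

    `handle_sample` should return True once a folder has been fully
    processed; that folder is then skipped in later cycles.
    With max_cycles=None the loop runs until the process is stopped.
    """
    done: Set[Path] = set()
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for sample_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            if sample_dir not in done and handle_sample(sample_dir):
                done.add(sample_dir)
        cycle += 1
        if max_cycles is None or cycle < max_cycles:
            sleep(interval_seconds)  # wait before the next scan
    return done
```

The set of finished folders lives only in memory here, so a restart would process them again. A scheduler such as cron calling a single-pass script is an equally reasonable design.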

 

This is what the script should achieve:

  • Look for the RunInfo.xml file in the sample's folder on the server.
  • If the XML file exists, check that all the FASTQ files for the sample exist, and upload them to the Amazon S3 bucket using a Python library called boto3.
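The two steps above can be sketched as follows. This is a hedged illustration rather than the actual script: the function name `upload_if_ready`, the list of expected FASTQ file names, the key layout `<sample folder>/<file name>`, and the paths and bucket name in `main` are all assumptions. The only boto3 calls used are `boto3.client("s3")` and the client's `upload_file(Filename, Bucket, Key)`.

```python
from pathlib import Path
from typing import Iterable, List


def upload_if_ready(
    sample_dir: Path,
    expected_fastq: Iterable[str],
    s3_client,
    bucket: str,
) -> List[str]:
    """Upload a sample's FASTQ files once RunInfo.xml and every expected file exist.

    Returns the S3 keys that were uploaded, or an empty list if the
    sample is not ready yet.
    """
    if not (sample_dir / "RunInfo.xml").is_file():
        return []  # the run has not finished for this sample
    paths = [sample_dir / name for name in expected_fastq]
    if not paths or not all(path.is_file() for path in paths):
        return []  # at least one FASTQ file is still missing
    keys = []
    for path in paths:
        key = f"{sample_dir.name}/{path.name}"  # assumed key layout
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys


def main() -> None:
    # boto3 is imported here so the function above can be tried without AWS access.
    import boto3

    client = boto3.client("s3")
    # Hypothetical folder, file names and bucket, for illustration only.
    uploaded = upload_if_ready(
        Path("/data/runs/sample_001"),
        ["sample_001_R1.fastq.gz", "sample_001_R2.fastq.gz"],
        client,
        "my-genomics-bucket",
    )
    print(uploaded)
```

Calling `main()` requires boto3 to be installed and AWS credentials to be configured. Passing the client in as a parameter keeps the readiness check easy to test with a stand-in object.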

Once the files have been uploaded, the secondary analysis API completes the process, and the output of the secondary analysis is ready to be used with machine learning, big data tools, etc.
