Input file format

Sequencing raw data file

The tool needs two types of input: the FASTQ file containing the raw sequencing reads, and the information files of the loci. The FASTQ file format is described in the Illumina documentation and on Wikipedia.

Sequencing raw data file example
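
For readers unfamiliar with FASTQ, the following is a minimal sketch of a reader, not code taken from the tool. It assumes the common four-line record layout (an `@` header, the bases, a `+` separator, and a quality string of the same length as the bases). The function name `read_fastq` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, bases, qualities) from four-line FASTQ records."""
    iterator = (line.rstrip("\r\n") for line in lines)
    for header in iterator:
        if not header:
            continue  # tolerate blank lines between records
        bases = next(iterator, None)
        separator = next(iterator, None)
        qualities = next(iterator, None)
        if qualities is None:
            raise ValueError(f"truncated FASTQ record: {header!r}")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(bases) != len(qualities):
            raise ValueError(f"bases and qualities differ in length: {header!r}")
        yield header[1:], bases, qualities
```

The function accepts any iterable of lines, so an open text file handle can be passed directly.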

Information files of the loci

There are two files for the loci:

Sequence file of the loci. An example is available on GitHub:

>s258878
CGGGCTGTTACTGTGACTCTGTCAGCTGGCTTTCGATTCTGATAAATGAAATAGGGAGATCTGGTGCAGAATCACTTCGATCGATTCGACTCAAGCAGTATCAGTTTACAGATTCCGTTCCGTTTTGCAAATCCAGGGCTAACATCTACGTTACAAATCAGTAAGCCAAAGTAAAGTGAAAAAGAAAAGGTTTTAGCTGTACTAGTCAGTTCTCAGTTAACTGTGCCTTGATCTGTTTGCCAGCTGGGGCAAGATTTCTTCGGTGCCTGCACCCTTTCTCTTGCGGTTGTTTATGCTATCgctgctgctgctgctGATTGTGGGGTTCCTGCGTTCGCCACTGTGACTGTCACTTTGCTGGTGCTGTTCCTGGTTGCATCTGCTTTTCAGTATGTGGGGCTTGAGCTTGTTCATGTCTGATAGCCGCACAGGTTTTAGGAAGAACTCGTCAGCTCCTTCCTCCAGGCACCTGGTTCAGAATTTCAAAACCAGGATTGTTAGTTTTCTCTCTCTCCCGTTGATCATATATAGAGCATCATATATATGCTGTTAAAAGACAAGTCTTGCAACTTGTGTTTTTTTATATATTTTTATTTTTGGATATGTCATATCTTA
>s282049
TATTCTTCCGGAGGAGGGAGAGACCGATCATGGGTCAGCGTCACCATGGTTGGAACAAAGTCCCCGGATTGGATTGGATTGGATCTCAGTATACATGAGACATGCATGAGAAGCCCCCCCCTACCCCTTTTTTCTTCTTTTTTTTTCTTTTTCACTACTACTACTAGTATTTCTCTTTTTTTTCTAGCTTGTTTTTCACCTTTACTTTCTTTGGACAAAGAGGGGGAAAAAGAAGCAACTTCCTTGAAGGCGAAGGTAGGTATAGTAGTAATTATCAATAGTGGTGGGCATAAGACATGAtggtggtggtggtggTTCATGGGGTTTATAGTAAGGAGTAGCTCAAGAAGGGTGGCTCTCATGGCTCCCCCCTTCTTGTTGCTCCACAATGTGATGATGGTATGGAGGTGGCATGGTAGAAAAGCTTGGTGAATCTCAATGTGCTTTCGAGTGCCGTGTCTGTCCCTAGTGGAATGGAATGTCTCTGGTGTCCCCCCGTTCATTTTTTTTAATCTTTATGGCTTTCTGCATGCTCATCGGATCACCAGATTAAAATTTAGAAGTTGAATTTTAGGTTAGACTTTGTATGTTCAAAGAGCACCCGATAAAAATCTA

This is a FASTA file (https://en.wikipedia.org/wiki/FASTA_format) in which each sequence corresponds to one locus. The SSR region of each sequence is written in lower-case bases, and the flanking regions are written in upper-case bases.
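
The case convention above can be used to recover the flanks and the SSR region from a locus sequence. The following is a minimal sketch, not code taken from the tool. It assumes standard FASTA headers starting with `>` and a single contiguous lower-case run per sequence. The function names `read_fasta` and `split_locus` are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {locus_id: sequence}, preserving letter case."""
    records: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header without a locus ID")
            current = header[0]  # the locus ID is the first word of the header
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def split_locus(sequence: str) -> Tuple[str, str, str]:
    """Return (left_flank, ssr, right_flank), taking the first lower-case run as the SSR."""
    match = re.search(r"[a-z]+", sequence)
    if match is None:
        raise ValueError("no lower-case SSR region found")
    return sequence[:match.start()], match.group(), sequence[match.end():]
```

For a toy record such as `>locusA` followed by `ACGTACgctgctGGTA`, `split_locus` would return the left flank `ACGTAC`, the SSR `gctgct`, and the right flank `GGTA`.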

Statistics file of the loci. An example is available at https://github.com/plantdna/amgt-ts/blob/master/ref/sites.motif.info.stats:

#ID SSR.No Motif_len Motif Repeat_times Start End Repeat_len Seq_len
s258878 1 3 GCT 5 301 315 15 615
s282049 1 3 TGG 5 301 315 15 615

It has 9 columns, with the following meanings:

2.1 #ID The locus ID, identical to the locus name in the sequence file of the loci.
2.2 SSR.No The index of the motif within the locus. Multiple motifs in one SSR region are supported, but in the common use case this value is always 1.
2.3 Motif_len The number of bases in the motif.
2.4 Motif The repeat unit of the locus.
2.5 Repeat_times The number of times the motif is repeated in this locus.
2.6 Start The position of the first base of the SSR region in the locus sequence. The bases from position 1 to Start - 1 form the left flank of the locus.
2.7 End The position of the last base of the SSR region in the locus sequence. The bases from End + 1 to the last base form the right flank of the locus.
2.8 Repeat_len Column 3 (Motif_len) multiplied by column 5 (Repeat_times).
2.9 Seq_len The length of the locus sequence, that is, the number of bases of the locus in the sequence file of the loci.
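
The column definitions above imply a few consistency checks that can be run on a statistics file before using it. The following is a minimal sketch, not code taken from the tool. It assumes whitespace-separated columns, 1-based inclusive Start and End positions, and a single motif per locus, so that End - Start + 1 equals Repeat_len (this holds for both example rows). The function names `parse_stats`, `check_row`, and `ssr_region` are hypothetical.

```python
from typing import Dict, Iterable, List, Union

COLUMNS = ["ID", "SSR.No", "Motif_len", "Motif", "Repeat_times",
           "Start", "End", "Repeat_len", "Seq_len"]
TEXT_COLUMNS = {"ID", "Motif"}  # every other column holds an integer

Row = Dict[str, Union[str, int]]


def parse_stats(lines: Iterable[str]) -> List[Row]:
    """Parse the statistics file, skipping blank lines and the '#' header line."""
    rows: List[Row] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}: {line!r}")
        rows.append({name: value if name in TEXT_COLUMNS else int(value)
                     for name, value in zip(COLUMNS, fields)})
    return rows


def check_row(row: Row) -> List[str]:
    """Return a list of inconsistencies found in one row (empty if none)."""
    problems = []
    if len(row["Motif"]) != row["Motif_len"]:
        problems.append("Motif_len does not match the length of Motif")
    if row["Motif_len"] * row["Repeat_times"] != row["Repeat_len"]:
        problems.append("Repeat_len is not Motif_len * Repeat_times")
    # Assumption: one motif per locus, so the SSR region spans exactly Repeat_len bases.
    if row["End"] - row["Start"] + 1 != row["Repeat_len"]:
        problems.append("End - Start + 1 does not match Repeat_len")
    if not 1 <= row["Start"] <= row["End"] <= row["Seq_len"]:
        problems.append("Start and End are not inside the sequence")
    return problems


def ssr_region(sequence: str, row: Row) -> str:
    """Slice the SSR region out of a locus sequence using 1-based inclusive positions."""
    return sequence[row["Start"] - 1:row["End"]]
```

Both example rows pass these checks: 3 × 5 = 15 and 315 - 301 + 1 = 15.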

If you have any questions, feel free to contact us by email. Thank you for your interest in this tool.

Summary

AMGT-TS input file format

An Accurate and Ultra-deep Coverage Method for Large-scale SSR Genotyping with SNPs in the SSR and Flanking Region Compatible
