RNA Structure Formats


In the world of RNA secondary structure studies there have, to date, been just two dominant structure formats: Mike Zuker's 'ct' or connect format, and the 'dot-bracket' format used by, amongst others, the Vienna group. Many useful tools use these formats as input (and output), and hence they have become unofficial standards. The official standard, RNAML, has not been as widely accepted as its inventors would have liked. I personally have not figured out how to (ab)use it properly, and this is not through lack of trying either! My personal (highly prejudiced) opinion is that the dot-bracket and connect/tabular formats are sufficient for dealing with secondary structures. A further development is the more sensible "column format", which has all the advantages of connect format but is more compact and general. Of course, when considering 3D structure the gloves are off: PDB format is probably what you want.

When trying to evaluate a number of RNA folding packages I encountered a number of additional input and output formats which, to my mind, unnecessarily complicated my task! I beg researchers to try to adhere to fasta and dot-bracket/connect formats. Scripting languages such as Perl make conversions easier but, as you'll see below, there are limits to what Perl can do! Vienna ships with ct2b.pl and b2ct.c, which can be used to convert between most bracket and ct formats. Pseudo-knots are not dealt with, but these scripts can easily be modified to include knots.
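
To make the remark about knots concrete, here is a minimal Python sketch of one way a converter could write a list of base-pairs as a bracket string while coping with pseudo-knots. It is not how ct2b.pl or b2ct.c work; the function name, the greedy level assignment, and the order of bracket types are all assumptions made for illustration. Each pair goes to the first bracket type in which it crosses no pair already placed there.

```python
# Bracket types tried in order; the order is an arbitrary choice for this sketch.
BRACKETS = ["()", "[]", "{}", "<>", "Aa", "Bb", "Cc"]

def pairs_to_dot_bracket(length, pairs):
    """Write 1-based base-pairs (i, j) as a bracket string of the given length."""
    levels = [[] for _ in BRACKETS]  # pairs already assigned to each bracket type
    out = ["."] * length
    for i, j in sorted((min(p), max(p)) for p in pairs):
        for level, (open_ch, close_ch) in zip(levels, BRACKETS):
            # Two pairs cross when exactly one end of one lies inside the other.
            crosses = any(k < i < l < j or i < k < j < l for k, l in level)
            if not crosses:
                level.append((i, j))
                out[i - 1], out[j - 1] = open_ch, close_ch
                break
        else:
            raise ValueError("more crossing levels than bracket types")
    return "".join(out)

# A nested helix (1-8, 2-7) crossed by a second helix (4-11, 5-10).
print(pairs_to_dot_bracket(11, [(1, 8), (2, 7), (4, 11), (5, 10)]))
# prints ((.[[.)).]]
```

The greedy assignment is simple rather than optimal: it always gives an unambiguous string, but not necessarily one using the fewest bracket types.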


Connect format:

Columns 1, 3, 4, and 6 redundantly give sequence indices (the position 'i', its predecessor, its successor, and 'i' again). The informative columns are 2 and 5: column 2 gives the residue at position 'i', and column 5 gives 'j' if (i,j) is a base-pair, otherwise zero. One could envisage encoding multiple (aligned) sequences and structures in this format by alternating sequence and structure columns in the one file.
   73 ENERGY =     -17.50    S.cerevisiae_tRNA-PHE
    1 G       0    2   72    1
    2 C       1    3   71    2
    3 G       2    4   70    3
    4 G       3    5   69    4
    5 A       4    6   68    5
    6 U       5    7   67    6
    7 U       6    8   66    7
    8 U       7    9    0    8

                 .
                 .
                 .

   66 A      65   67    7   66
   67 A      66   68    6   67
   68 U      67   69    5   68
   69 U      68   70    4   69
   70 C      69   71    3   70
   71 G      70   72    2   71
   72 C      71   73    1   72
   73 A      72   74    0   73
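
Reading this format takes only a few lines. The Python sketch below assumes the layout shown above: a header line starting with the length, then six whitespace-separated columns per residue. The function name is made up for illustration, and taking the last token of the header as the title is a guess that happens to fit this example.

```python
def parse_ct(text):
    """Return (title, sequence, pairs) from connect-format text.

    pairs[i] is the 1-based partner of position i, or 0 if unpaired;
    pairs[0] is unused padding so that indices match the file.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    length = int(header[0])
    # Assumption: the title is the last token on the header line.
    title = header[-1] if len(header) > 1 else ""
    sequence = ["N"] * length  # positions missing from the file stay 'N'
    pairs = [0] * (length + 1)
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) < 6 or not fields[0].isdigit():
            continue  # skip anything that is not a six-column record
        i, base, j = int(fields[0]), fields[1], int(fields[4])
        sequence[i - 1] = base
        pairs[i] = j
    return title, "".join(sequence), pairs

example = """\
    5 ENERGY =      -1.00    toy_hairpin
    1 G       0    2    5    1
    2 A       1    3    0    2
    3 A       2    4    0    3
    4 A       3    5    0    4
    5 C       4    0    1    5
"""
print(parse_ct(example))
```

Only columns 1, 2, and 5 are read, which is the point made above: the other three columns carry nothing that cannot be recomputed.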

Dot-bracket format:

Matching parentheses in positions 'i' and 'j' indicate a base-pair; otherwise a '.' is used. Many people complain that this format cannot represent pseudo-knots in an unambiguous fashion. However, using additional parenthesis types '[', ']', '{', '}', '<', '>', 'A', 'a', 'B', 'b', 'C', 'c', ... one can represent extremely high order knots in an unambiguous fashion. Alternatively, as Sean Eddy discusses in his Infernal documentation, these can be used to 'mark up' the structure to discriminate between different loop types.
>S.cerevisiae_tRNA-PHE M10740/1-73
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
(((((((..((((........)))).((((.........)))).....(((((.......)))))))))))). (-17.50)
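
The matching rule can be sketched in Python with one stack per bracket type, so that '(' only ever closes with ')', '[' with ']', and an upper-case letter with its lower-case partner. The function name and the choice to treat every other character as unpaired are assumptions made for this illustration.

```python
def parse_dot_bracket(structure):
    """Return a 1-based partner list; entry 0 is padding and 0 means unpaired."""
    openers = {")": "(", "]": "[", "}": "{", ">": "<"}
    stacks = {}  # one stack of open positions per bracket type
    pairs = [0] * (len(structure) + 1)
    for pos, ch in enumerate(structure, start=1):
        if ch in "([{<" or ch.isupper():
            stacks.setdefault(ch, []).append(pos)
        elif ch in openers or ch.islower():
            key = openers.get(ch, ch.upper())
            if not stacks.get(key):
                raise ValueError(f"unmatched {ch!r} at position {pos}")
            partner = stacks[key].pop()
            pairs[pos], pairs[partner] = partner, pos
        # any other character ('.', '-', ...) is treated as unpaired
    leftover = sorted(p for stack in stacks.values() for p in stack)
    if leftover:
        raise ValueError(f"unclosed brackets at positions {leftover}")
    return pairs

print(parse_dot_bracket("((.[[.)).]]"))
# prints [0, 8, 7, 0, 11, 10, 0, 2, 1, 0, 5, 4]
```

A structure line that carries a trailing energy, as in the example above, needs that stripped first, for instance with `line.split()[0]`.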


Column format:

; ========================================================================
; TYPE              RNA
; COL 1             label
; COL 2             residue
; COL 3             seqpos
; COL 4             align_bp
; ENTRY             trna
; LENGTH            73
; ----------
N     G     1    72
N     C     2    71
N     G     3    70
N     G     4    69
N     A     5    68
N     U     6    67
N     U     7    66
N     U     8     .
N     A     9     .
N     G    10    25
N     C    11    24
N     U    12    23
N     C    13    22
N     A    14     .

                 .
                 .
                 .

N     C    60     .
N     C    61    53
N     A    62    52
N     C    63    51
N     A    64    50
N     G    65    49
N     A    66     7
N     A    67     6
N     U    68     5
N     U    69     4
N     C    70     3
N     G    71     2
N     C    72     1
N     A    73     .
; **********
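
Because the header names its own columns, a reader for this format can be written without hard-coding their order. The Python sketch below is inferred purely from the example above and is not based on any specification: it assumes that '; COL n name' lines declare the columns, that other ';' lines with a key and a value are metadata, and that '.' in the align_bp column means unpaired. Both function names are hypothetical.

```python
def parse_column_format(text):
    """Return (metadata, rows); each row is a dict keyed by column name."""
    metadata, columns, rows = {}, {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(";"):
            fields = line[1:].split()
            if len(fields) >= 3 and fields[0] == "COL" and fields[1].isdigit():
                columns[int(fields[1])] = fields[2]  # e.g. 4 -> 'align_bp'
            elif len(fields) >= 2:
                metadata[fields[0]] = " ".join(fields[1:])
            continue  # ruler lines such as '; ----------' fall through here
        values = line.split()
        if len(values) != len(columns):
            continue  # e.g. the lone '.' lines that abbreviate a listing
        # Assumption: columns are numbered 1..n without gaps.
        rows.append({columns[n + 1]: v for n, v in enumerate(values)})
    return metadata, rows

def column_pairs(rows):
    """Map each paired seqpos to the value in its align_bp column."""
    return {int(r["seqpos"]): int(r["align_bp"])
            for r in rows if r["align_bp"] != "."}
```

For a single sequence, as here, the align_bp values coincide with sequence positions; for a gapped alignment they presumably refer to alignment columns, which this sketch does not attempt to resolve.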


Stockholm Format 1:


# STOCKHOLM 1.0
#=GF AU    Infernal 0.55

DF6280             GCGGAUUUAGCUCAGUuGGGAGAGCGCCAGACUGAAGAUCUGGAGGUCCU
#=GC SS_cons       (((((((,,<<<<___.____>>>>,<<<<<_______>>>>>,,,,,<<
#=GC RF            gccgauaUagcgcAgU.GGuAgcgcgccacccUgucaagguggAGgUCcg

DF6280             GUGUUCGAUCCACAGAAUUCGCA
#=GC SS_cons       <<<_______>>>>>))))))):
#=GC RF            gggUUCGAUuccccguaucggcg
//

Stockholm Format 2:


# STOCKHOLM 1.0
#=GF ID    trna
#=GF DE    Taken from Sprinzl alignment of 1415 tRNAs [Steinberg93]

DF6280             GCGGAUUUAGCUCAGUUGGG.AGAGCGCCAGACUGAAGAUCUGGAGGUCC
DE6280             UCCGAUAUAGUGUAAC.GGCUAUCACAUCACGCUUUCACCGUGGAGA.CC
DD6280             UCCGUGAUAGUUUAAU.GGUCAGAAUGGGCGCUUGUCGCGUGCCAGA.UC
DC6280             GCUCGUAUGGCGCAGU.GGU.AGCGCAGCAGAUUGCAAAUCUGUUGGUCC
DA6280             GGGCACAUGGCGCAGUUGGU.AGCGCGCUUCCCUUGCAAGGAAGAGGUCA
#=GC SS_cons       <<<<<<<..<<<<.........>>>>.<<<<<.......>>>>>.....<
#=GC RF            xxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

DF6280             UGUGUUCGAUCCACAGAAUUCGCA
DE6280             GGGGUUCGACUCCCCGUAUCGGAG
DD6280             GGGGUUCAAUUCCCCGUCGCGGAG
DC6280             UUAGUUCGAUCCUGAGUGCGAGCU
DA6280             UCGGUUCGAUUCCGGUUGCGUCCA
#=GC SS_cons       <<<<.......>>>>>>>>>>>>.
#=GC RF            xxxxxxxxxxxxxxxxxxxxxxxx
//
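
Both Stockholm examples are interleaved, so a reader has to stitch the blocks back together before the structure line is of any use. The Python sketch below does that and then flattens the SS_cons mark-up into plain dot-bracket. It assumes the WUSS convention from the Infernal documentation, in which '<>', '()', '[]' and '{}' are base-pairs and every other symbol is unpaired; pseudo-knot letters are simply dropped to '.', and the function names are invented for illustration.

```python
def parse_stockholm(text):
    """Return (sequences, gc) with interleaved blocks concatenated by name."""
    seqs, gc = {}, {}
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line == "//":
            continue
        if line.startswith("#=GC"):
            _, tag, value = line.split(None, 2)
            gc[tag] = gc.get(tag, "") + value.strip()
        elif line.startswith("#"):
            continue  # header and #=GF lines are ignored in this sketch
        else:
            name, value = line.split(None, 1)
            seqs[name] = seqs.get(name, "") + value.strip()
    return seqs, gc

def wuss_to_dot_bracket(ss_cons):
    """Collapse all pair symbols to '(' and ')', everything else to '.'."""
    table = str.maketrans("<[{>]}", "((()))")
    return "".join(c if c in "()" else "." for c in ss_cons.translate(table))

print(wuss_to_dot_bracket("((,<<__>>,)):"))
# prints ((.((..)).)).
```

The result lines up with the alignment columns, gaps included, so it still has to be projected onto each sequence before being handed to a tool that expects an ungapped structure.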

RNA_align Format:

Below is the input format for the RNA_align package, which aligns structures. A sequence name is given, followed by the sequence in a pretty-numbered format, followed by stem numbers, coordinates of the exterior base-pairs, stem lengths, and energies. If one is going to invent such complicated formats it would be nice to also supply conversion tools such as dotb2align and ct2align (and vice versa); otherwise you make your program almost impossible to use.
Alcaligenes-eutrophus-pb-b
       1  AAAGCAGGCC AGGCAACCGC UGCCUGCACC GCAAGGUGCA GGGGGAGGAA
      51  AGUCCGGACU CCACAGGGCA GGGUGUUGGC UAACAGCCAU CCACGGCAAC
     101  GUGCGGAAUA GGGCCACAGA GACGAGUCUU GCCGCCGGGU UCGCCCGGCG
     151  GGAAGGGUGA AACGCGGUAA CCUCCACCUG GAGCAAUCCC AAAUAGGCAG
     201  GCGAUGAAGC GGCCCGCUGA GUCUGCGGGU AGGGAGCUGG AGCCGGCUGG
     251  UAACAGCCGG CCUAGAGGAA UGGUUGUCAC GCACCGUUUG CCGCAAGGCG
     301  GGCGGGGCGC ACAGAAUCCG GCUUAUCGGC CUGCUUUGCU U
>
 (    1)       1     337      10      -6.0
 (    2)      11     326       1      -7.9
 (    3)      12     278       7      -5.3
 (    4)      20      45       2      -8.6
 (    5)      23      42       8      -3.4
 (    6)      59     183       4      -1.3
 (    7)      71     179       5      -2.9
 (    8)      77      89       4      -2.1
 (    9)      91     105       1      -8.1
 (   10)      92     103       4     -11.7
 (   11)     106     174       2      -1.8
 (   12)     111     172       2      -8.8
 (   13)     127     156       4      -8.8
 (   14)     132     151       8      -2.1
 (   15)     187     235       4     -17.0
 (   16)     197     226       6     -17.0
 (   17)     206     220       5      -8.1
 (   18)     242     261       8      -6.9
 (   19)     281     308       2      -6.9
 (   20)     284     305       9      -3.7
>
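
In the absence of a supplied converter, the stem table can be expanded into base-pairs by hand. The Python sketch below assumes what the description above implies: each stem line gives the exterior pair (i, j) and a length, and the stem is an uninterrupted stack, so pair k of the stem is (i + k, j - k). That reading is inferred from this one example rather than from RNA_align documentation, and the function names are hypothetical.

```python
import re

# '(   n)' then exterior 5' position, exterior 3' position, length, energy.
STEM_LINE = re.compile(
    r"^\s*\(\s*(\d+)\)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+(?:\.\d+)?)\s*$"
)

def parse_stems(text):
    """Return a list of (i, j, length, energy) tuples, one per stem line."""
    stems = []
    for line in text.splitlines():
        match = STEM_LINE.match(line)
        if match:
            i, j, length = (int(match.group(n)) for n in (2, 3, 4))
            stems.append((i, j, length, float(match.group(5))))
    return stems

def stems_to_pairs(stems):
    """Expand each stem into individual pairs, assuming no bulges."""
    pairs = {}
    for i, j, length, _energy in stems:
        for k in range(length):
            pairs[i + k] = j - k
            pairs[j - k] = i + k
    return pairs

stems = parse_stems(" (    4)      20      45       2      -8.6\n")
print(stems_to_pairs(stems))
# prints {20: 45, 45: 20, 21: 44, 44: 21}
```

The resulting dictionary can then be written out in connect or dot-bracket form by any of the usual means.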

Genebee Format:

Much like the RNA_align format, 'Genebee' format reduces structures to stem components. Again, this is a nightmare to convert to standard formats for plotting structures.
Stem_ 1 with energy -15.200000 Kkal/mol
              *:  :
   2    7     GCGGGG
  51   46     CGCCCC

Stem_ 2 with energy -5.700000 Kkal/mol
              *
  14   16     AGC
  25   23     TCG

Stem_ 3 with energy -4.400000 Kkal/mol
              ::
  30   31     GG
  42   41     CC

Stem_ 4 with energy -4.000000 Kkal/mol
              *
  11   12     AG
  45   44     TC
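
A conversion is possible if one is prepared to guess at the layout. The Python sketch below assumes that, within each stem, the first numbered line is the 5' strand with ascending positions and the second is the 3' strand with descending positions, paired column by column. The marker line ('*', ':') is ignored because its meaning is not stated here. The function name is made up, and none of this is taken from Genebee documentation.

```python
import re

# Two integers followed by the residues of one strand of a stem.
STRAND = re.compile(r"^\s*(\d+)\s+(\d+)\s+([A-Za-z]+)\s*$")

def parse_genebee(text):
    """Return {position: partner} for every base-pair in the listed stems."""
    pairs = {}
    strands = []
    for line in text.splitlines():
        if line.lstrip().startswith("Stem_"):
            strands = []  # a new stem begins
            continue
        match = STRAND.match(line)
        if not match:
            continue  # blank lines and marker lines such as '*:  :'
        strands.append((int(match.group(1)), int(match.group(2))))
        if len(strands) == 2:
            (start5, end5), (start3, end3) = strands
            if end5 - start5 != start3 - end3:
                raise ValueError("strands of a stem differ in length")
            for k in range(end5 - start5 + 1):
                pairs[start5 + k] = start3 - k
                pairs[start3 - k] = start5 + k
            strands = []
    return pairs
```

Applied to Stem_ 2 above, this pairs 14 with 25, 15 with 24, and 16 with 23.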

comRNA Format:

This format is actually ambiguous! Because it does not distinguish between the 5' and 3' halves of a base-paired region, these structures cannot be parsed in an automated fashion.

M10740/1-73             1 GCGGAUUUAGCUCAGUU-GGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUG-UGUUCGAUCCACAGAAUUCGCA 75
                            aaaaa  bbbb         bbbb                                aaaaa
K00349/1-73             1 GCCGAAAUAGCUCAGUU-GGGAGAGCGUUAGACUGAAGAUCUAAAGGUCCCC-GGUUCAAUCCCGGGUUUCGGCA 75
                                   bbbb         bbbb
K00283/1-74             1 GGGCCGGUAGCUCAUUUAGGCAGAGCGUCUGACUCUUAAUCAGACGGUCGCG-UGUUCGAAUCGCGUCCGGCCCA 75
                           aaaaa a bbbb         bbbb                a aaaaa
K00354/1-74             1 CUCCGUGUAGCUCAGUUUGGUAGAGCGCCUGAUUUGGGAUCAGGAGGUCCAA-GGUUCAAAUCCUUGUAUGGAGA 75
                           aaaa    bbbb         bbbb        aaaa
X02682/1-73             1 GGGUGAUUAGCUCAGCU-GGGAGAGCACCUCCCUUACAAGGAGGGGGUCGGC-GGUUCGAUCCCGUCAUCACCCA 75
                                   bbbb         bbbb
J01624/1-73             1 GCGGGAAUAGCUCAGUU-GGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCG-AGUUCGAGUCUCGUUUCCCGCU 75
                          aaaa     bbbb         bbbb                     aaaa
M10268/1-73             1 GCUUCAGUAGCUCAGUA-GGAAGAGCGUCAGUCUCAUAAUCUGAAGGUCGAG-AGUUCGAACCUCUCCUGGAGCA 75
                           aaaaaa  bbbb         bbbb              aaaaaa
K00277/1-73             1 GCUGAUUUAGCUCAGUA-GGUAGAGCACCUCACUUGUAAUGAGGAUGUCGGC-GGUUCGAUUCCGUCAAUCAGCA 75
                          aaaaaa   bbbb         bbbb                    aaaaaa
K00335/1-72             1 GCUUUUAUAGCUUAGU--GGUAAAGCGAUAAAUUGAAGAUUUAUUUA-CAUGUAGUUCGAUUCUCAUUAAGGGCA 75
                             bbbbb bbbb         bbbb bbbbb
Y14521/1-73             1 CCGAACUUAGCUCAGUU-GGAAGAGCAUCGGACUGUAAAUCCGGUGGUCCCC-GGUUCGAACCCGGGAGUUCGGA 75
                           aaaaaa  bbbb         bbbb                           aaaaaa
K02511/1-74             1 GCCGCCUUAGCUCAUACUGGGAGAGCACUCGACUGAAGAUCGAGCUGUCCCC-GGUUCGAAUCCGGGAGGCGGCA 75
                                   bbbb         bbbb

Sequence Formats:

Mostly, researchers have kept to reasonable standards here, generally Fasta or ClustalW styles. However, there is one notable, completely undocumented exception (which, if you ask the authors about it, they apologise profusely for :).

Dynalign sequence Format:

EVERY single sequence you wish to align with dynalign must be kept in a separate file, in the format below. BE CAREFUL: if you forget to include the sequence name (for example), dynalign will happily produce some very odd structures for you.
;
;
;
M10740
GCGGAUUUAGCUCAGUUGGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUGUGUUCGAUCCACAGAAUUCGCA
1
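
Since a missing name fails silently, it is safer to generate these files than to type them. The Python sketch below simply mirrors the example above (three ';' lines, the name, the sequence, a closing '1') and refuses to write a file without a name. The function name is hypothetical, and whether all three ';' lines are actually required is not something this sketch claims to know.

```python
def format_dynalign_seq(name, sequence):
    """Return the text of a single-sequence file in the layout shown above."""
    name = name.strip()
    if not name:
        raise ValueError("a sequence name is required")
    sequence = "".join(sequence.split())  # drop spaces and line breaks
    # Three ';' lines, the name, the sequence, then the terminating '1'.
    return "\n".join([";", ";", ";", name, sequence, "1"]) + "\n"

print(format_dynalign_seq("M10740", "GCGGAUUUAGCUCAGUUGGGAGAGC"), end="")
```

Writing one such string per sequence to its own file gives dynalign the input it expects.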

ILM sequence Format:

The so-called MWM format. It is difficult to see any advantages over gapped fasta here.

M107:    GCGGAUUUAGCUCAGUU-GGGAGAGCGCCAGACUGAAGAUUUGGAGGUCCUG-UGUUCGAUCCACAGAAUUCGCA
K003:    GCCGAAAUAGCUCAGUU-GGGAGAGCGUUAGACUGAAGAUCUAAAGGUCCCC-GGUUCAAUCCCGGGUUUCGGCA
K002:    GGGCCGGUAGCUCAUUUAGGCAGAGCGUCUGACUCUUAAUCAGACGGUCGCG-UGUUCGAAUCGCGUCCGGCCCA
K003:    CUCCGUGUAGCUCAGUUUGGUAGAGCGCCUGAUUUGGGAUCAGGAGGUCCAA-GGUUCAAAUCCUUGUAUGGAGA
X026:    GGGUGAUUAGCUCAGCU-GGGAGAGCACCUCCCUUACAAGGAGGGGGUCGGC-GGUUCGAUCCCGUCAUCACCCA
J016:    GCGGGAAUAGCUCAGUU-GGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCG-AGUUCGAGUCUCGUUUCCCGCU
M102:    GCUUCAGUAGCUCAGUA-GGAAGAGCGUCAGUCUCAUAAUCUGAAGGUCGAG-AGUUCGAACCUCUCCUGGAGCA
K002:    GCUGAUUUAGCUCAGUA-GGUAGAGCACCUCACUUGUAAUGAGGAUGUCGGC-GGUUCGAUUCCGUCAAUCAGCA
K003:    GCUUUUAUAGCUUAGU--GGUAAAGCGAUAAAUUGAAGAUUUAUUUA-CAUGUAGUUCGAUUCUCAUUAAGGGCA
Y145:    CCGAACUUAGCUCAGUU-GGAAGAGCAUCGGACUGUAAAUCCGGUGGUCCCC-GGUUCGAACCCGGGAGUUCGGA
K025:    GCCGCCUUAGCUCAUACUGGGAGAGCACUCGACUGAAGAUCGAGCUGUCCCC-GGUUCGAAUCCGGGAGGCGGCA