An update on Bioawk

Bioawk is a great project started by Heng Li some years ago. The aim was to take the awk source code and modify it slightly for use with common biological formats, adding in some new functions along the way. Heng doesn't accept many pull requests on the original, so to add in some features I maintain my own fork with a few improvements.

Long ago I added in a translate function, and recently I added a function that takes the attribute string from a GFF file and turns it into an awk array. It was always possible to split the GFF attribute string in awk itself, as the following awk code demonstrates:

function gffattr(attr, array,    a, b, i, n){
    # split the attribute string into key=value pairs
    n = split(attr, a, ";")
    for (i = 1; i <= n; i++) {
        # now split out the key and value and store them in the array
        split(a[i], b, "=")
        array[b[1]] = b[2]
    }
}

but this is a bit clunky and not suited to the kind of one-liner processing that awk excels at. So with the latest update of my bioawk fork this function is built in. If you want to filter a GFF on some attribute you can now do something like the following:

bioawk -c gff '{gffattr($attribute, a); if(a["some_attr"] == "some_value") print}'

Other random updates

The base awk source code has been getting a lot more love recently and my fork has so far merged in all of the upstream updates so there are a few new bug fixes in there.

Bioawk used to ignore the output field separator when printing the whole record — this is no longer the case in my fork.

Getting data from NCBI assembly using the accession number only

NCBI’s assembly database is a great one-stop-shop for genomic data and annotations but it’s actually kind of difficult to download data if you only know the accession number of an assembly. The documentation says that the assembly database is integrated with entrez-direct, a great set of command line utilities for accessing NCBI data from the command line. Most of the databases have an option to download data based on the ID, so I thought that something like the following would work

efetch -id GCA_000398025.1 -db assembly -format fasta

Alas it does not. However, the directory structure of the assembly FTP site nicely mirrors the accession numbers, so for the example above the path looks something like genomes/all/GCA/000/398/025/GCA_000398025.1_&lt;assembly name&gt;/ : the accession's digits are split into chunks of three, followed by a directory named after the full accession and the assembly name.

Below is a little awk script that I wrote to generate the FTP path for input into wget. The input for this script is a file containing the accession in the first column and, optionally, a second column containing the assembly name. The full FTP path contains both of these; I had hoped to get around a missing assembly name by using wildcards, but no, you need both the accession and the assembly name for this snippet to work.

But I wrote a more expansive script in python that does it with just the accession number; you should probably use that instead.
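The path-building logic itself is simple enough to sketch in a few lines of python (accession_to_path is just an illustrative name, not a function from either script):

```python
def accession_to_path(accession):
    """Turn an assembly accession into its FTP directory prefix."""
    prefix, digits = accession.split("_")   # e.g. "GCA", "000398025.1"
    digits = digits.split(".")[0]           # drop the version suffix
    chunks = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    return "/".join([prefix] + chunks)

print(accession_to_path("GCA_000398025.1"))  # GCA/000/398/025
```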

{
  accession = $1;
  joined_acc = accession;
  if (NF == 2) {
    joined_acc = joined_acc "_" $2;
  } else {
    joined_acc = joined_acc "*";
  }
  split($1, a, "_");
  # strip the version suffix from the digit part of the accession
  sub(/\.[[:digit:]]+$/, "", a[2]);
  # insert a "/" after every three digits: 000398025 -> 000/398/025/
  gsub(/.{3}/, "&/", a[2]);
  print accession "\t" a[1] "/" a[2] joined_acc "/" joined_acc "_genomic.fna.gz";
}

Creating huge metabolic overviews for comparative genomics

I love looking at KEGG maps and using them to understand an organism's metabolism, but they have their limitations. For starters, you're obviously stuck with how they are drawn, which in most cases includes many variations on a particular pathway. Secondly, the tools for mapping your own genes onto a pathway are limited to one organism at a time.

What I really wanted was a way of quickly comparing tens of genomes across a common set of pathways, in this case amino acid synthesis pathways. Failing to find a suitable way to map many genomes onto KEGG's pathway images, I decided to roll my own: manually drawing the pathways, then using code to color in the presence/absence of each reaction in each organism.

My solution starts by creating a template svg image in Illustrator. This is the laborious part as I went through and redrew the amino acid pathways using information from both KEGG and MetaCyc. For each enzyme I positioned a grid of boxes, where each box represents an individual genome, that I would eventually color in based on the presence/absence of that reaction in that organism.

template pathway part
The TCA cycle and parts of the amino acid synthesis pathways. Each reaction is labelled with the EC number and contains a grid of 100 boxes.

Now the trick is to group the parts of the enzyme together and give that group a name in Illustrator. The grid of 100 boxes was grouped first per row, and then those 10 rows were grouped together (this is important and I'll explain why below). Then the boxes, the arrow and the text label all got grouped together. Next, name both the larger group and the group of boxes in Illustrator. You can do this easily by double clicking the group to enter "isolation mode", where everything that isn't in the group gets faded, and then changing the name in the layers panel.

enzyme group hilight
When in isolation mode, the layers panel gives easy access to changing the name of the group. In this case I’m calling the group after the EC number.

Now comes the critical part of the naming: the group containing only the 100 boxes must start with "box-", and whatever follows will be the identifier that gets used for coloring the boxes.

box hilight
Each enzyme has a sub-group containing the matrix of 100 boxes. This group's name must start with "box-", and whatever follows will be used to look up that reaction for each organism.

In the case above I'm using the EC number again as my identifier. It is whatever is written after the "box-" in the name of the group that matters, which means you could use any kind of label in the figure; for simplicity I've kept both the visual label of the enzyme and the name of the group as the EC number.

Now, the reason why this is important is that we are going to save our figure as an SVG file. SVG is an XML-based file format, which means that it is simply text and thus easy to modify with a bit of scripting. Let's have a look at how each enzyme gets encoded in SVG:

SVG of part of one enzyme. The "g" tag in SVG is a group. You can see that the names for the whole enzyme and for the group of 100 boxes are retained from Illustrator.

The names that were given to the groups in Illustrator have been retained in the svg file as the id tags, and it is this text that we’ll process using the script. The other thing to note is how the boxes are arranged, in this case as groups of 10 – these are the rows that were made originally.
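Schematically the nesting looks something like the following (the ids, coordinates and sizes here are made up for illustration; a real Illustrator export carries extra attributes as well):

```xml
<g id="enzyme-1.1.1.1">      <!-- the whole enzyme: label, arrow and boxes -->
  <g id="box-1.1.1.1">       <!-- the 100 boxes; the "box-" prefix is what matters -->
    <g>                      <!-- a row of 10 boxes -->
      <rect x="0" y="0" width="5" height="5"/>
      <rect x="6" y="0" width="5" height="5"/>
      <!-- ...eight more rects... -->
    </g>
    <!-- ...nine more rows... -->
  </g>
</g>
```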

Coloring the boxes programmatically

Now for the scripted part. You’ll need the code available here, which will color in the boxes based on a couple of text files. Run the script as follows:

python -e ec_numbers.csv -i template.svg -o colored.svg -c organism_colors.csv

The ec_numbers.csv file must contain three columns and must have a header row labelled organism, ec, lineage. The organism column is the name of the organism. The ec column must contain the identifiers that match up to the names used in the svg file: if there is a group called "box-&lt;identifier&gt;" in the svg, then the corresponding annotation in the ec column should be "&lt;identifier&gt;". There isn't a requirement that this column contain EC numbers, or even that the identifiers all come from the same source; you could combine EC numbers, KO numbers and Pfam families in this column, they just have to match what's written in the svg file. The lineage column should contain information on how to order all of the genomes into the 100 boxes. I use the full taxonomy string of the organism to allow similar organisms to be grouped together. However, this column doesn't require a taxonomy string and is simply used for lexicographic sorting, so you could put anything in there to sort organisms however you want.

The organism_colors.csv file should contain two columns with a header row labelled organism and color. The organism column is the organism name that was used in the ec_numbers.csv file. The color column should contain a valid css color.
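As a made-up example (the organism names, identifiers and lineages are purely illustrative), ec_numbers.csv might look like:

```csv
organism,ec,lineage
Organism A,1.1.1.1,Bacteria;Proteobacteria;Organism A
Organism B,1.1.1.1,Archaea;Euryarchaeota;Organism B
```

organism_colors.csv follows the same pattern, with an organism column matching the names above and a color column holding a CSS color such as #1b9e77 or steelblue.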


Formatting scientific names in R

Here is a function that will take a character string in R and return an expression for fancy formatting in plots, properly italicizing scientific names. The syntax for doing this is truly quite horrible, but this is how R does it.

scientific_name_formatter <- function(raw_name) {
    # strsplit returns a list but we are passing in only
    # one name so we just take the first element of the list
    words = strsplit(raw_name, ' ', fixed = TRUE)[[1]]
    if (length(words) < 2) {
        # some sort of acronym or bin name, leave it alone
        return(raw_name)
    } else if (length(words) > 2) {
        if (words[1] == 'Candidatus') {
            # for Candidatus names, only the Candidatus part is italicised;
            # shorten it to "Ca." for brevity
            unitalic <- paste(words[2:length(words)], collapse = ' ')
            return(bquote(paste(italic(Ca.) ~ .(unitalic))))
        } else if (grepl('^[A-Z]+$', words[1])) {
            # if the first word is in all caps then it is an abbreviation,
            # so we don't want to italicize anything
            return(raw_name)
        } else {
            # assume that everything after the second word is a strain name,
            # which should not get italicised
            unitalic <- paste(words[3:length(words)], collapse = ' ')
            return(bquote(paste(italic(.(words[1])) ~ italic(.(words[2])) ~ .(unitalic))))
        }
    } else {
        return(bquote(paste(italic(.(words[1])) ~ italic(.(words[2])))))
    }
}

I use it like the following in ggplot by setting the labels in scale_y_discrete.

scale_y_discrete(limits = df$organism_name, labels = sapply(df$organism_name, scientific_name_formatter ))

split_by: splitting files on the command line based on content

Unix has so many great ways to perform text manipulation, but one niche which hasn't been filled is splitting a tabular file into pieces based on the contents of certain columns. There are two commands, split and csplit, that play a similar role: split can break a file into chunks of a certain number of bytes or lines, while csplit uses a regular expression to determine where to split the file. Often neither of these tools is a good fit for my purposes, and what I really want is an equivalent to the "group by" clause in SQL databases. Group by collects rows based on certain grouping columns so that you can then perform summaries on each group; split_by splits out a file based on the grouping columns for further processing on each chunk.

For example, say I’ve got a delimited text file containing a mix of categorical and numerical data like the following:

AU BR 447 223.2 55958 US
AU FR 348 16.8 58484 AU
US UK 12 53.30 129 PG

I want to split this file into smaller files based on what's in the first column, which in this example would make two files: one for the two lines containing "AU" and one for the line containing "US". Neither split nor csplit really handles this case, and the only option is to use an awk, perl or other script. (It's also possible to do this in multiple passes with grep.)
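For a one-off job the whole thing fits in an awk one-liner; here is a sketch that splits a tab-separated file on its first column (the file names are illustrative):

```shell
# create a sample input file shaped like the example above
printf 'AU\tBR\t447\nAU\tFR\t348\nUS\tUK\t12\n' > file.tsv
# append each line to split_<first column>.tsv
awk -F'\t' '{ print > ("split_" $1 ".tsv") }' file.tsv
```

This produces split_AU.tsv with two lines and split_US.tsv with one.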

After writing similar awk one-liners to do this sort of thing I decided to make it a bit more reusable:

#!/usr/bin/awk -f
BEGIN {
    n = split(cols, group_by, ",")
    col_sep = "_"
}
# skip the header line when header=1 is given
header == 1 && NR == 1 { next }
{
    out_file = ""
    for (i = 1; i <= n; i++) {
        if (i == 1) {
            out_file = $(group_by[i])
        } else {
            out_file = out_file col_sep $(group_by[i])
        }
    }
    out_file = (prefix out_file suffix)
    print $0 > out_file
}

That’s the whole program! You can run it like the following, taking advantage of the variables defined in the script:

./split_by -vFS="\t" -vprefix=test_split_ -vsuffix=".tsv" -vcols=1 -vheader=1 file.tsv

This will produce two files on the example above called “test_split_AU.tsv” and “test_split_US.tsv”. Only the cols argument is needed to say which of the columns is used to split on. If you want to split on multiple columns pass a comma separated list -vcols=1,2. The output file names are generated based on the contents of those columns so if there are any special characters in there that might mess up a file name, you’re in trouble. The prefix and suffix variables default to nothing but are useful for being able to find the files for later by giving them all a common prefix. The nice thing about this is that all of the awk variables are still available so changing between a csv and tsv file can be achieved using the FS parameter on the command line.

Downloading NCBI genomes from a given taxonomy

The Entrez Direct toolkit is great for programmatic access to all of NCBI's resources. This little snippet below finds all of the refseq representative genomes for a given NCBI taxonomy ID, makes a little summary of the genomes downloaded and uses wget to download the genbank files from the Assembly FTP. Change the initial query in the first call to esearch to change what genomes are downloaded.

esearch -db assembly -query "txid28890[Organism:exp] AND representative [PROP]" |\
 efetch -format docsum |\
 xtract -pattern DocumentSummary -element \
    AssemblyAccession SpeciesTaxid SpeciesName FtpPath_RefSeq |\
 sed 's/,.*//' |\
 sort -k 3,3 |\
 tee downloaded_genomes.tsv |\
 cut -f 4 |\
 sed -e 's/$/\/*genomic.gbff.gz/' |\
 wget -i /dev/stdin

Python ete3: formatting organism names the way I want

I feel like I'm on a life-long quest to make all of my phylogenetic tree figures completely programmatically. The best tool I've found for making them is the ete library for the python programming language. I've already figured out how to get trees drawn in the style that I like, but there was still one thing left to do: making organism names italicize correctly.

I work with microorganisms where the convention is for the genus and species names to be italicized but the strain name to be in regular font. For example: Methanoregula formicica SMSP, DSM 22288. There are exceptions to this as well; if an organism is not in pure culture then the name Candidatus is prepended and any genus and species names remain in regular format. Unfortunately, ete doesn’t have an interface for doing this kind of name formatting, you can either have it italic, bold or regular, but you can’t mix and match. Happily though it does contain an interface for dropping down into the lower-level drawing engine (pyQt4), which enables you to do all sorts of custom things.

Below is my solution to making fancy formatted organism names. I think it might be a little hacky but seems to work. The organism name should be in the “name” attribute of the leaf for this to work. (See the example below)

Here are the formatting rules:

  1. If the name has less than 2 words: do nothing. This is to catch internal names or accession numbers that are not valid scientific names
  2. If the first word of the name is “Candidatus”: abbreviate it to “Ca.” and italicize just that part.
  3. If there are more than 2 words: italicize the first two.
  4. If there are exactly 2 words: italicize both
from ete3 import Tree, faces, TreeStyle
from PyQt4 import QtCore
from PyQt4.QtGui import QGraphicsRectItem, QGraphicsTextItem, QPen
import re

def scientific_name_face(node, *args, **kwargs):
    scientific_name_text = QGraphicsTextItem()
    words = node.name.split()
    text = []
    if len(words) < 2:
        # some sort of acronym or bin name, leave it alone
        text = words
    elif len(words) > 2:
        if words[0] == 'Candidatus':
            # for Candidatus names, only the Candidatus part is italicised;
            # shorten it to "Ca." for brevity
            text = ['<i>Ca.</i>'] + words[1:]
        elif re.match(r'^[A-Z]+$', words[0]):
            # if the first word is in all caps then it is an abbreviation,
            # so we don't want to italicize anything
            text = words
        else:
            # assume that everything after the second word is a strain name,
            # which should not get italicized
            text = ['<i>{}</i>'.format(w) for w in words[:2]] + words[2:]
    else:
        text = ['<i>{}</i>'.format(w) for w in words]

    scientific_name_text.setHtml(' '.join(text))

    # below is a bit of a hack - I've found that the height of the bounding
    # box gives a bit too much padding around the name, so I just minus 10
    # from the height and recenter it. Don't know whether this is a generally
    # applicable number to use
    masterItem = QGraphicsRectItem(0, 0,
        scientific_name_text.boundingRect().width(),
        scientific_name_text.boundingRect().height() - 10)

    scientific_name_text.setParentItem(masterItem)
    center = masterItem.boundingRect().center()
    scientific_name_text.setPos(0,
        center.y() - scientific_name_text.boundingRect().height()/2)

    # I don't want a border around the masterItem
    masterItem.setPen(QPen(QtCore.Qt.NoPen))

    return masterItem

def master_layout(node):
    if node.is_leaf():
        F = faces.DynamicItemFace(scientific_name_face)
        faces.add_face_to_node(F, node, 0)

if __name__ == '__main__':
    t = Tree()
    # this is fake data to show the rendering
    t.populate(4)
    leaves = t.get_leaves()
    # give the leaves some different types of names
    leaves[0].name = 'Methanosarcina barkeri'
    leaves[1].name = 'Candidatus Methanoperedens nitroreducens'
    leaves[2].name = 'ANME_bin_23'
    leaves[3].name = 'Methanosaeta thermophila PT'

    ts = TreeStyle()
    ts.show_leaf_name = False
    ts.layout_fn = master_layout
    t.show(tree_style=ts)

My guide to annotating proteins and pathways

So. You’ve got yourself a nice new genome sequence and you want to know what kind of metabolism it has. There is a good chance that you have some idea already — you think it’s a nitrogen fixer or a sulfate reducer etc. — based on other analyses and now it’s time to strengthen your paper with a bit of genomic evidence.

Getting an initial annotation

The vast majority of the genes in the genome are going to be hypothetical proteins; of the rest, a sizable chunk are going to be genes with a general sort of annotation like "ABC transporter" (which says nothing about what it's transporting), and the rest are going to be metabolic genes that you probably care about. The first thing you want to do is throw your genome into one of the automatic annotation pipelines (Prokka, RAST, IMG, NCBI, KAAS) to get an initial annotation. You can then search through these to look for pathways of interest.

However, don't just believe what these pipelines say outright: they will definitely miss some genes and report annotations that aren't correct. They are a good start, though, since in most cases you will only be interested in a portion of the genes.

The most important thing to remember is to be a scientist. Have a sceptical mindset and interrogate your data. You can’t look at every gene (well you can but probably don’t have time) so my rule of thumb is that you should manually check all of the genes in all of the pathways that you are going to write about in your paper.

My Workflow

1. Compare the sequence of your unknown gene against Uniprot. 

You can also use the NCBI blast database, IMG, or any other database you wish; however, I like Uniprot because it ranks its annotations with an overall confidence score, so you know what kind of experiments have informed the annotation. Uniprot will show some starred proteins that are biochemically verified; this is a good way to figure out if your gene is annotated with the correct function.

Remember that most genomes are automatically annotated which means that even if all of the top blast hits are annotated as the same thing, it doesn’t mean that it’s correct. Comparing it against genes that have been biochemically annotated is important.

Also remember that blast is a local aligner, so don't just look at the e-value: look at the percentage of the query (your gene) and the subject (the gene in the database) that actually aligned. If you have a very long gene and only part of it matches then you should definitely investigate further!

2. Find conserved domains using Interpro

Sometimes (read: often) blast matches will not be conclusive or will return only a general annotation. In these situations it's best to look at the protein sequence another way; looking at the domains that make up the protein using Interpro may help figure out what it does. This is also a good way to compare your protein to a biochemically analysed protein. If your gene has extra domains, different domains, or fewer domains then it may well be that the blast annotation was incorrect.

3. Look at surrounding genes

A lot of enzymes are encoded by multiple genes which all exist in one operon. If you want to say that your bug has some enzyme you should make sure that it contains all of the subunits for that enzyme. This can be an essential step for enzymes that evolved from a common ancestor, where individual subunits can be difficult to distinguish. A good example here is NADH dehydrogenase and the evolutionarily related hydrogenases: the two are easily distinguished by their operon structure, but individual subunits often get mis-annotated.

4. Make an alignment (optional)

The last thing you might need to do is make an alignment to see if your putative enzyme has all (or some) of the same conserved residues as other biochemically characterised proteins. If the alignment looks terrible then perhaps you've got something novel.

Messing around with Phyloxml in ete3

ete3 has support for phyloxml, which I use with the archaeopteryx tree viewer for a lot of my day-to-day phylogenetics visualisation. My main reason for using phyloxml is convenience: I have a script that will easily add the proper organism name onto the tree, and I think that archaeopteryx is a really good basic tree viewer. I wanted to draw a tree from phyloxml in ete using my own style and have the proper organism name rendered. In my phyloxml file the name is coded as the scientific name for each leaf (see below for a phyloxml snippet), so all I needed to do was make this the node name when rendering the tree.

    <scientific_name>Desulfobacterium anilini DSM 4660</scientific_name>


Easy, right?


I found that the interface for phyloxml was not the same as for newick formatted trees, and unfortunately the documentation for phyloxml in ete3 is a bit lacking, as there isn't a complete listing of methods for each class. After much messing around, looking at the source code of ete3 and examining python objects using the builtin dir function, I was able to get what I wanted.

It turns out that for each node/leaf I needed to access the phyloxml_clade attribute, which has an attribute taxonomy, which implements an iterable interface (I think it's probably a list), which I could then use to access the scientific name and make it the name of the leaf for printing. It's a little convoluted but easy when you know how.

from ete3 import Phyloxml
project = Phyloxml()
# load the phyloxml file (the file name here is just a placeholder)
project.build_from_file("my_tree.xml")

# iterate through the trees in the phyloxml file
for tree in project.get_phylogeny():
    # go through the nodes in the tree
    for node in tree:
        # assign the node name from the data in the phyloxml file
        node.name = node.phyloxml_clade.taxonomy[0].get_scientific_name()

Going from a messy supplementary table to good clean data


Get ready for some advanced file copying!

I recently had to clean up some data from the supplementary material of Pereira et al. 2011, which is a very nice table of manually annotated genes in sulfate reducing bacteria. The only problem is that the table is designed for maximum human readability, which made it a real pain to parse out the data. I decided to use R and the "hadleyverse" packages to clean up the table to make things work better for downstream analyses. This isn't part of my normal workflow, I'm more of a python guy, but after doing this analysis in R I'd have to say that I'm a convert.

Getting the data

Download the supplementary material from the paper linked above, if you want to play along at home. This data is a nicely formatted excel workbook containing eight spreadsheets with the locus identifiers for a number of genes important in sulfate reducing prokaryotes. While this data is nice and visually appealing, it is not tidy and it's difficult to get the information I want out of it.

I want the locus identifiers for the genes of interest so that I can download them from NCBI and use them as a blast database.

Cleaning the data

Some things are just easier to do in excel before tidying the data in R; here is what I did:

  1. Removed the empty columns and rows at the beginning. This actually isn't difficult to do in R, but doing it first makes inputting the data more painless because then R will pick up the column names.
  2. Removed the rows that contain just the taxonomic information.
  3. For some of the sheets ('Hases', for example) I removed rows at the beginning that gave hierarchy to the columns. These are mostly unnecessary and make it difficult to parse the excel sheet, as readxl does not currently handle merged cells and because the boundaries of this hierarchy are coded visually using cell borders in excel.
  4. For some reason there were single quotation marks in the Archaeoglobus fulgidus DsrK locus identifiers, which I removed.

Open up an R session and load the following libraries (assuming that you already have them installed); these are the ones used below:

library(readxl)
library(dplyr)
library(tidyr)
library(stringr)

Import the data into R using readxl. This creates a list of data frames.

d <- lapply(excel_sheets("~/Downloads/data_sheet_2.xls"),
            read_excel, path = "~/Downloads/data_sheet_2.xls")

remove the completely empty rows

d <- lapply(d,
            function(n) n[rowSums( != ncol(n), ])

Let's look at what our table looks like (note: the 'organism' column is not shown for brevity).

locus SAT AprA QmoA DsrA DsrC H-Ppi FdxI FdxII
AF 1667 1670 0663 0423 2228 NA 00427; 1010; 0355; 0923; 2142; 0166; 1700; 0156; 0464 0167
Arcpr_ 1264 1261 1260 0139 1726 NA 0142; 0625; 0483; 0712; 1058 NA
Cmaq_ 0274 0273 NA 0853 0856 0949 0549; 1711 NA
DaesDRAFT_ 2031 2029 2028 2438 0796 NA 1729 0903
Dde_ 2265 1110 1111 0526 0762 NA 3775 0286
Ddes_ 0454 2129 2127 2275 1917 NA 889 NA
DMR_ 39470 05400 05410 03600 15890 NA 39570; 18760 13970
DESPIG_ 02241 02773 02771 NA 02353 NA 00991 NA
Desal_ 0228 0230 0231 0787 0984 NA 1299 0241; 2850
DFW101DRAFT_ 0832 1162 1163 3451 2958 NA 0847 0729
DVU 1295 0847 0848 0402 2776 NA 3276 NA
Dbac_ 3196 3198 3199 0279 2958 NA 0275 2977
Dalk_ 2445 1569 1568 4301 4140 3373 4380; 2230; 2714 2374
HRM2_ 31180 04510 04500 42400 22050 NA 26720; 10680; 01580; 39570 40690
Dole_ 1002 0998 0999 2669 0463 2820 1168 2655
Dret_ 1968 1966 1965 0244 1739 NA 0240 0154; 0169
DthioDRAFT_ 1410 1407 1406 2272 2675 NA 2268 NA
DP 1472 1105 1106 0797 0997 NA 2755; 0775 1865
DaAHT2_ 0293 1471 1470 0296 2041 NA 1668 2532; 2287
Sfum_ 1046 1048 1287 4042 4045 3037 4046 2933; 3217
Dtox_ 3579 3577 3576 0079 0077 3931 0074; 0532; 1221; 1608; 3208 1637
Dred_ 0635 0637 0638 3187 3197 2985 3200; 2937; 0466 2203
Daud_ 1076 NA 1884 2201 2190 0308 2193; 1963 1080
Adeg_ 1712 1080 1079 2094 0035 NA 0032 1625
THEYE_ A1835 A1832 A1831 A1994 A0003 NA A0789 NA

In the data, the columns for each gene are really values, not variables; they should be converted into individual rows. To do this use the gather function from tidyr. Here I specify the names of the new columns: gene.identifier, which will contain the name of the gene, and locus.identifier, which will contain the information for that gene. I'm also setting na.rm = TRUE, which drops genes that were not found in an organism. After the gather function is applied, all of the data frames in the list have the same columns, which means that they can all be concatenated into one big data frame. To do this I'm using dplyr::bind_rows.

d <- lapply(d, function(n){ n %>%
    gather(gene.identifier, locus.identifier,
           -c(organism, locus), na.rm = TRUE)})
d <- bind_rows(d)
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
organism locus gene.identifier locus.identifier
Archaeoglobus fulgidus AF FdxI 00427; 1010; 0355; 0923; 2142; 0166; 1700; 0156; 0464
Archaeoglobus profundus Arcpr_ FdxI 0142; 0625; 0483; 0712; 1058
Caldivirga maquilingensis Cmaq_ FdxI 0549; 1711
Desulfovibrio aespoeensis DaesDRAFT_ FdxI 1729
Desulfovibrio desulfuricans G20 Dde_ FdxI 3775
Desulfovibrio desulfuricans ATCC 27774 Ddes_ FdxI 889
Desulfovibrio magneticus RS-1 DMR_ FdxI 39570; 18760
Desulfovibrio piger DESPIG_ FdxI 00991
Desulfovibrio salexigens Desal_ FdxI 1299
Desulfovibrio sp. FW1012B DFW101DRAFT_ FdxI 0847
Desulfovibrio vulgaris Hildenborough DVU FdxI 3276

The other untidy aspect of the dataset is that there are multiple locus identifiers for some of the genes (presumably because there are multiple copies in the genome). We next need to split them out into separate observations (rows). The str_split function from stringr will split a string based on a regular expression and return a list of values. I then pass this to the unnest function, which will "flatten" the list into multiple rows as required.

d %>% mutate(locus.identifier =
        str_split(as.character(locus.identifier), "; |\\/")) %>%
    unnest(locus.identifier) -> d
organism locus gene.identifier locus.identifier
Archaeoglobus fulgidus AF FdxI 00427
Archaeoglobus fulgidus AF FdxI 1010
Archaeoglobus fulgidus AF FdxI 0355
Archaeoglobus fulgidus AF FdxI 0923
Archaeoglobus fulgidus AF FdxI 2142
Archaeoglobus fulgidus AF FdxI 0166
Archaeoglobus fulgidus AF FdxI 1700
Archaeoglobus fulgidus AF FdxI 0156
Archaeoglobus fulgidus AF FdxI 0464
Archaeoglobus profundus Arcpr_ FdxI 0142
Archaeoglobus profundus Arcpr_ FdxI 0625

Now for the final clean-up. The original data separated the locus prefix and the locus identifier; now I want to combine them back together using a couple of calls to the mutate function, which modifies a column. First, in the previous command I converted the locus.identifier column to characters, which has the unwanted side effect of leaving decimal places in some of the strings; passing the locus.identifier column through the sub function removes the unwanted text. The next mutate call combines the locus and locus.identifier columns into one, and finally I choose which columns I want in the final data frame using the select function.

d %>% mutate(locus.identifier =
        sub("\\.0+", "", locus.identifier, perl = TRUE)) %>%
    mutate(locus = paste0(locus, locus.identifier)) %>%
    select(organism, locus, gene.identifier) -> d
organism locus gene.identifier
Archaeoglobus fulgidus AF1667 SAT
Archaeoglobus profundus Arcpr_1264 SAT
Caldivirga maquilingensis Cmaq_0274 SAT
Desulfovibrio aespoeensis DaesDRAFT_2031 SAT
Desulfovibrio desulfuricans G20 Dde_2265 SAT
Desulfovibrio desulfuricans ATCC 27774 Ddes_0454 SAT
Desulfovibrio magneticus RS-1 DMR_39470 SAT
Desulfovibrio piger DESPIG_02241 SAT
Desulfovibrio salexigens Desal_0228 SAT
Desulfovibrio sp. FW1012B DFW101DRAFT_0832 SAT