The blog has been a bit quiet of late. I have been buried away in front of a computer and in the lab, trying to develop new markers for different sources of pollution in the UK. The work started with collecting a range of faecal samples (glamorous, I know) from sources around the North East of England. These sources include sheep, pig, cow, horse, dog, chicken and gull faeces and, of course, sewage.
Swabs from the faeces and sewage were used to grow Escherichia coli in the lab, and the genomes were sequenced by the Earlham Institute. Sequencing is still expensive, so it was important not to sequence identical E. coli from the same source. To make sure the genomes were different we used BOX-PCR fingerprinting. Bacterial genomes have sections of DNA that repeat from time to time throughout the genome. These can be used to make fingerprints (below), which I used to tell whether the E. coli strains were identical or not.
A sequenced E. coli genome is a whole lot of data, which can be downloaded. The data comes as lots of sequences of A’s, T’s, G’s and C’s of different lengths, and the next challenge is to put it all together. Imagine a puzzle where you need to fit together thousands of pieces, but some pieces are there multiple times, some bits might be missing and some bits are just wrong. Luckily, smarter people than me have come up with (and continue to come up with) a range of tools to help us figure the mess out without taking hundreds of years to do it.
Firstly, every single base (A, T, G or C) in the sequence comes with a quality score, which tells us how confident we can be that the base we see on the computer is the base that is really in the E. coli genome. So we can go through and throw out all the sequences that have poor quality. Next we can do one of two things: we can take an E. coli genome that has already been sequenced and assembled and use it as a template to help us with our puzzle, or we can just go ahead and try to complete the puzzle with no help! What you are going to do with the genomes once the puzzle is complete tells you how you should complete it. We wanted to look at the accessory genome, the part of the E. coli genome which is not the same in all E. coli, so we opted to complete the puzzle without a template. To do this we used SPAdes, a piece of software that takes all the good-quality pieces of the puzzle and arranges them into something that makes sense. Once SPAdes has worked its magic, we can check the genomes by comparing our completed puzzle with previously assembled E. coli to make sure they are a similar size and that there are no huge gaps in our puzzle.
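To give a feel for the quality score step, here is a minimal Python sketch of throwing out poor-quality sequences. It is not the tool we actually used. It assumes the common FASTQ convention where each quality character encodes a Phred score with an offset of 33, and the function names and the cutoff of 20 are made up for illustration. Filtering on the mean score is also a simplification of what real trimming software does.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding (an assumption, not stated in the post).
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def mean_quality(quality_line, offset=33):
    """Average Phred score across one read."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores)


def filter_reads(reads, min_mean_quality=20):
    """Keep (sequence, quality) pairs whose mean quality meets the cutoff.

    The cutoff of 20 is a hypothetical value chosen for this example.
    """
    return [
        (seq, qual)
        for seq, qual in reads
        if qual and mean_quality(qual) >= min_mean_quality
    ]


# Two toy reads: 'I' decodes to a score of 40, '#' decodes to a score of 2.
reads = [
    ("ACGTACGT", "IIIIIIII"),
    ("ACGTACGT", "########"),
]
kept = filter_reads(reads)
print(len(kept), "of", len(reads), "reads kept")
```

A score of 20 corresponds to a 1 in 100 chance that the base is wrong, which is why cutoffs in that region are a common starting point.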
That’s the easy bit done. Once the puzzle for each E. coli had been completed, I wanted to look for genes that were only found in E. coli that had come from humans. But since sequencing is expensive, I only had a small selection of all the possible E. coli found in sewage or the human gut. The National Center for Biotechnology Information (NCBI) is a great resource containing millions of genomes and genome sequences from a range of organisms. Researchers can submit their data to a large database such as this, allowing others to use it. So around 200 E. coli genomes (that’s about 280 MB of data) were downloaded from the NCBI database. There are over 1,000 E. coli genomes or part-genomes on the database, mostly from humans and cattle, but many do not have descriptions of where the E. coli came from, making them useless for our purpose.
I looked for genes that were present much more often in E. coli from humans than in E. coli from other animals which might cause pollution in the UK. This sounds much simpler than it is, but again, software to the rescue! We can relatively easily pick out the sections of DNA which we think code for proteins. Once these had been picked out, I isolated the bits of DNA which occurred most often in E. coli from humans. Of the thousands of coding regions only 100 or so were more prevalent in humans, and some of these were also prevalent in E. coli from other animals. 90% was the magic number I chose: if 90% of E. coli from non-human sources did not contain a section of DNA, it was good enough for me to explore further.
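The 90% rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the software I used: it assumes each genome has already been boiled down to a set of gene names, and the function names and toy gene names are invented for the example.

```python
def prevalence(gene, genomes):
    """Fraction of genomes (each a set of gene names) that contain the gene."""
    if not genomes:
        return 0.0
    return sum(gene in genome for genome in genomes) / len(genomes)


def candidate_markers(human, non_human, min_absent=0.9):
    """Pick genes that are more common in human genomes than in non-human
    ones and are missing from at least `min_absent` of non-human genomes."""
    genes = set().union(*human) if human else set()
    picked = []
    for gene in sorted(genes):
        in_human = prevalence(gene, human)
        in_other = prevalence(gene, non_human)
        if in_human > in_other and (1 - in_other) >= min_absent:
            picked.append(gene)
    return picked


# Toy data: geneA is only ever seen in human genomes, geneB turns up in
# half of the non-human genomes, geneC is never seen in humans.
human = [{"geneA", "geneB"}, {"geneA", "geneB"}, {"geneA"}]
non_human = [{"geneB"}, {"geneB"}, set(), {"geneC"}]

print(candidate_markers(human, non_human))
```

Here only geneA survives: geneB is more common in humans but is still present in half of the non-human genomes, so it fails the 90% test.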
The next problem was that different bacteria can share some of the same genes. The question was: are these sections of DNA, which we know are in a lot of human E. coli, also present in other bacteria which are not E. coli? To answer it we used a widely known and used algorithm called BLAST. This clever bit of code takes your piece of DNA (or RNA or protein) and slides it over lots of other pieces of DNA in a database (like the NCBI one), giving a score to represent how similar the sequence is to each little bit of DNA it comes across. The thing that makes BLAST so popular and widely used is that it is incredibly fast. So, taking all of our bits of E. coli DNA which are more prevalent in humans, we checked whether they had been found in any other organisms using BLAST. Any that were in other organisms were ruthlessly and unceremoniously binned. That left fewer than 20 sequences to pick from for the next phase.
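The binning step can be sketched in Python as well. This is a rough illustration rather than what I actually ran: it assumes the BLAST hits have been saved as tab-separated text with three columns (query name, percent identity, organism name), which is a hypothetical layout, and the 90% identity cutoff and marker names are invented for the example.

```python
import csv
import io


def off_target_queries(tsv_text, min_identity=90.0, target="Escherichia coli"):
    """Return the query names that have a strong hit in an organism other
    than the target. Column order is an assumption made for this sketch."""
    flagged = set()
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        query_id, identity, organism = row[0], float(row[1]), row[2]
        if identity >= min_identity and not organism.startswith(target):
            flagged.add(query_id)
    return flagged


# Toy hit table: marker2 has a strong hit outside E. coli, marker3 only
# has a weak one that falls below the identity cutoff.
hits = (
    "marker1\t99.1\tEscherichia coli K-12\n"
    "marker2\t98.4\tEscherichia coli O157:H7\n"
    "marker2\t95.0\tKlebsiella pneumoniae\n"
    "marker3\t72.0\tSalmonella enterica\n"
)
candidates = ["marker1", "marker2", "marker3"]

flagged = off_target_queries(hits)
kept = [marker for marker in candidates if marker not in flagged]
print(kept)
```

In this toy case marker2 is binned because of its Klebsiella hit, while marker1 and marker3 go through to the next phase.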
The next phase was in the lab, to test whether the DNA sections that the computer says are more prevalent in humans are actually found more often in human E. coli in the North East of England. Five human sections were chosen to be tested in the lab. There were many more possibilities, and anyone with the time and money would definitely find something useful among them. Since I was limited by both (as is life), five seemed a reasonable number to choose. To test these in the lab, a lot of E. coli from different faecal sources is needed. You will recall I delightfully sampled lots of poop and grew E. coli from these samples, so I now have an E. coli library, a poop library. To check the DNA sections we used PCR, amplifying the genes enough so that we can ‘see’ them. This whole process was also carried out for E. coli from the faeces of other animals.
So what have we learned so far? Finding human markers is pretty easy compared to finding non-human markers. This is probably because sewage provides a slurry of E. coli from a lot of humans in a way that sampling individual animal faeces simply does not. Nevertheless, a number of markers which may be useful have been identified. Some are still being tested in the lab, but I am excited to say that the successful markers will be used in the next case study to quantify how much human sources are contributing to E. coli in the Seaton catchment.