Cleaning the Raw Sequencing Reads

This section will start with the raw sequencing data and perform a series a cleaning steps to prepare the sequences for the genome assembly. The various steps include:

  1. Process mate-pair (MP) reads, separating into proper MP and paired-end (PE) reads
  2. Filtering low-quality reads, Trimming low-quality bases, adapter identification and removal
    • Program: fastp for paired-end reads
  3. Process PE reads into overlapping single-end (SE) reads
    • Program: pear
    • Alternatively, one can use flash
  4. Removing identical read pairs
    • Program: clumpify - part of the BBTools package
    • Alternatively, one can use fastuniq, but this is only for PE reads (not SE).
  5. Removing reads that map conclusively to the American Shad mitochondrial genome
    • A mitogenome is already available, so we want to minimize their presence
    • Program bowtie2 for the mapping
  6. Kmer counting and Error-correcting the sequencing reads

Sometimes the code below only shows the code for a single run, and runs may be repeated for different files. For reference to the amount of resources required, see the accompanying .sh scripts in the Data folder.

Raw Data Summary:

Name Type Insert Size # Paired Reads # Bases Q20 bases Q30 Bases
PE500 151 bp; Paired-end TBD 313,465,270 94,666,511,540 88.2% 81.4%
MP5k 151 bp; Mate-pair 5-7 kb 226,692,460 68,461,122,920 93.1% 85.3%
MP10k 151 bp; Mate-pair 10-12 kb 223,358,676 67,454,320,152 92.8% 85.0%
Total n/a n/a 763,516,406 230,581,954,612 93.0% 85.5%

Step 1: Separate the MP reads into MP and PE reads

Here the software NxTrim v0.4.3-6eb8d5e was used to separate the MP reads. NxTrim removes the Nextera Mate Pair junction adapters and categorizes reads according to the orientation implied by the adapter location. The publication can be found here:
O’Connell J, Schulz-Trieglaff O, Carlson E, Hims MH, Gormley NA, Cox AJ. (2015) NxTrim: optimized trimming of Illumina mate pair reads. Bioinformatics 31(12):2035–2037. https://doi.org/10.1093/bioinformatics/btv057

Installation:

# Install NxTrim using git
git clone https://github.com/sequencing/NxTrim.git
cd NxTrim
make

Run NxTrim
An example run is shown below for the 5-7kb insert library, please see the scripts nxtrim_MP5k.sh, nxtrim_MP10k.sh, for more details on Job information.

# Assign read file names
fwd="5-7kb_S16_L002_R1_001.fastq.gz"
rev="5-7kb_S16_L002_R2_001.fastq.gz"
name="MP5k"

# Run nxtrim
nxtrim \
   -1 ${fwd} \
   -2 ${rev} \
   -O ${name} \
   --rf \
   --separate \
   -a \
   -l 50

Parameters Explained:

Output Summary

MP5k Reads

Trimming summary:
226692460 / 226692460	( 100.00% )	reads passed chastity/purity filters.
15085 / 226692460	( 0.01% )	reads had multiple copies of adapter (filtered).
970809 / 226677375	( 0.43% )	read pairs were ignored because template length appeared less than read length
225706566 remaining reads were trimmed
72226739 / 225706566	( 32.00% )	read pairs had MP orientation
87847026 / 225706566	( 38.92% )	read pairs had PE orientation
60434557 / 225706566	( 26.78% )	read pairs had unknown orientation
5198244 / 225706566	( 2.30% )	were single end reads
12147103 / 225706566	( 5.38% )	extra single end reads were generated from overhangs

MP10k Reads

Trimming summary:
223358676 / 223358676	( 100.00% )	reads passed chastity/purity filters.
16545 / 223358676	( 0.01% )	reads had multiple copies of adapter (filtered).
1000265 / 223342131	( 0.45% )	read pairs were ignored because template length appeared less than read length
222341866 remaining reads were trimmed
70317539 / 222341866	( 31.63% )	read pairs had MP orientation
92595412 / 222341866	( 41.65% )	read pairs had PE orientation
52979519 / 222341866	( 23.83% )	read pairs had unknown orientation
6449396 / 222341866	( 2.90% )	were single end reads
11670353 / 222341866	( 5.25% )	extra single end reads were generated from overhangs

Step 2: Read trimming and filtering

Here the new software fastp v0.19.6 was used to trim all (PE, MP, SE) reads. It combines a QC (Similar to FastQC) along with various trimming and filtering functions. The publication can be found here:
Chen S, Zhou Y, Chen Y, Gu (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34(17):i884–i890. https://doi.org/10.1093/bioinformatics/bty560

Installation:

# Install fastp using git
git clone https://github.com/OpenGene/fastp.git
cd fastp
make

Run fastp
An example run is shown below, please see the scripts trimPE500.sh, trimMP5k.sh, and trimMP10k.sh for more details on Job information.

# Assign names to each forward and reverse sequence reads file
fwd1="gDNA_S18_L002_R1_001.fastq.gz"
rev1="gDNA_S18_L002_R2_001.fastq.gz"
name1="PE500"

# Trim PE reads
echo "Trimming the PE reads"
fastp \
   -i ${fwd1} \
   -I ${rev1} \
   -o ${name1}_F.trimmed.fq.gz \
   -O ${name1}_R.trimmed.fq.gz \
   --detect_adapter_for_pe \
   --cut_front \
   --cut_tail \
   --cut_window_size=4 \
   --cut_mean_quality=20 \
   --qualified_quality_phred=20 \
   --unqualified_percent_limit=30 \
   --n_base_limit=5 \
   --length_required=50 \
   --low_complexity_filter \
   --complexity_threshold=30 \
   --overrepresentation_analysis \
   --json=${name1}.json \
   --html=${name1}.html \
   --report_title="$name1" \
   --thread=8

Parameters Explained:

See the Output HTML/PDF Files from fastp below:

Output Summary

PE500 Reads

Detecting adapter sequence for read1...
Illumina TruSeq Adapter Read 1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA

Detecting adapter sequence for read2...
Illumina TruSeq Adapter Read 2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Read1 before filtering:
total reads: 313465270
total bases: 47333255770
Q20 bases: 44459515886(93.9287%)
Q30 bases: 41161938102(86.962%)

Read1 after filtering:
total reads: 282679799
total bases: 40866088764
Q20 bases: 39259694128(96.0691%)
Q30 bases: 36722535538(89.8607%)

Read2 before filtering:
total reads: 313465270
total bases: 47333255770
Q20 bases: 43740886904(92.4105%)
Q30 bases: 40215115322(84.9617%)

Read2 after filtering:
total reads: 282679799
total bases: 40844727907
Q20 bases: 39168769257(95.8968%)
Q30 bases: 36532722615(89.4429%)

Filtering result:
reads passed filter: 565359598
reads failed due to low quality: 56453630
reads failed due to too many N: 7128
reads failed due to too short: 4751908
reads failed due to low complexity: 358276
reads with adapter trimmed: 117911171
bases trimmed due to adapters: 3697697721

Duplication rate: 3.22418%

Insert size peak (evaluated by paired-end reads): 151

JSON report: PE500.json
HTML report: PE500.html

fastp v0.19.6, time used: 3677 seconds

MP5k Reads

# MP5k MP reads
Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
No adapter detected for read2

Read1 before filtering:
total reads: 72226739
total bases: 8555203278
Q20 bases: 8169036254(95.4862%)
Q30 bases: 7611101112(88.9646%)

Read1 after filtering:
total reads: 69302647
total bases: 8211891813
Q20 bases: 7907272937(96.2905%)
Q30 bases: 7396093669(90.0656%)

Read2 before filtering:
total reads: 72226739
total bases: 8652737388
Q20 bases: 8158905710(94.2928%)
Q30 bases: 7528610405(87.0084%)

Read2 after filtering:
total reads: 69302647
total bases: 8269710924
Q20 bases: 7909224976(95.6409%)
Q30 bases: 7337874335(88.7319%)

Filtering result:
reads passed filter: 138605294
reads failed due to low quality: 5687612
reads failed due to too many N: 1332
reads failed due to too short: 134194
reads failed due to low complexity: 25046
reads with adapter trimmed: 88598
bases trimmed due to adapters: 4520619

Duplication rate: 79.9732%

Insert size peak (evaluated by paired-end reads): 0

JSON report: MP5k.json
HTML report: MP5k.html

fastp v0.19.6, time used: 1021 seconds

# MP5k unk reads
Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
No adapter detected for read2

Read1 before filtering:
total reads: 60434557
total bases: 9103963533
Q20 bases: 8383688714(92.0883%)
Q30 bases: 7644980028(83.9742%)

Read1 after filtering:
total reads: 50514384
total bases: 7593290562
Q20 bases: 7236813461(95.3054%)
Q30 bases: 6701035903(88.2494%)

Read2 before filtering:
total reads: 60434557
total bases: 9104941968
Q20 bases: 8033994501(88.2377%)
Q30 bases: 7129774671(78.3066%)

Read2 after filtering:
total reads: 50514384
total bases: 7590443800
Q20 bases: 7076651939(93.2311%)
Q30 bases: 6410485139(84.4547%)

Filtering result:
reads passed filter: 101028768
reads failed due to low quality: 19123774
reads failed due to too many N: 1782
reads failed due to too short: 624196
reads failed due to low complexity: 90594
reads with adapter trimmed: 57920
bases trimmed due to adapters: 3690502

Duplication rate: 63.1683%

Insert size peak (evaluated by paired-end reads): 0

JSON report: MP5k_unk.json
HTML report: MP5k_unk.html

fastp v0.19.6, time used: 986 seconds

# MP5k PE reads
Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
No adapter detected for read2

Read1 before filtering:
total reads: 87847026
total bases: 10620316674
Q20 bases: 9987265240(94.0392%)
Q30 bases: 9214906794(86.7668%)

Read1 after filtering:
total reads: 79605083
total bases: 9409417239
Q20 bases: 8997865087(95.6262%)
Q30 bases: 8356854855(88.8137%)

Read2 before filtering:
total reads: 87847026
total bases: 10742746146
Q20 bases: 9980076411(92.9006%)
Q30 bases: 9114573642(84.844%)

Read2 after filtering:
total reads: 79605083
total bases: 9317609181
Q20 bases: 8837411964(94.8463%)
Q30 bases: 8123820342(87.1878%)

Filtering result:
reads passed filter: 159210166
reads failed due to low quality: 12503538
reads failed due to too many N: 1382
reads failed due to too short: 3777180
reads failed due to low complexity: 201786
reads with adapter trimmed: 16886874
bases trimmed due to adapters: 934315765

Duplication rate: 58.4791%

Insert size peak (evaluated by paired-end reads): 151

JSON report: MP5k_pe.json
HTML report: MP5k_pe.html

fastp v0.19.6, time used: 1274 seconds

# MP5k SE reads
Detecting adapter sequence for read1...
No adapter detected for read1

Read1 before filtering:
total reads: 17345347
total bases: 1152136349
Q20 bases: 1103155168(95.7487%)
Q30 bases: 1030205878(89.417%)

Read1 after filtering:
total reads: 16861797
total bases: 1124360058
Q20 bases: 1085461753(96.5404%)
Q30 bases: 1017086457(90.4591%)

Filtering result:
reads passed filter: 16861797
reads failed due to low quality: 351338
reads failed due to too many N: 55
reads failed due to too short: 131753
reads failed due to low complexity: 404
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate (may be overestimated since this is SE data): 69.1975%

JSON report: MP5k_se.json
HTML report: MP5k_se.html

fastp v0.19.6, time used: 100 seconds

MP10k Reads

# MP10k MP reads
Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
No adapter detected for read2

Read1 before filtering:
total reads: 70317539
total bases: 8252846187
Q20 bases: 7888440507(95.5845%)
Q30 bases: 7356591994(89.1401%)

Read1 after filtering:
total reads: 67429656
total bases: 7916626307
Q20 bases: 7629893125(96.3781%)
Q30 bases: 7143698146(90.2366%)

Read2 before filtering:
total reads: 70317539
total bases: 8353353046
Q20 bases: 7861552795(94.1125%)
Q30 bases: 7239715878(86.6684%)

Read2 after filtering:
total reads: 67429656
total bases: 7975344398
Q20 bases: 7616341234(95.4986%)
Q30 bases: 7053148356(88.4369%)

Filtering result:
reads passed filter: 134859312
reads failed due to low quality: 5602906
reads failed due to too many N: 1352
reads failed due to too short: 151746
reads failed due to low complexity: 19762
reads with adapter trimmed: 92364
bases trimmed due to adapters: 4738508

Duplication rate: 76.9895%

Insert size peak (evaluated by paired-end reads): 0

JSON report: MP10k.json
HTML report: MP10k.html

fastp v0.19.6, time used: 591 seconds

# MP10k unk reads
Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
No adapter detected for read2

Read1 before filtering:
total reads: 52979519
total bases: 7981417831
Q20 bases: 7345057582(92.027%)
Q30 bases: 6701747698(83.9669%)

Read1 after filtering:
total reads: 40033628
total bases: 6017186104
Q20 bases: 5741092250(95.4116%)
Q30 bases: 5322663420(88.4577%)

Read2 before filtering:
total reads: 52979519
total bases: 7982376176
Q20 bases: 6875842548(86.1378%)
Q30 bases: 6039886071(75.6653%)

Read2 after filtering:
total reads: 40033628
total bases: 6014358388
Q20 bases: 5589766113(92.9404%)
Q30 bases: 5047412465(83.9227%)

Filtering result:
reads passed filter: 80067256
reads failed due to low quality: 22907050
reads failed due to too many N: 1380
reads failed due to too short: 2851036
reads failed due to low complexity: 132316
reads with adapter trimmed: 43496
bases trimmed due to adapters: 2799867

Duplication rate: 43.0167%

Insert size peak (evaluated by paired-end reads): 0

JSON report: MP10k_unk.json
HTML report: MP10k_unk.html

fastp v0.19.6, time used: 546 seconds

# MP10k PE reads
Detecting adapter sequence for read1...
No adapter detected for read1

Detecting adapter sequence for read2...
No adapter detected for read2

Read1 before filtering:
total reads: 92595412
total bases: 10339427498
Q20 bases: 9734991913(94.1541%)
Q30 bases: 8989871407(86.9475%)

Read1 after filtering:
total reads: 76806899
total bases: 8989408271
Q20 bases: 8602162029(95.6922%)
Q30 bases: 7994251470(88.9297%)

Read2 before filtering:
total reads: 92595412
total bases: 10456199676
Q20 bases: 9696992802(92.7392%)
Q30 bases: 8841628845(84.5587%)

Read2 aftering filtering:
total reads: 76806899
total bases: 8846079072
Q20 bases: 8373740525(94.6605%)
Q30 bases: 7680890174(86.8282%)

Filtering result:
reads passed filter: 153613798
reads failed due to low quality: 12159456
reads failed due to too many N: 1128
reads failed due to too short: 19079170
reads failed due to low complexity: 337272
reads with adapter trimmed: 22139306
bases trimmed due to adapters: 1220839087

Duplication rate: 45.2146%

Insert size peak (evaluated by paired-end reads): 151

JSON report: MP10k_pe.json
HTML report: MP10k_pe.html

fastp v0.19.6, time used: 776 seconds

# MP10k SE reads
Detecting adapter sequence for read1...
No adapter detected for read1

Read1 before filtering:
total reads: 18119749
total bases: 1257561974
Q20 bases: 1205203768(95.8365%)
Q30 bases: 1126084117(89.545%)

Read1 after filtering:
total reads: 17639144
total bases: 1229482188
Q20 bases: 1187332177(96.5717%)
Q30 bases: 1112826453(90.5118%)

Filtering result:
reads passed filter: 17639144
reads failed due to low quality: 354238
reads failed due to too many N: 50
reads failed due to too short: 125992
reads failed due to low complexity: 325
reads with adapter trimmed: 0
bases trimmed due to adapters: 0

Duplication rate (may be overestimated since this is SE data): 73.236%

JSON report: MP10k_se.json
HTML report: MP10k_se.html

fastp -i /work/frr6/SHAD/NXTRIM/MP10k.se.fastq.gz -o MP10k_se.trimmed.fq.gz --cut_front --cut_tail --cut_window_size=4 --cut_mean_quality=20 --qualified_quality_phred=20 --unqualified_percent_limit=30 --n_base_limit=5 --length_required=50 --low_complexity_filter --complexity_threshold=30 --overrepresentation_analysis --json=MP10k_se.json --html=MP10k_se.html --report_title=MP10k_se --thread=8 
fastp v0.19.6, time used: 94 seconds

Step 3: Process PE overlaps

The software pear v0.9.11 was used to merge paired-end reads. The resulting longer reads are SE but can significantly improve genome assemblies. The publication can be found here:
Zhang J, Kobert K, Flouri T, Stamatakis A (2014) PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics 30(5):614-20. https://doi.org/10.1093/bioinformatics/btt593
Alternatively, but not shown below, I tried running flash v1.2.11, but I liked pear much better. You can see the script flash.sh if this is of interest.

Installation:

# Downloaded from http://www.exelixis-lab.org/web/software/pear
gunzip pear-src-0.9.11.tar.gz
tar -xvf pear-src-0.9.11.tar
cd pear-src-0.9.11
./configure PREFIX=/dscrhome/frr6/bin/
make

Run Pear
An example run is shown below. NOTE: We have to process both the cleaned PE500 reads and the cleaned PE reads produced from NxTrim above. Please see the script pear.sh for more details on Job information.

# Set file names
pe500f="PE500_F.trimmed.fq.gz"
pe500r="PE500_R.trimmed.fq.gz"
mp5kf="MP5k_pe_F.trimmed.fq.gz"
mp5kr="MP5k_pe_R.trimmed.fq.gz"
mp10kf="MP10k_pe_F.trimmed.fq.gz"
mp10kr="MP10k_pe_R.trimmed.fq.gz"

# PE500 reads
pear \
   -o PE500.trimmed.pear \
   -f ${pe500f} \
   -r ${pe500r} \
   -j 16 \
   -k

# From the MP5k reads
pear \
   -o MP5k_pe.trimmed.pear \
   -f ${mp5kf} \
   -r ${mp5kr} \
   -j 16 \
   -k

# From the MP10k reads
pear \
   -o MP10k_pe.trimmed.pear \
   -f ${mp10kf} \
   -r ${mp10kr} \
   -j 16 \
   -k

# Compress reads
gzip PE500.trimmed.pear.assembled.fastq; mv PE500.trimmed.pear.assembled.fastq.gz PE500_se.trimmed.pear.fq.gz
gzip PE500.trimmed.pear.discarded.fastq
gzip PE500.trimmed.pear.unassembled.forward.fastq; mv PE500.trimmed.pear.unassembled.forward.fastq.gz PE500_F.trimmed.pear.fq.gz
gzip PE500.trimmed.pear.unassembled.reverse.fastq; mv PE500.trimmed.pear.unassembled.reverse.fastq.gz PE500_R.trimmed.pear.fq.gz
echo "Finished compressing PE500 reads"

gzip MP5k_pe.trimmed.pear.assembled.fastq; mv MP5k_pe.trimmed.pear.assembled.fastq.gz MP5k_pe_se.trimmed.pear.fq.gz
gzip MP5k_pe.trimmed.pear.discarded.fastq
gzip MP5k_pe.trimmed.pear.unassembled.forward.fastq; mv MP5k_pe.trimmed.pear.unassembled.forward.fastq.gz MP5k_pe_F.trimmed.pear.fq.gz
gzip MP5k_pe.trimmed.pear.unassembled.reverse.fastq; mv MP5k_pe.trimmed.pear.unassembled.reverse.fastq.gz MP5k_pe_R.trimmed.pear.fq.gz
echo "Finished compressing MP5k PE reads"

gzip MP10k_pe.trimmed.pear.assembled.fastq; mv MP10k_pe.trimmed.pear.assembled.fastq.gz MP10k_pe_se.trimmed.pear.fq.gz
gzip MP10k_pe.trimmed.pear.discarded.fastq
gzip MP10k_pe.trimmed.pear.unassembled.forward.fastq; mv MP10k_pe.trimmed.pear.unassembled.forward.fastq.gz MP10k_pe_F.trimmed.pear.fq.gz
gzip MP10k_pe.trimmed.pear.unassembled.reverse.fastq; mv MP10k_pe.trimmed.pear.unassembled.reverse.fastq.gz MP10k_pe_R.trimmed.pear.fq.gz
echo "Finished compressing MP10k PE reads"

Parameters Explained:

Output Summary

PE500 Reads

 ____  _____    _    ____ 
|  _ \| ____|  / \  |  _ \
| |_) |  _|   / _ \ | |_) |
|  __/| |___ / ___ \|  _ <
|_|   |_____/_/   \_\_| \_\

PEAR v0.9.11 [Nov 5, 2017]

Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR
Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593

Forward reads file.................: /work/frr6/SHAD/FASTP/PE500_F.trimmed.fq.gz
Reverse reads file.................: /work/frr6/SHAD/FASTP/PE500_R.trimmed.fq.gz
PHRED..............................: 33
Using empirical frequencies........: YES
Statistical method.................: OES
Maximum assembly length............: 999999
Minimum assembly length............: 50
p-value............................: 0.010000
Quality score threshold (trimming).: 0
Minimum read size after trimming...: 1
Maximal ratio of uncalled bases....: 1.000000
Minimum overlap....................: 10
Scoring method.....................: Scaled score
Threads............................: 16

Allocating memory..................: 200,000,000 bytes
Computing empirical frequencies....: DONE
  A: 0.277289
  C: 0.224540
  G: 0.224961
  T: 0.273210
  13241024 uncalled bases

Assembled reads ...................: 183,983,972 / 282,679,799 (65.086%)
Discarded reads ...................: 0 / 282,679,799 (0.000%)
Not assembled reads ...............: 98,695,827 / 282,679,799 (34.914%)
Assembled reads file...............: PE500.trimmed.pear.assembled.fastq
Discarded reads file...............: PE500.trimmed.pear.discarded.fastq
Unassembled forward reads file.....: PE500.trimmed.pear.unassembled.forward.fastq
Unassembled reverse reads file.....: PE500.trimmed.pear.unassembled.reverse.fastq

MP5k_pe Reads

 ____  _____    _    ____ 
|  _ \| ____|  / \  |  _ \
| |_) |  _|   / _ \ | |_) |
|  __/| |___ / ___ \|  _ <
|_|   |_____/_/   \_\_| \_\

PEAR v0.9.11 [Nov 5, 2017]

Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR
Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593

Forward reads file.................: /work/frr6/SHAD/FASTP/MP5k_pe_F.trimmed.fq.gz
Reverse reads file.................: /work/frr6/SHAD/FASTP/MP5k_pe_R.trimmed.fq.gz
PHRED..............................: 33
Using empirical frequencies........: YES
Statistical method.................: OES
Maximum assembly length............: 999999
Minimum assembly length............: 50
p-value............................: 0.010000
Quality score threshold (trimming).: 0
Minimum read size after trimming...: 1
Maximal ratio of uncalled bases....: 1.000000
Minimum overlap....................: 10
Scoring method.....................: Scaled score
Threads............................: 16

Allocating memory..................: 200,000,000 bytes
Computing empirical frequencies....: DONE
  A: 0.283777
  C: 0.214987
  G: 0.213886
  T: 0.287350
  3225996 uncalled bases

Assembled reads ...................: 12,346,077 / 79,605,083 (15.509%)
Discarded reads ...................: 0 / 79,605,083 (0.000%)
Not assembled reads ...............: 67,259,006 / 79,605,083 (84.491%)
Assembled reads file...............: MP5k_pe.trimmed.pear.assembled.fastq
Discarded reads file...............: MP5k_pe.trimmed.pear.discarded.fastq
Unassembled forward reads file.....: MP5k_pe.trimmed.pear.unassembled.forward.fastq
Unassembled reverse reads file.....: MP5k_pe.trimmed.pear.unassembled.reverse.fastq

MP10k_pe Reads

 ____  _____    _    ____ 
|  _ \| ____|  / \  |  _ \
| |_) |  _|   / _ \ | |_) |
|  __/| |___ / ___ \|  _ <
|_|   |_____/_/   \_\_| \_\

PEAR v0.9.11 [Nov 5, 2017]

Citation - PEAR: a fast and accurate Illumina Paired-End reAd mergeR
Zhang et al (2014) Bioinformatics 30(5): 614-620 | doi:10.1093/bioinformatics/btt593

Forward reads file.................: /work/frr6/SHAD/FASTP/MP10k_pe_F.trimmed.fq.gz
Reverse reads file.................: /work/frr6/SHAD/FASTP/MP10k_pe_R.trimmed.fq.gz
PHRED..............................: 33
Using empirical frequencies........: YES
Statistical method.................: OES
Maximum assembly length............: 999999
Minimum assembly length............: 50
p-value............................: 0.010000
Quality score threshold (trimming).: 0
Minimum read size after trimming...: 1
Maximal ratio of uncalled bases....: 1.000000
Minimum overlap....................: 10
Scoring method.....................: Scaled score
Threads............................: 16

Allocating memory..................: 200,000,000 bytes
Computing empirical frequencies....: DONE
  A: 0.282901
  C: 0.215660
  G: 0.214794
  T: 0.286645
  3073285 uncalled bases

Assembled reads ...................: 14,781,418 / 76,806,899 (19.245%)
Discarded reads ...................: 0 / 76,806,899 (0.000%)
Not assembled reads ...............: 62,025,481 / 76,806,899 (80.755%)
Assembled reads file...............: MP10k_pe.trimmed.pear.assembled.fastq
Discarded reads file...............: MP10k_pe.trimmed.pear.discarded.fastq
Unassembled forward reads file.....: MP10k_pe.trimmed.pear.unassembled.forward.fastq
Unassembled reverse reads file.....: MP10k_pe.trimmed.pear.unassembled.reverse.fastq

Step 4: Remove duplicates

The software clumpify v38.34 from the BBTools package was used to remove duplicated reads (both PE or SE). This process removes unnecessary reads that won’t contribute ultimately to the assemblies and scaffolding. There is no associated publication. Alternatively, but not shown below, I have run fastuniq vx, but I like clumpify much better since it also works on SE reads and can remove optical duplicates as well.. You can see the script fastuniq.sh if this is of interest.

Installation:

# Downloaded bbtools
wget https://sourceforge.net/projects/bbmap/files/BBMap_38.34.tar.gz
tar -zxvf BBMap_38.34.tar.gz
cd bbmap/
ln -s /dscrhome/frr6/PROGRAMS/bbmap/clumpify.sh /dscrhome/frr6/bin/clumpify.sh

Run clumpify
An example run is shown below. NOTE: We have to process both PE and SE files. Please see the script clump.sh for more details on Job information.

# Process all the SE files
./clump.sh SE PE500 PE500_se.trimmed.pear.fq.gz empty
./clump.sh SE MP5k_pe MP5k_pe_se.trimmed.pear.fq.gz empty
./clump.sh SE MP10k_pe MP10k_pe_se.trimmed.pear.fq.gz empty
./clump.sh SE MP5k MP5k_se.trimmed.fq.gz empty
./clump.sh SE MP10k MP10k_se.trimmed.fq.gz empty

# Process all the MP PE files
./clump.sh PE MP5k MP5k_F.trimmed.fq.gz MP5k_R.trimmed.fq.gz
./clump.sh PE MP5k_unk MP5k_unk_F.trimmed.fq.gz MP5k_unk_R.trimmed.fq.gz
./clump.sh PE MP5k_pe MP5k_pe_F.trimmed.pear.fq.gz MP5k_pe_R.trimmed.pear.fq.gz
./clump.sh PE MP10k MP10k_F.trimmed.fq.gz MP10k_R.trimmed.fq.gz
./clump.sh PE MP10k_unk MP10k_unk_F.trimmed.fq.gz MP10k_unk_R.trimmed.fq.gz
./clump.sh PE MP10k_pe MP10k_pe_F.trimmed.pear.fq.gz MP10k_pe_R.trimmed.pear.fq.gz

# Process the PE PE files
./clump.sh PE PE500 PE500_F.trimmed.pear.fq.gz PE500_R.trimmed.pear.fq.gz

clump.sh script

#!/bin/bash -l
# Read in file name stem
n=$2
read1=$3
read2=$4

if [ "$1" = 'PE' ]
   then
   echo "Running Clumpify PE mode..."
   clumpify.sh \
      -in=${read1} \
      -in2=${read2} \
      -out=${n}_F.trimmed.deduped.fq.gz \
      -out2=${n}_R.trimmed.deduped.fq.gz \
      dedupe=t \
      containment=f \
      optical=t \
      dupedist=12000
   echo "Finished Clumpify PE mode..."
   elif [ "$1" = 'SE' ]
   then
   echo "Running Clumpify SE mode..."
   clumpify.sh \
      -in=${read1} \
      -out=${n}_se.trimmed.deduped.fq.gz \
      dedupe=t \
      containment=f \
      optical=t \
      dupedist=12000
   echo "Finished Clumpify SE mode..."
   else
   echo "Error:  Must state whether to RUn SE or PE mode!"
fi

Parameters Explained:

Output Summary

PE500 SE Reads

Reads In:            183983972
Clumps Formed:        24043576
Duplicates Found:      5314343
Reads Out:           178669629
Bases Out:         32580881321
Total time: 	5376.857 seconds.

MP5k PE SE Reads

Reads In:             12346077
Clumps Formed:         2776147
Duplicates Found:       343249
Reads Out:            12002828
Bases Out:          1152843265
Total time: 	231.658 seconds.

MP10k PE SE Reads

Reads In:             14781418
Clumps Formed:         2104231
Duplicates Found:       494149
Reads Out:            14287269
Bases Out:          1316546945
Total time: 	224.572 seconds.

MP5k SE Reads

Reads In:             16861797
Clumps Formed:         3281375
Duplicates Found:       571332
Reads Out:            16290465
Bases Out:          1084833963
Total time: 	218.710 seconds.

MP10k SE Reads

Reads In:             17639144
Clumps Formed:         2279245
Duplicates Found:       800526
Reads Out:            16838618
Bases Out:          1174631851
Total time: 	217.946 seconds.

MP5k MP reads

Reads In:            138605294
Clumps Formed:         9704475
Duplicates Found:      3174154
Reads Out:           135431140
Bases Out:         16109764707
Total time: 	1804.868 seconds.

MP5k Unk reads

Reads In:            101028768
Clumps Formed:         9212697
Duplicates Found:      1299408
Reads Out:            99729360
Bases Out:         14988178619
Total time: 	5752.525 seconds.

MP5k PE reads

Reads In:            134518012
Clumps Formed:        10440484
Duplicates Found:      2831958
Reads Out:           131686054
Bases Out:         16051253878
Total time: 	4844.656 seconds.

MP10k MP reads

Reads In:            134859312
Clumps Formed:         6695378
Duplicates Found:      4276242
Reads Out:           130583070
Bases Out:         15392514683
Total time: 	3751.891 seconds.

MP10k Unk reads

Reads In:             80067256
Clumps Formed:         5758195
Duplicates Found:      1513806
Reads Out:            78553450
Bases Out:         11803770125
Total time: 	4356.128 seconds.

MP10k PE reads

Reads In:            124050962
Clumps Formed:         7288410
Duplicates Found:      3684230
Reads Out:           120366732
Bases Out:         14662031651
Total time: 	3916.013 seconds.

PE500 PE Reads

Reads In:            197391654
Clumps Formed:        19586064
Duplicates Found:      2238096
Reads Out:           195153558
Bases Out:         29388195556
Total time: 	10547.929 seconds.

Step 5: Remove mitochondrial reads

Since a mitogenome sequence for American shad already exists, it is useful to remove mitochondrial sequences from the dataset. This reduces the overall computational burden and facilitates less cleanup of the final, assembled scaffolds. It is possible this may filter some occassional nuclear-mitochondrial insertions, but we can accept that since the read pair information will help minimize this. We use the efetch command from the NCBI E-utilities toolset to download the mitogenome sequence. The mitogenome was published in:
Bi YH and Chen XW (2011). Mitochondrial genome of the American shad Alosa sapidissima. Mitochondrial DNA 22(1-2):9-11. http://doi.org/10.3109/19401736.2010.551659.

Download the mitogenome

efetch \
   -db nucleotide \
   -format fasta \
   -id NC_014690.1 >> Asap_mito.fasta

Do the Mapping and Extract the Remaining Reads
This uses bowtie2 v2.3.0, which has been enabled for mate-pair reads. Only examples are shown below, please see remove-mitoPE500.sh script for full code. bowtie2 is published in:
Langmead B, Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nature Methods 9(4):357-359. https://doi.org/10.1038/nmeth.1923

sbatch -J PE500 -o PE500.out -e PE500.err  mito.sh PE PE500 /work/frr6/SHAD/CLUMPIFY/PE500_F.trimmed.deduped.fq.gz /work/frr6/SHAD/CLUMPIFY/PE500_R.trimmed.deduped.fq.gz

sbatch -J PE500_se -o PE500_se.out -e PE500_se.err  mito.sh SE PE500_se /work/frr6/SHAD/CLUMPIFY/PE500_se.trimmed.deduped.fq.gz empty

sbatch -J MP10k -o MP10k.out -e MP10k.err  mito.sh MP MP10k /work/frr6/SHAD/CLUMPIFY/MP10k_F.trimmed.deduped.fq.gz /work/frr6/SHAD/CLUMPIFY/MP10k_R.trimmed.deduped.fq.gz

sbatch -J MP10k_unk -o MP10k_unk.out -e MP10k_unk.err  mito.sh MP MP10k_unk /work/frr6/SHAD/CLUMPIFY/MP10k_unk_F.trimmed.deduped.fq.gz /work/frr6/SHAD/CLUMPIFY/MP10k_unk_R.trimmed.deduped.fq.gz

sbatch -J MP10k_pe -o MP10k_pe.out -e MP10k_pe.err  mito.sh PE MP10k_pe /work/frr6/SHAD/CLUMPIFY/MP10k_pe_F.trimmed.deduped.fq.gz /work/frr6/SHAD/CLUMPIFY/MP10k_pe_R.trimmed.deduped.fq.gz

sbatch -J MP10k_pe_se -o MP10k_pe_se.out -e MP10k_pe_se.err  mito.sh SE MP10k_pe_se /work/frr6/SHAD/CLUMPIFY/MP10k_pe_se.trimmed.deduped.fq.gz empty

sbatch -J MP10k_se -o MP10k_se.out -e MP10k_se.err  mito.sh SE MP10k_se /work/frr6/SHAD/CLUMPIFY/MP10k_se.trimmed.deduped.fq.gz empty
################
sbatch -J MP5k -o MP5k.out -e MP5k.err  mito.sh MP MP5k /work/frr6/SHAD/CLUMPIFY/MP5k_F.trimmed.deduped.fq.gz /work/frr6/SHAD/CLUMPIFY/MP5k_R.trimmed.deduped.fq.gz

sbatch -J MP5k_unk -o MP5k_unk.out -e MP5k_unk.err  mito.sh MP MP5k_unk /work/frr6/SHAD/CLUMPIFY/MP5k_unk_F.trimmed.deduped.fq.gz /work/frr6/SHAD/CLUMPIFY/MP5k_unk_R.trimmed.deduped.fq.gz

sbatch -J MP5k_pe -o MP5k_pe.out -e MP5k_pe.err  mito.sh PE MP5k_pe /work/frr6/SHAD/CLUMPIFY/MP5k_pe_F.trimmed.deduped.fq.gz /work/frr6/SHAD/CLUMPIFY/MP5k_pe_R.trimmed.deduped.fq.gz

sbatch -J MP5k_pe_se -o MP5k_pe_se.out -e MP5k_pe_se.err  mito.sh SE MP5k_pe_se /work/frr6/SHAD/CLUMPIFY/MP5k_pe_se.trimmed.deduped.fq.gz empty

sbatch -J MP5k_se -o MP5k_se.out -e MP5k_se.err  mito.sh SE MP5k_se /work/frr6/SHAD/CLUMPIFY/MP5k_se.trimmed.deduped.fq.gz empty
# Build index of reference
bowtie2-build Asap_mito.fasta Asap_mito

# Run mapping script, remember the parameters differ for each of the three libraries.
sbatch remove-mitoPE500.sh

# Do Mapping for PE reads
bowtie2 \
   --phred33 \
   -q \
   --very-sensitive \
   --minins 0 \
   --maxins 500 \
   --fr \
   --threads 8 \
   --reorder \
   -x Asap_mito \
   -1 PE500_F.trimmed.uniq.fq.gz \
   -2 PE500_R.trimmed.uniq.fq.gz | \
   samtools1.3 view -b -F 2 | \
   samtools1.3 sort -T PE500.tmp -n -O bam | \
   bedtools bamtofastq -i - -fq PE500_F.trimmed.uniq.noMito.fq -fq2 PE500_R.trimmed.uniq.noMito.fq

# Compress the resulting reads
gzip PE500_F.trimmed.uniq.noMito.fq
gzip PE500_R.trimmed.uniq.noMito.fq

Parameters Explained:

Final Summary of Cleaning:

Name # Paired Reads # Bases Q20 bases Q30 Bases
PE500 280,454,045 81,520,790,896 96.0% 89.6%
MP5k 207,677,745 62,423,641,430 95.2% 88.1%
MP10k 194,228,811 58,364,280,836 95.3% 88.1%
Total 682,360,601 202,308,713,162 95.5% 88.7%