A new paper is out that provides some guidance to those trying to optimize Stacks parameters.
They sequence a subset of their samples multiple times and leverage these replicates into some nice conclusions about how to set up your analysis. Their recommendations generally match our experience and current protocol pretty well.
Wes and Carita have identified the shearing and size selection step as a driver of variation in sequencing depth. The authors speak to the source of coverage differences across replicates of the same sample:
“A key source of variation between replicate pairs is that the identity of most
(>70%) of the missing loci in a given replicate are not the same in the corresponding
replicate (Fig. 3c), […]. As these differences are between samples from the same
DNA source that were processed together, it seems that stochastic PCR/sequencing
sampling events and imprecise size selection are the main sources of heterogeneous
coverage among loci.”
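If you have replicate pairs of your own, a quick way to check the same thing is to ask what fraction of the loci missing from one replicate are actually present in the other. The snippet below is only a rough sketch: the function name and the toy locus IDs are made up for illustration, and it assumes you have already pulled the set of recovered locus IDs for each replicate out of your catalog.

```python
def unshared_missing_fraction(all_loci, rep_a, rep_b):
    """Fraction of loci missing from rep_a that are NOT also missing from rep_b.

    all_loci: set of every locus ID in the catalog (hypothetical input)
    rep_a, rep_b: sets of locus IDs recovered in each replicate
    """
    missing_a = all_loci - rep_a
    missing_b = all_loci - rep_b
    if not missing_a:
        return 0.0  # nothing is missing from rep_a, so nothing to compare
    # Loci absent in A but recovered in B
    return len(missing_a - missing_b) / len(missing_a)


# Toy data, not from the paper
all_loci = set(range(1, 11))
rep_a = {1, 2, 3, 4, 5, 6, 7}        # missing 8, 9, 10
rep_b = {1, 2, 3, 4, 5, 6, 8, 9}     # missing 7, 10

frac = unshared_missing_fraction(all_loci, rep_a, rep_b)
print(f"{frac:.2f} of the loci missing in A were recovered in B")
```

A high value means the dropout is mostly stochastic rather than tied to particular loci, which is the pattern the authors describe.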
Be careful about setting the -m/-M parameters too high:
“Importantly, the parameter profiles at which incorrect clustering occurred were high
values for minimal coverage (-m) and the number of mismatches between loci when
processing an individual (-M)”
Put a bound on the error rate! They suggest:
“a SNP calling model with an upper bound of 0.05”
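For reference, here is a sketch of what a bounded-model ustacks call might look like. It assumes the Stacks 1.x spelling of the options (`--model_type bounded` and `--bound_high`); the spelling may differ in other versions, so check `ustacks -h` for yours. The helper function, file names, and the -m/-M values are placeholders, not recommendations. The snippet only builds and prints the command; it does not run Stacks.

```python
import shlex


def ustacks_command(reads, out_dir, sample_id, m=3, M=2, bound_high=0.05):
    """Build (but do not run) a ustacks command using the bounded SNP model.

    Option names assume Stacks 1.x; the -m/-M defaults here are placeholders.
    """
    return [
        "ustacks",
        "-f", reads,              # input reads for one sample
        "-o", out_dir,            # output directory
        "-i", str(sample_id),     # numeric sample ID
        "-m", str(m),             # minimum coverage to form a stack
        "-M", str(M),             # mismatches allowed between stacks
        "--model_type", "bounded",
        "--bound_high", str(bound_high),  # upper bound on the error rate
    ]


# Hypothetical file and directory names
cmd = ustacks_command("sample_01.fq", "stacks_out", 1)
print(" ".join(shlex.quote(arg) for arg in cmd))
```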
Mastretta‐Yanes, Alicia, et al. RAD sequencing, genotyping error estimation and de novo assembly optimization for population genetic inference. Molecular Ecology Resources (2014).