ML trading or data mining?
The line is thinner than you think
Everyone runs ML trading. Almost no one can tell when it quietly turns into data mining.
The code is identical. The tools are the same: XGBoost, cross-validation, feature engineering. Yet one produces alpha, the other produces a statistical illusion that dies in the first week of live trading.
I see this line crossed every day. Not out of incompetence. Out of habit.
1. The line is not technical, it’s methodological
ML trading: you start from an economic hypothesis, you build a model to test it.
Data mining: you start from the data, you search for what works, you rationalize after.
The honest test: can you explain why your feature should work BEFORE seeing the backtest? If the justification shows up after the equity curve, you haven’t found alpha. You’ve found a correlation.
2. The red flags
A few patterns that should trigger an alarm in your research:
Sharpe climbing with every iteration
Hyperparameters tuned on the out-of-sample period
Lookback window chosen through grid search
Holdout set “checked” more than once
Sharpe above 3 on a simple strategy, on a handful of assets
You trade BTC because “it doesn’t work as well on ETH”
Two boxes checked, you’re likely data mining. Three, you definitely are.
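To see why a Sharpe that climbs with every iteration is a warning rather than progress, here is a minimal simulation, not a real research loop. Every "variant" is pure Gaussian noise with zero edge, and we simply keep the best in-sample Sharpe seen so far. The function names, the 1% daily volatility and the 252-day sample are assumptions chosen for illustration.

```python
import math
import random
from statistics import mean, stdev


def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a list of per-period returns (zero risk-free rate)."""
    return mean(returns) / stdev(returns) * math.sqrt(periods_per_year)


def best_of_noise(n_variants, n_days=252, seed=0):
    """Track the best in-sample Sharpe across strategies that have no edge at all.

    Returns the running best in-sample Sharpe after each variant, and the
    out-of-sample Sharpe of whichever variant ended up as the "winner".
    """
    rng = random.Random(seed)
    best_so_far, best_oos, path = -math.inf, None, []
    for _ in range(n_variants):
        # Both samples are independent noise: mean 0, daily volatility 1%.
        in_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        out_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpe = annualized_sharpe(in_sample)
        if sharpe > best_so_far:
            best_so_far = sharpe
            best_oos = annualized_sharpe(out_sample)
        path.append(best_so_far)
    return path, best_oos


path, winner_oos = best_of_noise(n_variants=200)
print(f"best in-sample Sharpe after 1 variant:    {path[0]:.2f}")
print(f"best in-sample Sharpe after 200 variants: {path[-1]:.2f}")
print(f"out-of-sample Sharpe of the winner:       {winner_oos:.2f}")
```

The best-so-far Sharpe can only go up as variants accumulate, even though nothing here has any edge. The out-of-sample Sharpe of the winner is just another draw of noise.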
3. The validation framework that closes the door
Three tools, no less:
CPCV (Combinatorial Purged Cross-Validation) instead of naive K-fold
Deflated Sharpe Ratio (Bailey and López de Prado), which explicitly adjusts for the number of trials
A holdout set that stays untouched. Truly untouched.
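A simplified sketch of how CPCV builds its splits, assuming each label overlaps its neighbors by at most `purge` bars. The function name `cpcv_splits` and its parameters are hypothetical, and the symmetric margin used here is a crude stand-in for the separate purge and embargo steps of the full method.

```python
from itertools import combinations


def cpcv_splits(n_samples, n_groups=6, n_test_groups=2, purge=5):
    """Yield (train_idx, test_idx) index lists for combinatorial purged CV.

    Assumption: a label depends on at most `purge` bars around its timestamp,
    so dropping `purge` samples on each side of every test block is enough
    to remove train/test overlap.
    """
    # Contiguous, time-ordered groups of (almost) equal size.
    bounds = [i * n_samples // n_groups for i in range(n_groups + 1)]
    for test_groups in combinations(range(n_groups), n_test_groups):
        test_idx, banned = [], set()
        for g in test_groups:
            start, stop = bounds[g], bounds[g + 1]
            test_idx.extend(range(start, stop))
            # Exclude the test block plus a margin on both sides from training.
            banned.update(range(max(0, start - purge), min(n_samples, stop + purge)))
        train_idx = [i for i in range(n_samples) if i not in banned]
        yield train_idx, test_idx


splits = list(cpcv_splits(n_samples=120, n_groups=6, n_test_groups=2, purge=5))
print(len(splits))  # 15 = C(6, 2) train/test combinations
```

Unlike a single walk-forward path, every group serves as test data in several combinations, which gives a distribution of out-of-sample results instead of one number.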
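The rule that stings: test 20 strategies with no edge at a 5% threshold and you should expect one false discovery on average, with roughly a 64% chance of at least one (1 - 0.95^20, assuming independent tests). If you don’t know how many variants you’ve tested, you’re data mining without knowing it.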
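A minimal sketch of that arithmetic and of the Deflated Sharpe Ratio, following the Bailey and López de Prado formulas as I understand them. The function names are hypothetical, the Sharpe ratios are per-period (not annualized), and `sharpe_std`, the standard deviation of Sharpe estimates across your trials, is an input you have to supply from your own trial log.

```python
import math
from statistics import NormalDist, mean, pstdev

EULER_GAMMA = 0.5772156649015329


def prob_any_false_discovery(n_trials, alpha=0.05):
    """Chance of at least one false positive across independent tests with no real effect."""
    return 1.0 - (1.0 - alpha) ** n_trials


def expected_max_sharpe(n_trials, sharpe_std):
    """Approximate expected best Sharpe among n_trials strategies that have zero true Sharpe."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    z = NormalDist()
    return sharpe_std * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe_ratio(returns, n_trials, sharpe_std):
    """Probability that the observed Sharpe beats the best Sharpe expected from luck alone."""
    t = len(returns)
    mu, sigma = mean(returns), pstdev(returns)
    sr = mu / sigma
    standardized = [(r - mu) / sigma for r in returns]
    skew = mean([x ** 3 for x in standardized])
    kurt = mean([x ** 4 for x in standardized])  # raw kurtosis, 3 for a normal
    sr0 = expected_max_sharpe(n_trials, sharpe_std)
    denom = math.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * math.sqrt(t - 1) / denom)


print(round(prob_any_false_discovery(20), 2))  # 0.64

# Toy return series (an assumption, not market data).
returns = [0.001 + 0.01 * math.sin(i) for i in range(500)]
print(deflated_sharpe_ratio(returns, n_trials=10, sharpe_std=0.05))
print(deflated_sharpe_ratio(returns, n_trials=1000, sharpe_std=0.05))
```

The same return series scores lower once it is known to be the survivor of 1,000 trials rather than 10, which is why the trial count has to be recorded honestly.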
Edge comes from the hypothesis, not from the algorithm.
ML is an alpha extraction tool, not an alpha generator. No library, no model, no feature creates an edge from nothing. If the economic thesis isn’t there before the code, the code won’t make it appear.
The discipline that separates the quant from the data miner: pre-registering hypotheses, the way academic research does. Writing down what you’re looking for before you start looking.
It’s uncomfortable. That’s exactly why so few do it.
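One way to make pre-registration concrete is a small hypothesis log, sketched here under my own assumptions: the `Hypothesis` class, its fields and the example values are all hypothetical, not an established tool. The point is to freeze the rationale before the first backtest and to count every variant tried afterwards.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Hypothesis:
    name: str
    rationale: str  # the economic reason, written BEFORE any backtest
    features: list
    registered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    trials: list = field(default_factory=list)

    def fingerprint(self):
        """SHA-256 of the registered fields, so later edits to the thesis are detectable."""
        payload = json.dumps(
            {
                "name": self.name,
                "rationale": self.rationale,
                "features": self.features,
                "registered_at": self.registered_at,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def log_trial(self, params, sharpe):
        """Record one tested variant and return the running trial count."""
        self.trials.append({"params": params, "sharpe": sharpe})
        return len(self.trials)


h = Hypothesis(
    name="funding-rate mean reversion",  # illustrative example only
    rationale="Extreme funding rates signal crowded positioning that tends to unwind.",
    features=["funding_rate_zscore"],
)
frozen = h.fingerprint()
n = h.log_trial({"lookback": 30}, sharpe=0.8)
print(n, h.fingerprint() == frozen)  # 1 True
```

The running trial count is exactly the number a Deflated Sharpe Ratio calculation needs, and the fingerprint makes it awkward to rewrite the thesis after seeing the equity curve.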


