If you live and breathe at the intersection of Cyber Security and Data Science, you have probably seen Alexandre Pinto’s DefCon22 talk, #SecureBecauseMath (https://www.youtube.com/…).
Feature Design is the art of creating useful variables (features) -- numerical or otherwise -- that capture the salient details of the patterns contained in data. Feature Design is hard. Machine Learning is not magic, and there is no guarantee that a pattern that is barely present can be represented well enough for an algorithm to latch onto. This remains difficult no matter how much one believes in Deep Learning and tera-feature classification with online algorithms. Any Data Scientist with Machine Learning experience will tell you that Feature Design is where systems are made (or broken). An algorithm is unlikely to discover multi-dimensional correlations on its own without a great deal of well-labeled data -- reliable labels are essential for Supervised Learning. A well-designed feature, however, can bring that connection out far more easily. Apply some human ingenuity and a little prototype-level elbow grease, and suddenly performance improves by leaps and bounds.
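To make this concrete, here is a minimal sketch of what hand-crafted feature design can look like in a security setting. The helper names (`shannon_entropy`, `hostname_features`) and the choice of features (length, entropy, digit ratio) are illustrative assumptions, not anything prescribed by the talk or the text; the point is only that a human picks variables designed to surface a pattern a learner might otherwise miss.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character.
    High entropy can hint at machine-generated names."""
    if not s:
        return 0.0
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def hostname_features(host: str) -> list[float]:
    """A hypothetical hand-designed feature vector for a hostname:
    length, character entropy, and digit ratio -- each chosen by a
    human to capture a salient detail of the underlying pattern."""
    digits = sum(ch.isdigit() for ch in host)
    return [float(len(host)),
            shannon_entropy(host),
            digits / max(len(host), 1)]
```

Three numbers per hostname is a far friendlier input for most learners than the raw string, which is exactly the kind of leverage the paragraph above describes.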
There is another side to the coin, however: does the performance jump reflect the actual skill of the learning system and its ability to generalize, or is it just overtraining in disguise? Notable specialists -- including Trevor Hastie and Robert Tibshirani (both of Stanford University and co-authors of “The Elements of Statistical Learning”) and John Langford (author of Vowpal Wabbit) -- have written at length about hidden overtraining (http://hunch.net/?p=22). A simple mistake, such as adjusting the features after evaluating on the Test Set, can yield improved results in evaluation but fail hard in real operation. There are technically simple (but conceptually non-trivial) safeguards against this, but they require that the researcher or engineer at the very least recognize that this type of error is occurring.
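One of those technically simple safeguards is a disciplined three-way split. The sketch below is one common way to do it (the function name and split fractions are assumptions for illustration): feature and hyperparameter tweaks are judged on the validation set only, and the test set is consulted exactly once, at the very end.

```python
import random

def three_way_split(rows, seed=0, frac_val=0.2, frac_test=0.2):
    """Shuffle once, then partition into train/validation/test.
    Iterate feature designs against the validation set; the
    frozen, final model sees the test set exactly once."""
    rng = random.Random(seed)   # fixed seed: the split never moves
    rows = rows[:]              # copy so the caller's list is untouched
    rng.shuffle(rows)
    n = len(rows)
    n_test = int(n * frac_test)
    n_val = int(n * frac_val)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(list(range(100)))
# Adjusting features after peeking at `test`, then re-evaluating
# on `test`, is precisely the hidden-overtraining mistake: the
# test score stops being an estimate of real-world performance.
```

The discipline, not the code, is the hard part: nothing stops a careless experimenter from looping over the test set, which is why the error must first be recognized as an error.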
So, #math -- that is, the practice of Machine Learning / Statistical Learning -- may indeed be the answer, but it does not absolve us of the responsibility of performing due diligence and doing science the RIGHT way.