Opinion: Is a PhD helpful for a data science career?

Author: Vincent Granville

The answer to this question is not black and white. It depends on where you live, what you did during your PhD program, how much time and money you spent on it, what kind of jobs you are interested in, and what other experience you have. You could say the same about earning an MBA.

If your PhD took place abroad, as in my case, you may not have spent much money to earn it (you might even have been well paid). Depending on your school, it might have been cross-disciplinary (computer science, dynamical systems, bioinformatics, and statistics, for instance) and applied (with a significant part spent writing code and processing data), and you might have been able to work for a company in the process, gaining real professional experience. All of these are big pluses, and some top schools offer all of them. I am not sure whether you can pursue a PhD program online while you are employed; if so, that would be another advantage.

That said, PhD programs are aimed at getting an (elusive) academic position, or at least focus on research and publishing. If you don’t like research and publishing, you will suffer, though you can opt out at any time. There are also other ways to gain the same amount of knowledge, writing and research skills, and expertise: for instance, in a program such as my doctorship. If you are a self-learner, you can do the same quality of work and gain the same experience all by yourself, while working full-time. For instance, I became a PhD-level number theorist, very applied and data-intensive (working on very big data!), blending computational number theory with data analysis and statistics, all by myself, despite no formal training in number theory. The book that I published recently about my research is indeed focused on the data science aspects, and it is of better quality than my actual PhD dissertation of 25 years ago: more advanced, more innovative, more groundbreaking, and written in very simple English, making advanced topics accessible to beginners. That is more difficult than publishing jargon in esoteric journals, and more rewarding from a professional and personal point of view.

It helps if you manage to make groundbreaking discoveries during your PhD years, with significant potential for applications; yet very few achieve this goal during their PhD years, and indeed some professors never achieve it in their entire lifetime. In order to do so, you must work on the right kinds of problems; there is still plenty of low-hanging fruit, for instance in AI. Nevertheless, some positions in research laboratories (be it Google, Microsoft, or the government) require a PhD degree, and are very competitive unless you are ready to start as an intern and you come from a top program. If you become an entrepreneur and want to raise VC funding, a PhD degree helps a lot, and it earns you some level of trust from your clients, investors, readers (if you write books), or users (if you teach on platforms such as Coursera).

Probably what scares many hiring managers about PhDs (unless they hold a PhD themselves, as many do in the pharmaceutical industry) is:

lack of people skills (managing a team of people),
very vertical knowledge, unwillingness to work on mundane stuff,
poor adaptation to “real life” and collaborative work,
being stuck in some rigid theoretical framework, resulting in lack of vision (*),
a 95/5 mindset, as opposed to 80/20.

The last point is another way to describe perfectionism (not appreciated in the corporate world): the 80/20 rule means you work on a task until you’ve reached 80% of perfection; the remaining 20% takes a lot of time to complete but provides little additional benefit, and prevents you from meeting deadlines as new projects keep popping up every few months. The fact is that real data is never perfect, so perfect models are irrelevant; but during your PhD years you might not have been exposed to dirty data.

Even the “pure” data sets that I used in my number theory experiments are imperfect. You might realize this only when you work with trillions of numbers; that’s when all sorts of oddities start to show up. You realize that you need far more numerical precision than Python’s built-in floating-point numbers can offer (that’s when you get interested in BigNum libraries); that tons of artificial correlations start to pop up and must be detected and ignored; that stuff that seems to be true when computed on 100,000 data points is no longer true if you use 10,000 times more data points; or that random number generators used for simulation purposes have several big flaws that make them useless, so you have to create your own. Even though my data was static and consisted of the (infinite) set of integers and real numbers, the results were sometimes biased due to missing data behaving differently (numbers not included in my analysis due to hardware limitations) and massive errors (sometimes due to lack of precision), which is no different from what classical data scientists have to deal with. Yet many classical data scientists, even with a PhD, ignore or are not aware of these problems, or don’t care if there is some agenda to prove the hypothesis that managers want confirmed.
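As a toy illustration of the precision problem (a sketch chosen for this purpose, not one of the experiments described above), consider the doubling map x -> 2x mod 1, which reads off the binary digits of a number one at a time. With standard 64-bit floats the orbit silently collapses to zero after a few dozen steps, while exact rational arithmetic from Python's standard `fractions` module keeps the true, periodic orbit. The function name `doubling_orbit`, the starting value 1/10, and the step count are assumptions made for the example.

```python
from fractions import Fraction


def doubling_orbit(x, steps):
    """Apply the doubling map x -> (2 * x) mod 1 `steps` times and return the result."""
    for _ in range(steps):
        x = (2 * x) % 1
    return x


if __name__ == "__main__":
    steps = 60  # illustrative choice; the float orbit of 0.1 is already 0.0 by then

    # 64-bit float: each doubling shifts one bit out of the 53-bit significand,
    # so the orbit becomes exactly 0.0 once all stored bits are used up.
    print("float   :", doubling_orbit(0.1, steps))

    # Exact rational arithmetic: the orbit of 1/10 is periodic and never reaches 0.
    print("Fraction:", doubling_orbit(Fraction(1, 10), steps))
```

Nothing in the float computation raises an error or a warning; the result is simply wrong, which is why this kind of problem tends to surface only when the output is checked against exact or higher-precision arithmetic.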

At the same time, in this context and if you care about unearthing the truth, you realize that it helps to use some automated tools (some available online) for pattern detection or to get billions of digits of some numbers; that working in the cloud with a distributed architecture and efficient algorithms helps; and that finding reference material about the topic you are working on (to avoid re-inventing the wheel) requires more than a Google search (posting questions on Quora proved to be very useful). All these skills are useful to land a job. They should be acquired during your PhD years, but can also be acquired independently.

Finally, you should take control of your destiny during your PhD years, rather than being controlled by external factors. For more, read my article “5 myths about PhD data scientists”. Last but not least, nowadays you don’t need a laboratory with millions of dollars’ worth of computers and data sets to do your research: it can be carried out from home on your laptop, and tons of data sets are available for free. You can even create your own, for instance by collecting billions of tweets and analyzing them.

(*) One PhD statistician argued with me that it is impossible to put a statistical distribution on the digits of numbers such as Pi, as there are infinitely many of these digits and anyway they are deterministically generated; surprisingly, the layman understands this concept easily. In my case, I leveraged this apparent paradox to create new cryptographic systems and Fintech predictors, without introducing arcane models from stochastic point process theory to get around this apparent problem. Ironically, I am a PhD statistician too, but I have learned to un-learn the negative aspects of that training, and to keep the best of it. Also, despite having no deadline dictated by a boss, I still feel very much, possibly even more so than in the corporate world, that I must quickly deliver correct and great insights that everyone can understand, even if the price to pay is lack of elegance in my approach. I only have so much time to spend on these exciting problems (I spend most of my time managing my company), and that gives me a sense of urgency, yet happiness when I make great progress and beat my competitors. PhDs who want to work in the corporate world must understand this.
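To see what an empirical distribution on the digits of Pi can look like in practice, here is a minimal sketch. It is not the method behind the cryptographic systems or predictors mentioned above; it simply computes the first decimals of Pi with Machin's formula in integer arithmetic, tabulates digit frequencies, and compares them to a uniform distribution with a chi-square statistic. All function names, the guard-digit count, and the sample size of 1,000 digits are assumptions made for the example.

```python
from collections import Counter


def arctan_inverse(x, scale):
    """Return approximately scale * arctan(1/x), using only integer arithmetic."""
    term = scale // x              # first Taylor term: scale / x
    total = term
    x_squared = x * x
    divisor = 3
    sign = -1
    while term:
        term //= x_squared         # next odd power: scale / x**(2k+1)
        total += sign * (term // divisor)
        sign = -sign
        divisor += 2
    return total


def pi_decimals(n, guard=10):
    """Return the first n decimal digits of Pi (after the leading 3) as a string."""
    scale = 10 ** (n + guard)      # guard digits absorb the truncation error
    # Machin's formula: pi = 16 * arctan(1/5) - 4 * arctan(1/239)
    pi_scaled = 16 * arctan_inverse(5, scale) - 4 * arctan_inverse(239, scale)
    return str(pi_scaled // 10 ** guard)[1:]


def digit_frequencies(digits):
    """Count how often each of the ten decimal digits occurs in the string."""
    counts = Counter(digits)
    return {d: counts.get(d, 0) for d in "0123456789"}


def chi_square_uniform(freqs):
    """Chi-square statistic of the observed counts against equal expected counts."""
    expected = sum(freqs.values()) / 10
    return sum((obs - expected) ** 2 / expected for obs in freqs.values())


if __name__ == "__main__":
    freqs = digit_frequencies(pi_decimals(1000))
    for digit, count in freqs.items():
        print(digit, count)
    print("chi-square vs. uniform:", round(chi_square_uniform(freqs), 2))
```

A small chi-square value here only says that the observed frequencies are compatible with a uniform distribution at this sample size; it is a description of a finite, deterministic sequence, not a proof about all the digits of Pi.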

For related articles from the same author, visit www.VincentGranville.com. Follow me on LinkedIn.
