Cliff Notes for Managing the Data Science Function

Author: William Vorhies

Summary:  An increasing number of larger companies have truly embraced advanced analytics and deploy fairly large numbers of data scientists.  Many of these same companies are the ones beginning to ask about using AI.  Here are some observations and tips on the problems and opportunities associated with managing a larger data science function.


We spend a lot of time looking inward at our profession of data science, studying new developments, looking for anomalies in our own practices, and spreading the word to other practitioners.  But when we look outward to communicate about data science to others it’s different.  Maybe you have had the same experience, but when I talk to new clients it is, as often as not, to educate them at a fairly basic level about what’s possible and what’s not.

The good news is that there’s now a third group:  execs and managers in larger companies who have embraced advanced analytics and who try to keep up by reading, but are not formally trained.  If these analytics managers are data scientists, all well and good.  But as you move up the chain of command just a little bit you’ll soon find yourself talking to someone who may be an enthusiastic supporter but whose well-intentioned self-education still leaves them short on some basic knowledge.

Who are these folks?  Well, Gartner says you’re a mid-size user if you have 6 to 12 data scientists, and it takes more than 12 to be a larger user.  And that’s not counting the dedicated data engineers, IT, and analysts also assigned to the task.  So it’s certainly the large users and probably many of the mid-size users we’re addressing.

For a while I’ve been collecting what I call Cliff Notes for Managing Data Science to address this group.  Here’s the first installment.


Do you really want an AI strategy?

I’ll try to keep this short because this topic tends to set me off.  The popular press and many of the platform and application vendors have only recently started calling everything in advanced analytics “Artificial Intelligence”.  Not only is this inaccurate, it makes the conversation much more difficult.

First of all, if you’ve already got a dozen data scientists then you are firmly in the camp of machine learning / predictive analytics.  Machine learning is much more mature and more broadly useful than AI alone (which itself uses a narrow group of machine learning techniques).  So good for you.  Keep up the good work.  But the fact that it helps humans make decisions does not make what you have been doing so far AI.

Modern AI is the outcome of deep neural nets and reinforcement learning.  AI involves recognition and response to text, voice, image, and video.  It also encompasses automated and autonomous vehicles, game play, and the examination of ultra-large data sets to identify very rare events.  This area is quite new and only the text, voice, image, and video capabilities are ready for commercial deployment.

Technically, if you have deployed a chatbot anywhere in your organization you are utilizing AI.  Chatbot input, and sometimes output, is based on NLU (natural language understanding), one of the good applications of deep learning.  Chances are that if you do not have at least one chatbot today, you will have one within a year.  Chatbots are a great way to engage with customers and save money.  They are not whiz-bang solutions.  This is what most of AI is going to look like when deployed.

By all means begin the conversation about where modern AI may be of value in your strategy but don’t oversell this yet.  The real money is in what you’ve been doing right along with predictive and prescriptive analytics, IoT, and the other well developed machine learning technologies.


Should your data scientists be centralized or decentralized?

There are two schools of thought here and the deciding factor is probably how many data scientists you have.  One school of thought is that you should embed them closest to where the action is: in marketing, sales, finance, manufacturing.  You name it, every process can benefit from advanced analytics.  They will learn the unique perspective, language, problems, and data of that process, which will make them more effective.

On the other hand, the average data scientist has had that title only 2 ½ years.  That just shows you how fast we’re starting to graduate new ones and how rapidly they’re getting snapped up.  What it means is that you probably have a few relatively experienced data scientists who have been around the block and a larger number of juniors who are just getting started.

The juniors should have come with a very impressive set of technical skills and in theory can contribute to any data science problem.  The reality is that the juniors and the seniors as well need to keep learning by experience, not to mention having time to catch up with the new techniques that are being introduced all the time.  So the goal will be to have enough contact time between the seniors and the juniors so that everyone continues to develop. 

If you’ve got a half-dozen with various experience levels working together that’s probably OK.  However, one interesting model is a hybrid that brings all your data scientists together on a fairly regular schedule so they can share experiences and learning.

Another possible implementation would be to have a few seniors deployed out in each end user organization with the juniors on rotating assignment to assist.

Spread them too thin and you won’t benefit from their growth.  It may also cause them to leave for greener pastures.


Should every data science project have an ROI?

The further you go up the chain of command, the more senior management will say ‘of course, this is our most basic concept’.  And that’s not necessarily bad but it needs to come with some balance.

It will be many years, if ever, before you fully exploit all the data and all the analytics that could create competitive advantage.  And many of those applications haven’t even been conceived today.

One type of financial discipline that is completely appropriate is pre-establishing time budgets or measures that tell you when the solution is good enough.  This is particularly true in all types of customer behavior modeling. 
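As a rough illustration of what a pre-established "good enough" rule might look like, here is a minimal Python sketch.  Everything in it is hypothetical: the function name, the target score, and the time budget are illustrative assumptions, not a prescribed method.  The idea is simply that the stopping conditions are agreed before the modeling work starts.

```python
import time


def tune_until_good_enough(candidates, evaluate, target_score, time_budget_s,
                           clock=time.monotonic):
    """Try model candidates in order and stop as soon as one rule fires.

    Hypothetical helper: 'candidates' is any iterable of model settings and
    'evaluate' returns a score where higher is better.
    Returns (best_candidate, best_score, reason_for_stopping).
    """
    start = clock()
    best_candidate, best_score = None, float("-inf")
    for candidate in candidates:
        # Rule 1: the agreed time budget has been used up.
        if clock() - start > time_budget_s:
            return best_candidate, best_score, "time budget exhausted"
        score = evaluate(candidate)
        if score > best_score:
            best_candidate, best_score = candidate, score
        # Rule 2: the model is already good enough for the business need.
        if best_score >= target_score:
            return best_candidate, best_score, "target reached"
    return best_candidate, best_score, "candidates exhausted"


# Made-up scores standing in for real model evaluations.
fake_scores = {"baseline": 0.70, "more_features": 0.78,
               "tuned": 0.81, "ensemble": 0.83}
result = tune_until_good_enough(fake_scores, fake_scores.get,
                                target_score=0.80, time_budget_s=60)
print(result)
```

In this made-up run the search stops at the third candidate because it clears the agreed target, so the fourth and most elaborate model is never built.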

If your data science team has a built-in bias that is a potential weakness, it is that they will always want to keep working to make those models better, even if the time would be better spent on other data science projects.  This scheduling discipline requires an understanding of exactly how the work is done, and that most likely belongs to your Chief Data Scientist.

Incidentally, that Chief Data Scientist should also be regularly evaluating and recommending platforms and techniques to make the group more efficient, particularly in the fast emerging area of automated machine learning.

HOWEVER, there needs to be an opportunity for discovery.  This is a little like an engineering lab where your data scientists need a little formally allocated time to ‘go in there and find something interesting’.  Give them a little unstructured time to explore. 

The most interesting phrase in science is not ‘Eureka’, but ‘that’s funny’ (Isaac Asimov).


Should you keep all that data or only what you need?

This question is very closely related to the one above about ROI.  Our ordinary instinct is to keep only what we need.  However, there is a strong school of thought among data scientists that data is now so inexpensive to store that we should keep it all and figure out how to benefit from it later.

The opposite school starts with ‘what is the problem we are trying to solve’ and works backwards to the data necessary to achieve that.  This is also the school that says that once we have achieved X% accuracy on this question, that is sufficient, and any data not necessary to support that should be discarded.

Well, the problem is that all that data that doesn’t appear to be predictive most assuredly contains pockets of outliers and pockets of really interesting new opportunities.  You may not have the manpower to dig into it today, but that’s also why we argued for giving your data science team some self-directed time to ‘find something interesting’.

Storing new data in a cloud data lake where data scientists can explore it is ridiculously cheap (but not free).  Where you need discipline is when you operationalize your new insight and it becomes mission critical.  Then you need the full weight of good data management, provenance control, bias elimination, and the proverbial single definition of the truth.


What about Citizen Data Scientists and the democratization of analytics?


All data science projects are team efforts, and those teams consist of data scientists, line-of-business (LOB) subject matter experts (SMEs), and probably some analysts and folks from IT.  As these non-data-scientist team members gain experience, of course they become more valuable.  In particular, they are increasingly able to restate business problems as data science problems that they can bring to the table.

What really scares me and should scare you too are statements in the press to the effect that AI and machine learning have become so user friendly that they are “only a little more complex than word processors or spreadsheets”, or “users no longer need to code”.

That may be very narrowly true but that does not mean you should ever begin a data science project without including an adequate number of formally trained data scientists.

It is true that our advanced analytic platforms are becoming easier to use.  The benefit is that fewer data scientists can do the work that used to require many more.  It does not mean that citizen data scientists, no matter how well intentioned, should be given control over these projects. 

The slick visual user interfaces in many analytic applications hide many critical considerations that a data scientist (DS) will know and a citizen data scientist (CDS) will not.  The full list is much too long to give here, but examples include false positive/false negative threshold cost tradeoffs, best algorithm selection, the creation of new features, hyperparameter adjustment of those algorithms, and bias detection.  This is no self-driving car.
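To make one of those hidden considerations concrete, here is a minimal Python sketch of a false positive/false negative cost tradeoff.  The scores, labels, and dollar-style costs are all made up for illustration, and the two helper functions are hypothetical, not taken from any particular platform.  The point is only that the "best" cutoff on a model score depends on what each kind of mistake costs the business, which a default setting in a visual tool will not know.

```python
def expected_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total cost of flagging every case scored at or above the threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 0:
            cost += cost_fp   # false positive: acted on a case that was fine
        elif not flagged and label == 1:
            cost += cost_fn   # false negative: missed a real case
    return cost


def best_threshold(scores, labels, cost_fp, cost_fn):
    """Pick the cheapest cutoff from a coarse grid (0.00, 0.05, ..., 1.00)."""
    candidates = [i / 20 for i in range(21)]
    return min(candidates,
               key=lambda t: expected_cost(scores, labels, t, cost_fp, cost_fn))


# Made-up model scores and true outcomes (1 = the event really happened).
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

# Same model, two different business situations.
cutoff_when_misses_hurt = best_threshold(scores, labels, cost_fp=1, cost_fn=10)
cutoff_when_false_alarms_hurt = best_threshold(scores, labels, cost_fp=10, cost_fn=1)
print(cutoff_when_misses_hurt, cutoff_when_false_alarms_hurt)
```

With this toy data the preferred cutoff moves from a low value when missed cases are expensive to a much higher one when false alarms are expensive, even though the model itself has not changed.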

What we recommend is indeed getting those analysts and LOB managers deeply involved in the team process.  That will allow them to spot new opportunities.  If you want to empower your organization, start by actively educating for data literacy.  Leave actually driving the data science to the data scientists.


Cybersecurity may be what forces your hand into true AI

Whether you are dealing with cybersecurity in-house or contracting out, this is the first place you should be sure that there is real AI at work.  It turns out that the deep learning techniques at the core of modern AI are particularly good at spotting anomalies and threats. If you want to front load your AI strategy this is the best place to start.  As the Marines say, don’t bring a knife to a gun fight.



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.  He can be reached at:
