Who is a data scientist?
A data scientist is simply someone who is highly adept at studying large amounts of often unorganized, undigested data. In the process of crunching that data, a data scientist often:
- 'models' the data in interesting ways,
- 'eliminates noise' and identifies canonical representative data points,
- 'studies the distribution of data based on various axes',
- 'extracts useful components/signals out of it',
- 'makes comparisons',
- 'derives relationships' and eventually,
- 'generalizes' the data model to be able to make useful statistical predictions based on what we already know.
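As a rough illustration, the steps listed above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the observations are made up, the function names (`remove_outliers`, `fit_line`, `predict`) are hypothetical, and a median-based outlier filter followed by a straight-line least-squares fit is just one simple choice of technique among many.

```python
from statistics import mean, median

# Hypothetical observations (x, y); the last point is deliberate noise.
observations = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1), (6.0, 60.0)]


def remove_outliers(points, threshold=3.0):
    """Eliminate noise: drop points whose y value is far from the median,
    measured in units of the median absolute deviation (MAD)."""
    ys = [y for _, y in points]
    centre = median(ys)
    mad = median(abs(y - centre) for y in ys)
    if mad == 0:
        return list(points)
    return [(x, y) for x, y in points if abs(y - centre) / mad <= threshold]


def fit_line(points):
    """Derive a relationship: ordinary least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


clean = remove_outliers(observations)
slope, intercept = fit_line(clean)


def predict(x):
    """Generalize: use the fitted line to estimate y for an unseen x."""
    return slope * x + intercept


print("points kept:", len(clean))
print("fitted slope and intercept:", slope, intercept)
print("prediction for x = 6:", predict(6.0))
```

Real problems would involve far more data, more than two dimensions, and a model chosen to suit the data, but the shape of the work is the same: clean, fit, then predict.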
Are there types of data scientists?
The work is broad in scope, and different problem spaces end up needing wildly different statistical data models to try out and experiment with. A data scientist should be good at all or most of these. In practice there are many different kinds of experts, some on, say, data quality, some on machine learning techniques, and so forth, but we can broadly refer to all of them as data scientists.
How do they help businesses?
A large number of companies are involved in data crunching and mining these days. Studying user activity on heavy-traffic sites in order to provide recommendations and personalized content is one example. Mining insights from social feeds is another. Understanding users' buying patterns based on multiple parameters (prices, seasonality, etc.) and making useful recommendations is a third. The number of such problems is really quite large.
What are the needed skill sets?
- Solid knowledge of several statistical modeling and learning techniques, and having the eye to identify which technique is relevant to the problem at hand (based on the nature of the data itself).
- Understanding and being able to tackle problems around data quality, clustering, dimensionality, etc.
- Being able to work with big data (on large clusters running Hadoop, Hive, etc., or on NoSQL databases) and mine relationships from it (using HiveQL, for instance).
- Programming (usually in basic languages). Knowing how to work in data-crunching languages such as 'R' would be a big plus. Being able to quickly develop scripts that can digest data would also be a valuable skill.
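To give a feel for the kind of quick script meant in the last point, here is a minimal sketch in Python. The input data, the column names (`user`, `category`, `price`) and the `summarize` function are all hypothetical; a real script would read from a file, a log stream or a database rather than from an inline string.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export of purchase events, inlined so the sketch is self-contained.
RAW_EVENTS = """user,category,price
alice,books,12.50
bob,games,59.99
alice,books,8.00
alice,games,30.00
bob,games,19.99
"""


def summarize(raw_text):
    """Digest raw CSV text into total spend and most frequent category per user."""
    totals = defaultdict(float)
    category_counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(raw_text)):
        totals[row["user"]] += float(row["price"])
        category_counts[row["user"]][row["category"]] += 1
    return {
        user: {
            "total_spend": round(totals[user], 2),
            "top_category": max(category_counts[user], key=category_counts[user].get),
        }
        for user in totals
    }


summary = summarize(RAW_EVENTS)
for user, stats in sorted(summary.items()):
    print(user, stats)
```

A per-user summary like this is typically just the first pass; it turns raw events into something that can be fed into the modeling techniques mentioned above.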
I guess a CS graduate or a mathematician/statistician who has worked on a lot of data modeling problems with large amounts of varied kinds of data (user-generated, machine-generated, time-variant, etc.) would qualify as a potential data scientist.