Data Science

Interesting article

Came across this interesting article on the internet

How to build an in-house data science team (without a data scientist)

Technology tends to favour teams that have a firm grasp on data and analytics, and you can create one from scratch

While at CES representing my company 12 Labs, I noticed two very clear patterns. The first: The virtual/augmented reality economy is going to be as big as the app economy, if not larger. Even single-man VR startups had 10-minute waits to try out their demos. The second: There were surprisingly few artificial intelligence or data science companies giving demonstrations. This observation didn’t quite reconcile with what the brightest minds in tech — think Elon Musk, Bill Gates — have predicted about AI finally being here. So where are the companies who are going to bring AI onto the field? (Microsoft and Facebook can’t do it alone.)

In my view, these companies aren’t around yet because, although computational power has reached a stage where AI can be harnessed for large-scale impact, data scientists are a rare find and are in high demand. They’re rare because data science requires a blend of skills: statistical modelling, programming, and business savvy. It’s hard enough to find people who are experts in any one of these individually, let alone someone who has mastered all three.

How to take advantage of data science (without hiring a data scientist)

There are two main approaches for how you can spice up your current projects with data science. One is to license data science technology from another company. The problem with this approach is that AI technology has not sufficiently matured to be generic enough for broad use, though there are a few startups doing great work in this field.

 

The second approach is what we adopted at 12 Labs. It was to build an in-house data science team. When we started, we didn’t have any data scientists. But within a year, we built several data science products that even established companies in the fitness space have not been able to build.

Here are the steps we took to build our from-scratch data science team:

1. Find an engineer who is a hustler with good product sense. Aspiring product managers on your team might be a good fit, as such people frequently develop product sense while still working as engineers.

2. Hire someone with a statistics background. I don’t mean a data scientist. You need a pure statistician, because your engineer will need pointers from someone who is strongly grounded in statistics and machine learning. As I understand it, even a strong data scientist can’t tell you which approach or algorithm is the right one without trying several first. So, if you have someone willing to try out several approaches (in this case, the engineer above), the statistician can point them in the right direction (see the sketch after this list).

3. Use Python. There are many online courses about machine learning and data science in several other languages such as R, Octave, etc. However, Python has the most vibrant data science community, with a large number of open-source libraries. If you use Python, you’ll future-proof yourself against any closed platform traps.

4. Don’t wait for perfection. The quickest way to make your product perfect is to ship it. When we first launched our meal recommendation technology, it was far from production grade, but with the data we had, it was the best we could do. We could have waited until we had collected more user data, but we decided to ship sooner because users are often more tolerant of something that is novel and cool. Gradually, our meal recommendation technology has become state of the art.
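To make point 2 above concrete, here is a hedged sketch of what “trying several approaches” can look like in practice. The dataset (scikit-learn’s built-in breast cancer data) and the three candidate models are placeholders for illustration, not the actual 12 Labs pipeline.

# Illustrative only: placeholder dataset and models, not the 12 Labs pipeline
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    'logistic regression': LogisticRegression(max_iter=5000),
    'random forest': RandomForestClassifier(n_estimators=200, random_state=0),
    'SVM (RBF kernel)': SVC(kernel='rbf'),
}

# The engineer tries each candidate; the statistician sanity-checks the metric,
# the validation scheme and the eventual winner.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, 'mean accuracy:', round(scores.mean(), 3))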


Data science is here to stay and it is going to disrupt every field. It’s just a matter of time before your competitors adopt it, and then you’ll be left behind because technology usually favours the first to move on it. So, it’s imperative for startups to make a move sooner rather than later. Luckily, you can start even without a data scientist.

Ashu Dubey is a product guy, hacker and data scientist all rolled into one. He is the co-founder of 12 Labs’ Applause, a data science powered weight loss application. At 12 Labs, he crunches data to help Applause users lose weight smartly and scientifically. During his undergrad career, Ashu launched a successful non-profit, Fast Forward India, to help underprivileged children gain computer literacy. He loves hiking and running. He has an MBA from UCLA and B.Tech degree from Indian School of Mines.

The Young Entrepreneur Council (YEC) is an invite-only organisation comprising the world’s most promising young entrepreneurs. In partnership with Citi, YEC recently launched BusinessCollective, a free virtual mentorship programme that helps millions of entrepreneurs start and grow businesses.

Article source:

https://e27.co/build-house-data-science-team-without-data-scientist-20170522/

Data Science

Some interesting stuff on Quora

I read this question on Quora.  The answers were as interesting as the question:

I am currently learning data science in Python on DataCamp and working with data sets on Kaggle, but I am afraid I won’t get a job. What more can I do?

I am also proficient in Excel and SQL, but I feel that I’m still inadequate when I look at job descriptions for data analysts or data scientists.

Answer 1

Great to hear that you’re learning Python on DataCamp! You seem to be doing a really good job of learning data science by also working with data sets on your own, and you say that you’re proficient in Excel and SQL, so that all sounds great!

The information you provided leaves me wondering what other skills or talents you see in job descriptions, especially for data analysts. What skills are you still lacking? Please feel free to comment below on what other skills they are asking for, because I’d like to know! From what I have seen in practice and from researching job openings, your skills match those of a data analyst perfectly.

I can well understand that you don’t feel adequate to go for a data scientist position: those usually require years and years of experience in two or three programming languages, a master’s or a PhD in a technical field, etc. The list seems never-ending, especially when you’re working towards becoming a data scientist.

I can relate to this heavily. But I also want to share something I first heard when I told somebody in the industry that I wanted a job as a data scientist. He told me that I shouldn’t move too fast, and that data scientists need to understand what happens to the data before it gets to them. This means they should have a solid understanding of how the data enters, of ETL, of ODSs, of database structures, of how reporting on the data works, and so on. You probably know this whole flow from university courses. And when I tried to counter and tell him that I had seen all of this at university, he told me that he meant real business cases and problems, which can be very different from what they teach you at university.

I was very disappointed when I heard this, but at the same time, it also clarified a lot for me: how can you provide insights to any company if you aren’t aware of where the data comes from, what happens to it and how it gets to you?

I’m not sure if my point comes across, but from my experience as a junior big data engineer, I can confirm that I have learned a lot of new things, and I realise that I wouldn’t have fit in if I had opted for a job as a data scientist. Additionally, I think it’s very important to remember that becoming a data scientist is, for most people, something they need to work towards. Most start out as data engineers or data analysts, build up experience, and then move on to becoming a data scientist. See it as a more senior position, if you will.

Like I said, you’re off to a great start: you definitely qualify for a data analyst job and can work slowly towards becoming a data scientist. That will require you to keep working on your data science skills and to engage with the data science community (networking is very important!).

Don’t get discouraged. Just do it! 🙂

PS. If you want to see what my definition of “data analyst” or “data scientist” is, check out The Data Science Industry: Who Does What (Infographic)

Answer 2

If you are looking for a job, you need to make sure that potential employers looking for you can find you. The best way to do that is to create a stellar LinkedIn and GitHub account: LinkedIn to connect with data geeks and share your work with the wider community, and GitHub specifically to share your code. I wrote these two articles on how to do this. Check them out.

How to use LinkedIn to land your Dream Job?

How to use Github to land a Job?

Last but not least: start experimenting. Learn how to scrape data from the internet; Beautiful Soup in Python is the tool to reach for. Data-scraping skills give you access to almost any data on the internet, so think of an interesting idea and execute it. This will take you to the next level.

Back in my day, I used to scrape property-price data from big portals, clean it (it’s a headache, but you will learn it 🙂), and make dashboards to see how average property prices were changing in real time. It was so awesome.

I still haven’t completed my final vision of the project, and I now have three more guys working on it with me. But my experience of doing this work landed me the job of my dreams, so mate, you’ve got to do it.
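For anyone who wants a starting point, here is a minimal sketch of the kind of scraping workflow described above, using requests and Beautiful Soup. The URL and the CSS class names are hypothetical placeholders; a real portal will have its own markup, and its terms of use should be checked before scraping.

# Hypothetical sketch: the URL and class names are placeholders, not a real portal
import requests
from bs4 import BeautifulSoup

URL = 'https://example.com/property-listings'  # placeholder URL

response = requests.get(URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

listings = []
for card in soup.find_all('div', class_='listing'):  # hypothetical class name
    title = card.find('h2')
    price = card.find('span', class_='price')         # hypothetical class name
    if title and price:
        listings.append({'title': title.get_text(strip=True),
                         'price': price.get_text(strip=True)})

print('Scraped', len(listings), 'listings')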

Answer 3

You’re going to need to be good to get a job.

Really good, not just passingly familiar.

I’ve had to interview plenty of candidates. There are tons of smart guys and decent programmers, but most of them just don’t have the depth and breadth to be useful as a data scientist.

Most people who aspire to be data scientists would be worse than useless in a work setting. A bad scientist literally does more harm than good. And I literally mean literally.

If you’re not highly skilled, you’re only going to get a very junior position. If you’re lucky, you might score a DS position at a company where nobody knows what they’re doing.

You need to be a quick and proficient programmer, and you need to be very confident in the maths. You also need experience with a wide variety of datasets, as well as experience working in a production setting.

If you don’t have all this, just keep working, keep training models, keep reading and learning, and keep working whatever programming and hopefully also data science related jobs you can until you no longer do more harm than good 😉

The full link to this Quora question is here. Other interesting links are repeated below:

https://www.quora.com/I-am-currently-learning-data-science-in-Python-on-DataCamp-and-working-with-data-sets-on-Kaggle-but-I-am-afraid-I-won%E2%80%99t-get-a-job-What-more-can-I-do 

How to use LinkedIn to land your Dream Job?

How to use Github to land a Job?

The Data Science Industry: Who Does What (Infographic)

Data Science

More to see in a scatter chart

[Scatter chart: GDP per capita vs life expectancy, coloured by continent, with opacity and labels for India and China]

In the chart above you can notice three things: countries on different continents have different colours, there is a certain opacity where the blobs overlap, and India and China have their names displayed prominently next to their blobs.

# Assumes gdp_cap, life_exp, pop and col (one colour per country) are lists
# loaded from the dataset used throughout these examples
import matplotlib.pyplot as plt
import numpy as np

# Scatter plot: bubble size from population, one colour per continent, alpha for opacity
plt.scatter(x=gdp_cap, y=life_exp, s=np.array(pop) * 2, c=col, alpha=0.8)

# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000, 10000, 100000], ['1k', '10k', '100k'])

# Additional customizations: label India and China next to their blobs
plt.text(1550, 71, 'India')
plt.text(5700, 80, 'China')

# Add grid() call
plt.grid(True)

# Show the plot
plt.show()


The alpha parameter within the plt.scatter function controls the opacity, the plt.text calls add the labels for India and China, and plt.grid(True) turns on the gridlines.

Uncategorized

Histogram and highlight lines

Here’s an interesting representation of a histogram with a highlight line passing right through it.

[Histogram with a vertical highlight line (axvline) at the mean]

Notice how the plt.axvline has been coded:

plt.axvline(x=dss_exp.mean(), linewidth=2, color='r')

This draws the mean of dss_exp as a red vertical line across the blue histogram.

Double highlight lines can be added like this:

[Histogram with two vertical highlight lines]

Here the highlight lines are drawn at the 5th and 95th percentile marks.
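The post does not include the full code for this chart, so here is a minimal self-contained sketch on synthetic data (the dss_exp series used above isn’t available here) that draws a histogram, the mean as a red line, and two more highlight lines at the 5th and 95th percentiles.

# Minimal sketch on synthetic data; this dss_exp is a stand-in for the series used above
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
dss_exp = rng.normal(loc=50, scale=10, size=1000)

# Histogram with the mean as a red vertical line, as in the snippet above
plt.hist(dss_exp, bins=30)
plt.axvline(x=dss_exp.mean(), linewidth=2, color='r')

# Two more highlight lines at the 5th and 95th percentiles
plt.axvline(x=np.percentile(dss_exp, 5), linewidth=2, color='g', linestyle='--')
plt.axvline(x=np.percentile(dss_exp, 95), linewidth=2, color='g', linestyle='--')

plt.show()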

Data Science

How about a 3D scatter chart

What if you have three variables, Life Expectancy, GDP and Population, and want to see how they behave with each other?

[Bubble scatter chart: GDP per capita vs life expectancy, bubble size proportional to population]

So the blue blobs show how large the population is in some countries, with a certain overlap between life expectancy and GDP. The bigger the blob, the bigger the population. We achieve this by passing a numpy array of the third variable as the size argument of the scatter chart.

Here the population (pop) is the 3rd variable:

# Assumes gdp_cap, life_exp and pop are lists loaded from the dataset
import matplotlib.pyplot as plt

# Import numpy as np
import numpy as np

# Store pop as a numpy array: np_pop
np_pop = np.array(pop)

# Update: set s argument to np_pop so bubble size tracks population
plt.scatter(gdp_cap, life_exp, s=np_pop)

# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000, 10000, 100000], ['1k', '10k', '100k'])

# Display the plot
plt.show()

 

Data Science

Usage of logarithm in Scatter graphs

The scatter charts above plot Life Expectancy vs Per Capita GDP for a number of countries. If you notice, the graph on the left looks cleaner than the one on the right. The reason is that on the left the x-axis data is plotted on a logarithmic scale. Also, if you look at the ticks on the x-axis, the ones on the left look neater than the ones on the right.

This is achieved using the plt.xscale('log') call. Clearly, as you can see, there is really no relationship between GDP and life expectancy.

The entire piece of code is given here:

# Assumes gdp_cap and life_exp are lists loaded from the dataset
import matplotlib.pyplot as plt

# Basic scatter plot, log scale
plt.scatter(gdp_cap, life_exp)
plt.xscale('log')

# Strings
xlab = 'GDP per Capita [in USD]'
ylab = 'Life Expectancy [in years]'
title = 'World Development in 2007'

# Add axis labels
plt.xlabel(xlab)
plt.ylabel(ylab)

# Add title
plt.title(title)

plt.show()

 

Data Science

Datacamp, Microsoft Data Science Cert and UpXAcademy

I don’t carry my personal laptop around much, so I need somewhere online to practice my code on the fly. I’ve been looking for a cloud-based Python IDE for some time. I stumbled on the DataCamp site, which not only walks you through coding but also provides a code-and-execution environment where you can try things out.

It’s really cool. I need to check whether they also have the capability to save and integrate code on GitHub.

Some learning on the usage of line charts, with focus on the portion that extends the data and renames the y-axis ticks:

import matplotlib.pyplot as plt

year = [1950, 1951, 1952, …, 2100]
pop = [2.538, 2.57, 2.62, …, 10.85]

# Add more data
year = [1800, 1850, 1900] + year
pop = [1.0, 1.262, 1.650] + pop

plt.plot(year, pop)
plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')

# Rename y ticks
plt.yticks([0, 2, 4, 6, 8, 10],
           ['0', '2B', '4B', '6B', '8B', '10B'])

plt.show()

 

The highlighted portion (the two lines that prepend extra data and the plt.yticks call) gives you two simple things:

a. Extending the data range beyond what you may already have in an input dataset.

b. Renaming the y-axis ticks while still retaining the numeric positions at which they are placed.

 

Also came across an industry-recognised certification in Data Science offered by Microsoft. Here is what they seem to cover: Microsoft Data Science.

The course content seems to be limited, whereas the UpX Academy 6-month programme certification is exhaustive:

• Module 1: Data Science Introduction & Use Cases
  • Lesson 1.1: Fundamentals, Use cases
• Module 2: Python Basics
  • Lesson 2.1: Basic Syntax
  • Lesson 2.2: Data Structures
• Module 3: Python Basics
  • Lesson 3.1: Loops
  • Lesson 3.2: If-elif statements
  • Lesson 3.3: Functions
  • Lesson 3.4: Exception Handling
• Module 4: Statistics 1
  • Lesson 4.1: Measures of central tendency
  • Lesson 4.2: Population
  • Lesson 4.3: Sample, Probability Distribution
• Module 5: Statistics 1
  • Lesson 5.1: Normal and Binomial Distribution
  • Lesson 5.2: Random Variable
  • Lesson 5.3: Pictorial Representations
• Module 6: Python Advanced
  • Lesson 6.1: Numpy
  • Lesson 6.2: Pandas
• Module 7: Python Advanced
  • Lesson 7.1: Data Manipulation
  • Lesson 7.2: Matplotlib
• Module 8: Exploratory Data Analysis
  • Lesson 8.1: Data Cleaning
  • Lesson 8.2: Data Wrangling
• Module 9: Exploratory Data Analysis
  • Lesson 9.1: Data Visualisation
• Module 10: Exploratory Data Analysis
  • Lesson 10.1: Case Study
• Module 11: Introduction to Tableau
• Module 12: Data visualisation
• Module 13: Analytics concepts with Statistics – I
• Module 14: Analytics concepts with Statistics – II
• Module 15: Analytics concepts using calculated fields
• Module 16: Analytics concepts for integrating dashboards
• Module 17: Mini project workshop – Visual Analytics
• Module 18: Integration of Tableau with Python
• Module 19: ML Introduction & Use Cases
  • Lesson 19.1: ML Intro
  • Lesson 19.2: Fundamentals
  • Lesson 19.3: Use Cases
• Module 20: Statistics 2 – Inferential Statistics
• Module 21: Linear Regression
• Module 22: Logistic Regression
• Module 23: Decision Trees, Random Forest
• Module 24: Modelling Techniques (PCA, Feature Engineering)
• Module 25: KNN, Naive Bayes
• Module 26: Support Vector Machines (SVM)
• Module 27: Clustering, K-means
• Module 28: Time Series Modelling
• Module 29: Market Basket Analysis & Apriori Algorithm
• Module 30: Recommendation System
• Module 31: Recommendation System – Mini Project
• Module 32: Dimensionality Reduction (LDA, SVD)
• Module 33: Dimensionality Reduction (Matrix optimisation)
• Module 34: Anomaly Detection
• Module 35: XG Boost
• Module 36: Gradient Boosting Machine (GBM)
• Module 37: Stochastic Gradient Descent (SGD)
• Module 38: Ensemble Learning – I
• Module 39: Ensemble Learning – II
• Module 40: Introduction to Neural Networks
• Module 41: Introduction to NLP & Deep Learning
• Module 42: Word Embeddings
• Module 43: Word window classification
• Module 44: Introduction to Artificial Neural Networks
• Module 45: Introduction to Tensorflow
• Module 46: Recurrent Neural Networks for Language modelling
• Module 47: Gated Recurrent Units (GRUs), LSTMs
• Module 48: Recursive Neural Network
• Module 49: Convolutional Neural Networks for sentence classification
• Module 50: Dynamic Memory Networks

(Lessons for Modules 11 to 17 are not listed on the site, and for Modules 18 and 20 to 50 the syllabus pages say the lessons will be updated soon.)