7 DIY Data Science Project Ideas Using Your Personal Data

Kartik Godawat
6 min readJun 11, 2021

As we get into the field of data science, we often struggle to find data that is both structured and interesting. If we pick a project on a dataset that doesn’t speak to us personally, it is easy to abandon it midway or finish it half-heartedly. The best projects are the ones that keep you engaged, inspired and motivated through the entire project life cycle.

I generally feel more excited and enthusiastic to share personal-data projects with friends and family, even something as simple as a web app that detects the faces of people I know, than to report X accuracy on Y dataset. This simple realization keeps the creative juices flowing and nudges me towards output that is more meaningful, impactful and resonant than an accuracy benchmark on a well-worn dataset. As an added benefit, projects like these make for far more interesting conversations in a job interview!

In this post, I’ll list some sources from which you can export your own personal usage data. A few of the exports here are specific to India, but similar applications with comparable exports should exist in other locations.

A note on data cleaning before we start:


As beginners, we tend to reach for the cleanest data available and avoid getting our hands dirty with scripts or shell commands. Most datasets we find on Kaggle, and benchmark datasets like ImageNet, are already clean enough to use directly. In my view, however, I learn better when I pick a nearly-clean dataset and work on it until it is clean enough for further use. So don’t be afraid to get your hands dirty, and don’t be afraid to make mistakes. The insights and learning gained this way far outweigh the gains from always working with structured, well-labelled, clean data.
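To make this concrete, here is a minimal cleaning sketch in pandas, assuming a tabular export already loaded into a DataFrame (the column name `time` is hypothetical — adapt it to whatever your export actually contains):

```python
import pandas as pd

def clean_export(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleanup for a raw personal-data export."""
    df = df.copy()
    # Normalize column names: lowercase, underscores instead of spaces
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Strip stray whitespace from string cells
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    # Parse any timestamp-like column; unparseable values become NaT
    if "time" in df.columns:
        df["time"] = pd.to_datetime(df["time"], errors="coerce")
    # Drop exact duplicate rows, which exports often contain
    return df.drop_duplicates().reset_index(drop=True)
```

A few lines like these are often all it takes to turn a messy export into something you can actually group, plot and model.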

1. Browser History

Type: Tabular, Time-series, NLP

My browser search-pattern word cloud in 2018

Almost all browsers provide an option to download your historical data. Google Chrome users get a bonus: most of the data is backed up in the cloud and is available to download at Google TakeOut, a service that offers data exports for all Google products. Here’s my attempt at analyzing browser history.
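As a starting point, here is a sketch that counts search terms from a Chrome TakeOut export (it assumes the `BrowserHistory.json` layout with a top-level `"Browser History"` list — verify against your own export before relying on it):

```python
import json
from collections import Counter
from urllib.parse import urlparse, parse_qs

def search_terms(history_json: str) -> Counter:
    """Count Google search-query words in a Chrome TakeOut BrowserHistory.json dump."""
    entries = json.loads(history_json).get("Browser History", [])
    terms = Counter()
    for entry in entries:
        parsed = urlparse(entry.get("url", ""))
        # Google searches carry the query in the `q` parameter
        if "google" in parsed.netloc and parsed.path == "/search":
            for q in parse_qs(parsed.query).get("q", []):
                terms.update(q.lower().split())
    return terms
```

Feed the resulting counter into a word-cloud library and you get a picture like the one above.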

Example: browser history row

2. YouTube

Type: Tabular, Time-series, NLP, Video/Vision, Audio

Sample youtube history row

YouTube watch history can be downloaded separately from Google TakeOut. It comes with many interesting columns such as ratings, duration, likes and views. It would be interesting to see how your viewing interests have been shaped over time. Here is an example of YouTube history being used for data analysis.
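For instance, a quick view of which channels dominate your history — this sketch assumes the `watch-history.json` layout from TakeOut, where the channel name sits under `subtitles[0].name` (entries without a channel, such as ads, are skipped):

```python
import json
from collections import Counter

def videos_per_channel(watch_history_json: str) -> Counter:
    """Count watched videos per channel from TakeOut's watch-history.json."""
    counts = Counter()
    for entry in json.loads(watch_history_json):
        subtitles = entry.get("subtitles") or []
        if subtitles:
            counts[subtitles[0].get("name", "unknown")] += 1
    return counts
```

Bucketing the same entries by month instead of channel shows how your interests drift over time.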

3. Linkedin

Type: Tabular, Time-series, NLP

From connections and articles to your entire account history, LinkedIn data can be exported for an insightful analysis of your professional networking patterns. Or take a trip down memory lane through the jobs you’ve applied to. The most common observation here is the cluster of companies you’re connected to. Here’s an example starter script.
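Finding that company cluster can be a one-liner once the export is loaded. This sketch assumes the `Connections.csv` file has a `Company` column and a few preamble lines before the header (recent exports do, but check yours and adjust `skiprows`):

```python
import pandas as pd

def top_companies(path: str, n: int = 5) -> pd.Series:
    """Most common companies among your LinkedIn connections.
    Assumes Connections.csv with a 'Company' column; recent exports
    prepend a few notes lines before the header, hence skiprows."""
    df = pd.read_csv(path, skiprows=3)
    return df["Company"].dropna().value_counts().head(n)
```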

4. Music: Spotify/Apple

Type: Tabular, Time-series, NLP, Audio

Anything and everything you’ve listened to can be downloaded in a convenient format and shifted to tabular with a bit of preprocessing. While the history data itself provides many dimensions to analyze and inspect, audio and video services let dataset creation go a step further. For example, the lyrics of the songs you listen to could be downloaded and analyzed. Going a step beyond analysis, we could also attempt content generation in the audio/video/image/text domain by training a model on a personalized dataset. Here’s an example notebook to get started.

5. Whatsapp

Type: NLP, Vision, Chatbot

WhatsApp chat exports are available on both iOS and Android devices. Here’s a starter analysis notebook. Emojis and plain text can both be analyzed and introspected upon. You couldn’t ask for a better dataset for a real-world chatbot — one that could possibly mimic you and your vocabulary.
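The export is a plain text file, one line per message, which a small parser can turn into records. The sketch below assumes one common Android layout (`12/06/21, 10:15 - Alice: hello`); the date format varies by device and locale, so adjust the regex to match your own export:

```python
import re

# One common Android export layout; tweak for your locale.
LINE = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}) - ([^:]+): (.*)$")

def parse_chat(text: str) -> list[dict]:
    """Parse a WhatsApp export into (date, time, sender, message) records,
    folding multi-line messages into the previous record."""
    messages = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            date, time, sender, body = m.groups()
            messages.append({"date": date, "time": time, "sender": sender, "message": body})
        elif messages:
            # Continuation line of the previous message
            messages[-1]["message"] += "\n" + line
    return messages
```

From these records, per-sender message counts, emoji frequencies or chatbot training pairs are all a few lines away.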

Furthermore, I found it to be a fairly good playground for sentiment analysis and generative experiments, as I personally write my messages in “Hinglish”, a mixed Hindi-English language. Any good-enough NLP attempt here requires getting your hands dirty and fine-tuning the models, as most pretrained models are available for either Hindi or English but aren’t very effective on Hinglish.

Note that some Hinglish models are available as well (they may or may not work for your data on a case-by-case basis, but the idea is to actually attempt training your own, learn something core and fundamental, and experience the joy that comes out of it).

However, WhatsApp data is not limited to text. It also contains images of many different kinds from many different groups. With little or sometimes no manual labelling, this provides an excellent dataset for CNN-based classifiers, object detectors and other vision models. Here’s an example post doing some of it.
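Once you’ve sorted the exported images into one folder per class (say, `memes/` and `documents/` — hypothetical labels, the folder names are up to you), building the labelled file list that most vision pipelines expect is straightforward:

```python
from pathlib import Path

def labeled_images(root: str, exts=(".jpg", ".jpeg", ".png")) -> list[tuple[str, Path]]:
    """Turn a directory tree of sorted images (one sub-folder per class)
    into (label, path) pairs ready for an image-classification pipeline."""
    pairs = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix.lower() in exts:
            # The parent folder's name serves as the class label
            pairs.append((path.parent.name, path))
    return pairs
```

This folder-per-class convention is exactly what loaders like torchvision’s `ImageFolder` consume, so the same layout plugs straight into a training loop.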

6. Food delivery services

Type: NLP, Vision, OCR

Some popular delivery services in India

Ever wondered how much you spend each month on pizza? What your favorite dish is? How much delivery fee you’ve paid to date?

Not all food delivery apps are generous enough to allow a smooth export of such data. One approach is to write a web scraper using Selenium. Another interesting and simpler approach is to use Google TakeOut again to export all your emails, then filter them for invoices from the vendors you choose. Follow through with a bit of OCR/NLP/vision pipelining, and the data can be converted into tabular format for analysis.
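The extraction step can start as simply as a regular expression over the email body. The pattern below is hypothetical — real invoice emails differ per vendor, so inspect a few samples from your own inbox and adjust it:

```python
import re
from typing import Optional

# Hypothetical pattern; vendors word and format their totals differently.
TOTAL = re.compile(r"(?:Grand Total|Total Paid)[:\s]*(?:Rs\.?|₹|INR)\s*([\d,]+(?:\.\d+)?)")

def extract_total(email_body: str) -> Optional[float]:
    """Pull the invoice total (in rupees) out of an order-confirmation email."""
    m = TOTAL.search(email_body)
    return float(m.group(1).replace(",", "")) if m else None
```

Run this over each filtered email, collect the results into a DataFrame alongside the email dates, and the monthly-pizza-spend question answers itself.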

7. Stock Trading exports

Type: Tabular, Time-series, Predictive

An example of a stock broker in India

Many stock brokers allow exports of granular trade-execution data. While you could be as ambitious as predicting the stock market itself, this data holds many other insights and applications. Assuming the trades were profitable, it is a perfect record of the kind of trades you like to perform. Combined with live stock data, these past transactions could potentially feed a pattern detector for you to analyze manually and possibly capitalize on.

Other, deceptively simpler analyses could reveal your preferred trading hours or days of the week, or make you aware of a shifting preference towards a particular sector, or towards large-cap versus small-cap stocks.
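The time-of-day analysis, for example, is a short groupby. This sketch assumes a `trade_time` column with parseable timestamps — real tradebook exports use broker-specific column names, so rename yours first:

```python
import pandas as pd

def trades_by_hour(tradebook: pd.DataFrame) -> pd.Series:
    """Trade counts per hour of day from a broker tradebook export."""
    times = pd.to_datetime(tradebook["trade_time"])
    # Index: hour of day (0-23); values: number of trades in that hour
    return times.dt.hour.value_counts().sort_index()
```

Swapping `dt.hour` for `dt.day_name()`, or grouping on a symbol/sector column instead, covers the day-of-week and sector-drift questions the same way.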

Some other interesting exports:

  • SMS history
  • Google Maps (location data and photos uploaded as a Local Guide) via TakeOut
  • Google Photos via TakeOut
  • Google Assistant (including audio recordings of voice commands given) via TakeOut
  • Twitter
  • Medium (bookmarks, claps and more)
  • TakeOut (Yes! Google has a lot of your data. It’s worth exploring what else is available)

That’s all folks!

Thank you for reading. I hope you enjoyed it and found something new. If I’ve made any errors, please let me know in the comments. Please share other interesting exports you come across, or build your next data-science project on one of these!

For discussions, please reach out to me on Twitter. I’ve also made a collection of such projects. Find me on Twitter or the Jovian forum if you build something that could be added to the collection.

Thanks again. Have a great day. :)
