Daily Newsletter — 7th January 2021

Hemanth Janesh
Published in Jovian, Jan 7, 2021

180 DS/ML Projects, a Real-World Dataset for Deepfake Detection and NLP datasets for ML models in today’s Data Science Daily 📰

180 Data Science and Machine Learning Projects with Python

Here’s a curated list of 180 Machine Learning Projects solved and explained in Python by TheCleverProgrammer. It includes both beginner and advanced projects.

👉 https://medium.com/coders-camp/180-data-science-and-machine-learning-projects-with-python-6191bc7b9db9

WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection

In recent years, the abuse of a face-swap technique called Deepfake has raised enormous public concern. Deepfake uses deep learning techniques to replace one person’s face in a video with someone else’s without leaving obvious traces.

So far, a large number of deepfake videos (also known as “deepfakes”) have been crafted and uploaded to the internet, which calls for the development of effective countermeasures. One promising countermeasure against deepfakes is deepfake detection. WildDeepfake is a dataset for this task; more details are in the link below.

GitHub: https://github.com/deepfakeinthewild/deepfake-in-the-wild

611 Text Datasets in 467 Languages by HuggingFace

HuggingFace, a Natural Language Processing startup, has just released v1.2 of its datasets library with:

  • 611 text datasets, each downloadable and ready to use in one line of Python,
  • 467 languages covered, 99 with at least 10 datasets,
  • efficient pre-processing to free the user from memory constraints when using very large datasets (memory-mapping by default).
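To give a feel for what memory-mapping means in the last point, here is a minimal sketch using only Python's standard `mmap` module. It is not the library's implementation (🤗Datasets builds its storage on Apache Arrow); the fixed-width record format and the helper names `write_records` and `read_record` are hypothetical, chosen only to show that a single record can be read without loading the whole file into memory.

```python
import mmap
import os
import tempfile

# Hypothetical fixed-width format: every record is exactly 16 bytes,
# so record i starts at byte offset i * RECORD_SIZE.
RECORD_SIZE = 16


def write_records(path, count):
    """Write `count` records such as b'record-00000042\\n' to `path`."""
    with open(path, "wb") as f:
        for i in range(count):
            f.write(f"record-{i:08d}\n".encode("ascii"))


def read_record(path, index):
    """Read one record through a memory map instead of reading the whole file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            start = index * RECORD_SIZE
            # Slicing the map only touches the pages that hold these bytes.
            return mapped[start:start + RECORD_SIZE].decode("ascii").rstrip("\n")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.bin")
    write_records(path, 1000)
    print(read_record(path, 742))  # record-00000742
```

The operating system pages in only the part of the file that is actually accessed, which is why this style of access scales to files larger than available RAM.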

Repository: https://github.com/huggingface/datasets

From the README.md of the repo:

🤗Datasets is a lightweight Python library providing two main features:

  • one-line data loaders for many public datasets: one-liners to download and pre-process any of the 611 public datasets (in 467 languages and dialects!) explorable and searchable here. With a command like squad_dataset = load_dataset("squad"), any of these datasets is ready to use in a data loader for training/evaluating an ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
  • efficient data pre-processing: simple, fast and reproducible data pre-processing for the above public datasets as well as local datasets in CSV/JSON/text files. With simple commands like tokenized_dataset = dataset.map(tokenize_function), a dataset is efficiently prepared for inspection, evaluation or training of a predictive model.
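The two features quoted above can be combined in a few lines. The following is a rough sketch, not taken from the README: `tokenize_function` is a toy whitespace tokenizer invented for illustration (a real pipeline would use a trained tokenizer), and `load_and_tokenize` is a hypothetical helper that assumes the library is installed (`pip install datasets`) and that network access is available. It also assumes the dataset has a `question` field, as SQuAD does.

```python
def tokenize_function(example):
    """Toy tokenizer (hypothetical): lowercase the question and split on whitespace."""
    return {"tokens": example["question"].lower().split()}


def load_and_tokenize(name="squad", split="train"):
    """Download a public dataset and add a `tokens` column to every example.

    Requires the `datasets` package and network access, so it is only
    defined here, not called.
    """
    # Imported lazily so the tokenizer above can be tried without the library.
    from datasets import load_dataset

    dataset = load_dataset(name, split=split)
    # map() calls tokenize_function on each example and merges the returned
    # dict into that example as new columns.
    return dataset.map(tokenize_function)


# Offline check of the per-example function on a hand-written record.
sample = {"question": "Which library provides one-line data loaders?"}
print(tokenize_function(sample)["tokens"][:3])  # ['which', 'library', 'provides']
```

With the library installed, calling `load_and_tokenize()` should return the SQuAD training split with an extra `tokens` column.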

Some additional links from the README: 🎓 Documentation 🕹 Colab tutorial 🔎 Find a dataset in the Hub 🌟 Add a new dataset to the Hub

Contact Us 📞

Reach out to us at community@jovian.ai to get featured here. Learn data science and machine learning with free hands-on data science courses on Jovian.

Follow us on Twitter, LinkedIn, and YouTube to stay updated.

Developer Evangelist at Jovian | Smart India Hackathon (’19 & ’20) Winner