3 Python for ML & Data Science

3.1 Introduction

Programming is an essential skill for data scientists. If you are considering a career in data science, the sooner you learn how to code, the better. Most data science jobs rely on programming to automate the cleaning and organizing of data sets, design databases, fine-tune machine learning algorithms, etc. Therefore, having some experience in programming languages such as Python, R, and SQL makes your life easier and will allow you to automate your analysis pipelines.

In this section, we will focus on Python, a general-purpose programming language that allows us to work with data and explore different algorithms and techniques that are useful additions to our analysis toolbox.

3.1.1 Why should I learn how to program?

A data scientist is a technical expert who uses mathematical and statistical techniques to manipulate, analyze, and extract patterns from raw or noisy data to produce valuable information that can help organizations make better decisions. They use a range of tools, including statistical inference, pattern recognition, machine learning, deep learning, and more, and some of their responsibilities include:

Working closely with business stakeholders to understand their goals and determine how data can be used to achieve them
Fetching information from various sources and analyzing it to get a clear understanding of how an organization performs
Undertaking data collection, preprocessing, and analysis
Building models to address business problems
Presenting information in a way that the audience can understand, using different data visualization techniques

Programming skills give data scientists the superpowers to automate these tasks. Although programming is not strictly required to be a data scientist, taking advantage of the power of computers makes it easier to manipulate, process, and analyze big datasets, to develop and automate computational algorithms that produce results faster and more effectively, and to create neat visualizations that present the data more intuitively.

3.1.2 Programming languages for data science

There are hundreds of programming languages out there, built for diverse purposes. Some are better suited for web or mobile development, others for data analysis, etc. Choosing the correct language to use will depend on your level of experience, role, and/or project goals. In the last few years, Python has been ranked as one of the top programming languages data scientists use to manipulate, process, and analyze big datasets.

But why is Python so popular? Well, I will list some reasons why data scientists love Python and what makes this language suitable for high productivity and performance in processing large amounts of data.

3.2 Why Python?

In the 2022 Stack Overflow Developer Survey, Python emerged as one of the most commonly used programming languages worldwide. Out of 71,467 responses, 68% of developers stated their love for the language and their intention to continue working with it. Additionally, around 12,000 respondents expressed their interest in learning and using Python. Python’s immense popularity stems from its simple syntax, versatility, and expressiveness. If you are considering a data science project, Python offers a range of features that you may find useful. Here is a list of features that can give you an insight into why Python may be a good choice for your next project.

Python is open source, so it is freely available to everyone. You can even use it to develop commercial applications.
Python is multi-platform. It runs on all major platforms, including Windows, Mac, Linux, and Raspberry Pi.
Python is a multi-paradigm language, which means it supports several programming styles, including object-oriented and functional programming. This lets you write code in whichever way is easiest to read and understand for the problem at hand.
Python is multi-purpose, so you can use it to develop almost any kind of application: web applications, games, data analysis, machine learning, and much more.
Python syntax is easy to read and easy to write, so the learning curve is gentle in comparison to other languages.
Data science package ecosystem: Python has the PyPI package index, a Python package repository where you can find many useful packages (TensorFlow, pandas, NumPy, etc.) that facilitate and speed up your project’s development. In PyPI, you can also publish your own packages and share them with the community. The ecosystem keeps growing fast, and big companies like Google, Facebook, and IBM contribute by adding new packages. Some of the most used libraries for data science and machine learning are:
- TensorFlow, a high-performance numerical computing library for deep learning.
- Pandas, a Python library for data analysis and manipulation.
- NumPy, a Python library for scientific computing that offers an extensive collection of mathematical functions, including linear algebra, Fourier transforms, random number generation, etc.
- Matplotlib, a Python library for plotting graphs and charts.
- Scikit-learn, a Python library for machine learning.
- Seaborn, a Python library for statistical data visualization.
Note

The Python Package Index, abbreviated as PyPI and also known as the Cheese Shop (a reference to the Monty Python’s Flying Circus sketch “Cheese Shop”), is the official third-party software repository for Python. It is analogous to the CPAN repository for Perl and to the CRAN repository for R.[1]
High performance: Although some people complain about Python’s performance (see Why Python is so slow and how to speed it up), which is mainly a consequence of features such as dynamic typing, it is also simple to extend Python with modules written in compiled languages like C or C++, which in some cases can speed up your code by as much as 100x.
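
As a rough illustration of the performance point above, the sketch below times a hand-written Python loop against the built-in sum, which CPython implements in C. The function name python_sum and the input size are illustrative choices, and the measured ratio depends on the machine; it should not be expected to match the 100x figure.

```python
import timeit

def python_sum(values):
    """Add the numbers one by one in interpreted Python code."""
    total = 0
    for v in values:
        total += v
    return total

numbers = list(range(100_000))

# Both approaches compute the same result...
assert python_sum(numbers) == sum(numbers)

# ...but the built-in runs its loop in compiled C code.
loop_time = timeit.timeit(lambda: python_sum(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)
print(f"pure Python loop: {loop_time:.4f} s")
print(f"built-in sum:     {builtin_time:.4f} s")
```

Libraries such as NumPy rely on the same idea: the heavy loops run in compiled code while you keep writing Python.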

After having a brief overview of Python, let’s move on to the next sections, where we will learn how to install Python and how to use it to perform some basic operations.

3.3 Python Installation

To check if Python is already installed on your machine, open a terminal and type the command python --version or python3 --version (note the lowercase command name). You will see the Python version if it is installed. Otherwise, you will get an error such as command not found. If you don’t have Python installed on your computer, the most straightforward way to get it is to download it from the official website. Although this is a simple process, tools such as pyenv and Anaconda enable you to run multiple versions of Python on the same machine, so you can switch between versions according to your project’s requirements. In the code examples presented in this material, we will use pyenv to manage our Python installations.

!python --version

Python 3.10.9
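
The same check can be made from inside Python, which is handy when a script needs to confirm the interpreter it is running on. This is a minimal sketch using only the standard library; the 3.8 threshold is an arbitrary value chosen for the example.

```python
import platform
import sys

# Full version string, for example "3.10.9"
version = platform.python_version()

# Path of the interpreter that is running this script
interpreter = sys.executable

print(f"Python {version} at {interpreter}")

# Fail early if the interpreter is older than the project expects
# (3.8 is an arbitrary threshold chosen for this example).
MIN_VERSION = (3, 8)
if sys.version_info < MIN_VERSION:
    raise RuntimeError(f"Python {MIN_VERSION[0]}.{MIN_VERSION[1]}+ is required")
```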

3.4 Pyenv

Pyenv is a command-line tool that enables you to have and operate multiple installations of Python on the same machine. If you come from a JavaScript background, you may find that pyenv is similar to nvm (Node Version Manager). We suggest referring to the official documentation for instructions on how to install pyenv. Alternatively, if you’re using Windows, you can use pyenv-win. However, we’ll provide a brief summary of the installation process here.

# Install pyenv
curl https://pyenv.run | bash

After installing pyenv, you can install any Python version by running the command pyenv install <version>. For example, to install Python 3.9.7, you would run pyenv install 3.9.7. You can then set the global version of Python by running pyenv global 3.9.7. You can also set the local version of Python for a specific directory by running pyenv local 3.9.7. The global version is the one used by default on your machine, while the local version is the one used in the directory where you ran the command. Pyenv will automatically switch to the local version when you enter a directory where one has been set.
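
pyenv local records the chosen version in a file named .python-version inside the directory, and pyenv looks for that file in the current directory and then in its parents. The sketch below imitates that lookup in Python to make the idea concrete. find_local_version is a hypothetical helper written for this example, not part of pyenv, and it ignores pyenv’s other mechanisms such as the PYENV_VERSION environment variable.

```python
import tempfile
from pathlib import Path
from typing import Optional

def find_local_version(start: Path) -> Optional[str]:
    """Return the version from the nearest .python-version file, if any."""
    start = start.resolve()
    for directory in [start, *start.parents]:
        marker = directory / ".python-version"
        if marker.is_file():
            return marker.read_text().strip()
    return None

# Demo in a throwaway directory so no real project is touched.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "my_project"
    nested = project / "src" / "pkg"
    nested.mkdir(parents=True)
    (project / ".python-version").write_text("3.9.7\n")

    # A nested folder inherits the version set at the project root.
    found = find_local_version(nested)
    print(found)  # 3.9.7
```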

3.4.1 Useful commands

pyenv versions: List all the versions of Python installed on your machine.
pyenv global: Show the global version of Python.
pyenv local: Show the local version of Python.
pyenv uninstall <version>: Uninstall a specific version of Python.
pyenv rehash: Rehash pyenv shims (run this command after installing a new version of Python).
pyenv version: Show the current version of Python.
pyenv which python: Show the path of the current Python executable.
pyenv which pip: Show the path of the current pip executable.
pyenv help: Show the list of available commands.
pyenv shell <version>: Set the shell version of Python.

3.5 Python Dependency hell

Well, it sounds like Python is amazing! However, if you have been using Python for a while, you may have already noticed that handling different Python installations and dependencies (packages) can be a nightmare! This issue is commonly known as dependency hell, a term for the frustration that arises from problems managing a project’s dependencies.

Dependency hell in Python often happens because, by default, all dependencies are installed into a single environment shared across projects, and because older versions of pip (before 20.3) did not have a full dependency resolver. So, other projects can be affected when a given dependency needs to be updated or uninstalled.

On top of that, since Python cannot keep different versions of the same library side by side in the site-packages directory, conflicts arise when two projects require different versions of the same library or when the global installation doesn’t match what a project needs.

Thus, having tools that enable us to isolate and manage our project’s dependencies is highly convenient.
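
To make the conflict concrete, here is a small simulation of one shared environment serving two projects. Everything in it is hypothetical: the project names, the version ranges, and the helpers parse and broken_projects exist only for this illustration and do not reflect how pip or Poetry resolve dependencies internally.

```python
def parse(version: str) -> tuple:
    """Turn '1.23.1' into (1, 23, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

# One shared environment can hold a single version of each package.
shared_site_packages = {"numpy": "1.23.1"}

# Hypothetical projects with (minimum, exclusive maximum) requirements.
requirements = {
    "project_a": {"numpy": ("1.20.0", "2.0.0")},
    "project_b": {"numpy": ("2.0.0", "3.0.0")},
}

def broken_projects(installed: dict, requirements: dict) -> list:
    """List the projects whose requirements the shared environment violates."""
    broken = []
    for project, deps in requirements.items():
        for package, (low, high) in deps.items():
            current = installed.get(package)
            if current is None or not parse(low) <= parse(current) < parse(high):
                broken.append(project)
                break
    return broken

print(broken_projects(shared_site_packages, requirements))  # ['project_b']

# Upgrading numpy for project_b simply moves the breakage to project_a.
shared_site_packages["numpy"] = "2.0.0"
print(broken_projects(shared_site_packages, requirements))  # ['project_a']
```

No single installed version satisfies both projects, which is why each project needs its own isolated set of packages.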

3.5.1 Virtual environments to the rescue!

A Python virtual environment is a separate folder where only your project’s dependencies (packages) are located. Each virtual environment has its own Python binary (which matches the version of the binary that was used to create the environment) and its own independent set of installed Python packages in its site directories. That is a very convenient way to prevent dependency hell.

Note

Python virtual environments allow environments built on different versions of Python to coexist on the same machine, so you can test your application using different Python versions. They also keep your project’s dependencies isolated, so they don’t interfere with the dependencies of other projects.

There are different tools that can be used to create Python virtual environments. In this post, I will show you how to use pyenv and poetry. However, you can also try other tools, such as virtualenv or Anaconda, and, based on your experience, choose the one you feel most comfortable with.
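
Python also ships a standard-library module, venv, that creates such environments. The sketch below uses it only to show what a virtual environment is on disk; it is an illustration under that assumption, not the workflow used in the rest of this material. The helper in_virtualenv is a name chosen for this example.

```python
import sys
import sysconfig
import tempfile
import venv
from pathlib import Path

def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix

print("inside a virtual environment:", in_virtualenv())
print("packages are installed into:", sysconfig.get_paths()["purelib"])

# Create a throwaway environment; with_pip=False keeps the demo fast and offline.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / ".venv"
    venv.create(env_dir, with_pip=False)
    # pyvenv.cfg records which base interpreter the environment was created from.
    config_text = (env_dir / "pyvenv.cfg").read_text()
    print(config_text)
```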

3.6 Poetry

Poetry is a tool that allows you to manage your project’s dependencies and facilitates the process of packaging for distribution. It resolves your project’s dependencies and makes sure that there are no conflicts between them. Poetry integrates with the PyPI package index to find and install your environment’s dependencies, and with pyenv to set your project’s Python runtime.

To install poetry we follow the steps below:

# Install poetry
curl -sSL https://install.python-poetry.org | python3 -

3.6.1 Useful commands

poetry new <project-name>: Create a new project.
poetry new <project-name> --src: Create a new project with a src directory.
poetry install: Install the project dependencies from the pyproject.toml file.
poetry add <package-name>: Add a new package to the project.
poetry remove <package-name>: Remove a package from the project.
poetry update: Update the project dependencies.
poetry run <command>: Run a command in the project’s virtual environment.
poetry shell: Activate the project’s virtual environment (in Poetry 2.x this command is provided by the poetry-plugin-shell plugin).
poetry build: Build the project.
poetry publish: Publish the project to PyPI.
poetry version: Show the current version of the project (use poetry --version to show the version of Poetry itself).
poetry help: Show the list of available commands.

If you were able to run the previous commands, we can move forward with the rest of the tutorial. Let’s now create a new project using poetry and pyenv. For this example, we will create a project called my_project and use Python 3.9.7 as the project’s Python version. We will also add the numpy package to the project’s dependencies.

Step by step: Creating a new project using poetry and pyenv

pyenv install 3.9.7 # install Python 3.9.7 on your machine

mkdir my_project # create a new directory called my_project
cd my_project # enter the my_project directory

pyenv local 3.9.7 # set the local version of python to be used in this directory
poetry config virtualenvs.in-project true # set the virtual environment to be created in the project's root
poetry init -n # create a pyproject.toml with default settings (-n skips the interactive questions)
poetry add numpy # add numpy to the project's dependencies

touch main.py # create a new file called main.py

After running the previous commands, you will have a new project with the following structure:

my_project/
    .venv/
    pyproject.toml
    poetry.lock
    main.py

Important

Note that if you want poetry to create the virtual environment (.venv) directory in the project’s root, you must change the virtualenvs.in-project setting to true by running the command poetry config virtualenvs.in-project true. This command only needs to be run once, and the setting will be defined globally for all projects.

The primary file for your poetry project is the pyproject.toml file. This file contains the necessary information about your project’s dependencies (Python packages) and also holds the required metadata for packaging, if needed. Every time a new Python package is installed, Poetry automatically updates this file. By sharing this file with others, they can recreate the project environment and run your application. To do so, they will need to have Poetry installed on their system and run the command poetry install within the same folder where the pyproject.toml file is located.

Now our pyproject.toml file looks like:

[tool.poetry]
name = "myproject"
version = "0.1.0"
description = ""
authors = ["Henry Ruiz <henry.ruiz.tamu@gmail.com>"]

[tool.poetry.dependencies]
python = "^3.9.7"
numpy = "^1.23.1"

[tool.poetry.dev-dependencies]
pytest = "^5.2"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

Let’s review the sections of that file:

[tool.poetry]: This section contains informational metadata about our package, such as the package name, description, author details, etc. Most of the config values here are optional unless you’re planning on publishing this project as an official PyPI package.
[tool.poetry.dependencies]: This section defines the dependencies of your project. Here is where you define the python packages that your project requires to run. We can update this file manually if it is needed.
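[tool.poetry.dev-dependencies]: This section defines the dev dependencies of your project. These dependencies are not required for your project to run, but they are useful for development (for example, testing tools such as pytest). Since Poetry 1.2, the preferred name for this section is [tool.poetry.group.dev.dependencies].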
[build-system]: This is rarely a section you’ll need to touch unless you upgrade your version of Poetry.
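
The caret in entries such as python = "^3.9.7" and numpy = "^1.23.1" is a version constraint: it allows updates that do not change the left-most non-zero component of the version. The sketch below spells out that rule for simple numeric versions. caret_bounds, satisfies, and _pad are hypothetical helpers written for this illustration, not Poetry functions, and they ignore pre-release tags and other constraint operators.

```python
def _pad(parts, length=3):
    """Pad version components with zeros, e.g. [5, 2] -> (5, 2, 0)."""
    return tuple(parts + [0] * (length - len(parts)))

def caret_bounds(constraint: str):
    """Return (inclusive lower, exclusive upper) bounds for a caret constraint."""
    parts = [int(p) for p in constraint.lstrip("^").split(".")]
    # Position of the left-most non-zero component (or the last one given).
    bump = next((i for i, p in enumerate(parts) if p != 0), len(parts) - 1)
    upper = parts[:bump] + [parts[bump] + 1]
    return _pad(parts), _pad(upper)

def satisfies(version: str, constraint: str) -> bool:
    """Check whether a plain numeric version falls inside the caret range."""
    lower, upper = caret_bounds(constraint)
    candidate = _pad([int(p) for p in version.split(".")])
    return lower <= candidate < upper

print(caret_bounds("^1.23.1"))         # ((1, 23, 1), (2, 0, 0))
print(caret_bounds("^5.2"))            # ((5, 2, 0), (6, 0, 0))
print(satisfies("1.26.4", "^1.23.1"))  # True
print(satisfies("2.0.0", "^1.23.1"))   # False
```

So numpy = "^1.23.1" accepts any 1.x release from 1.23.1 onwards but never 2.0.0, which protects the project from breaking changes in a new major version.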

To see your project’s dependencies in a nicer format, you can use the command poetry show --tree. This command draws a graph of all of our dependencies as well as the dependencies of our dependencies. If we are not sure whether we have the latest version of a dependency, we can tell poetry to check the package repository for a newer version by using the --latest option (poetry show --latest).
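
For a rough idea of where such a tree comes from, the standard library can list the requirements that each installed package declares. This sketch is only an analogy: it reads the metadata of whatever environment runs it and is not how Poetry builds its output. declared_dependencies is a name chosen for this example.

```python
from importlib import metadata

def declared_dependencies() -> dict:
    """Map each installed distribution to the requirements it declares."""
    result = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries with damaged metadata
            result[name] = list(dist.requires or [])
    return result

for name, requires in sorted(declared_dependencies().items()):
    print(name)
    for requirement in requires:
        print("    -", requirement)
```

Run inside the project’s environment (for example with poetry run python), it lists numpy alongside anything numpy itself depends on.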

3.7 Python Syntax

Let’s open the main.py file and write our first Python code. We will start by printing the message “Hello, World!” to the console. To do so, we will use the print function as follows:

print("Hello, World!")

Hello, World!
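
print can also combine several values. The short sketch below shows two common variations; the variable names are arbitrary.

```python
name = "World"

# f-strings embed values directly in the text
greeting = f"Hello, {name}!"
print(greeting)

# print accepts several values; sep is placed between them, end after the last
print("Hello", name, sep=", ", end="!\n")
```

Both calls print the same line, Hello, World!.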

3.7.1 Creating variables

Python is a dynamically typed language, which means that you don’t need to declare the type of a variable when you create one. The type of the variable will be determined by the value assigned to it during runtime. Python has a built-in function called type that allows you to check the type of a variable. For example, to check the type of a variable x, you would write type(x).

a = 5 # define a variable
b = 10 # define another variable
x = a + b # assign a computation result to a variable x
print(type(x)) # check the type of x
print(id(x)) # check the identity of x (its memory address in CPython)
print(x) # print the value of x

<class 'int'>
4363272880
15
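
Because the type belongs to the value rather than to the variable name, the same name can later refer to a value of a different type. A small sketch of that behaviour:

```python
x = 15
first_type = type(x).__name__    # 'int'

x = 15 / 2                       # true division produces a float
second_type = type(x).__name__   # 'float'

x = "fifteen"                    # the same name can be rebound to a string
third_type = type(x).__name__    # 'str'

print(first_type, second_type, third_type)

# isinstance is the usual way to test a value's type in real code
print(isinstance(x, str))
```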

3.7.2 Data types

Python has several built-in primitive data types, such as int, float, str, bool. In addition to these, it also has several built-in collection data types, such as list, tuple, set, and dict. We will cover these data types in more detail in the next sections.
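
As a quick preview, the sketch below creates one value of each of these types and prints the type that Python reports for it. The sample values are arbitrary.

```python
samples = {
    "int": 42,
    "float": 3.14,
    "str": "data",
    "bool": True,
    "list": [1, 2, 2, 3],         # ordered, mutable
    "tuple": (1, 2, 3),           # ordered, immutable
    "set": {1, 2, 2, 3},          # unordered, duplicates removed
    "dict": {"a": 1, "b": 2},     # key-value pairs
}

for expected, value in samples.items():
    print(f"{expected:<5} -> {type(value).__name__:<5} {value!r}")
```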

Some useful resources