In the past, kings and leaders consulted oracles and magicians to help them predict the future, or at least to get good advice, thanks to their supposed power to perceive hidden information. Nowadays, we live in a society obsessed with quantifying everything, so we have data scientists to do this job.

Data scientists use statistical models, numerical techniques, and advanced algorithms that didn't necessarily come from the statistical disciplines, along with the data that already exists in databases, to find, infer, and predict data that doesn't exist yet. Sometimes this data is about the future. That is why data scientists do a lot of predictive and prescriptive analytics.

Here are some questions to which data scientists help find answers:

1. Which students are most likely to drop the class? For each one, what are the reasons for leaving?
2. Which houses are priced above or below their fair price? What is the fair price for a given house?
3. What are the hidden groups that my clients classify themselves into?
4. Which health problems might this premature baby develop?
5. How many calls will I get in my call center at 11:43 AM tomorrow?
6. Should my bank lend money to this customer?

Note how the answers to all these questions are not sitting in any database waiting to be queried. These are all data that don't exist yet and have to be calculated. That is part of the job we data scientists do.
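As a tiny illustration of "calculating data that doesn't exist yet", here is a sketch of question 2 above: a least-squares fit in plain NumPy (the houses and prices are made-up numbers, not real data):

```python
import numpy as np

# Hypothetical training data: house area in square meters and known sale prices.
area = np.array([50, 70, 90, 110, 130])
price = np.array([100_000, 140_000, 180_000, 220_000, 260_000])

# Fit a straight line: price = a * area + b.
a, b = np.polyfit(area, price, 1)

# Predict the "fair price" of an 80 m² house -- a value that sits
# in no database and has to be calculated from a model.
fair_price = a * 80 + b
print(round(fair_price))
```

A real model would of course use many more features (location, age, number of rooms) and a more robust algorithm, but the principle is the same.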

Throughout this article you'll learn how to prepare a Fedora system as a data scientist's development environment and also as a production system. Most of the basic software is RPM-packaged, but the most advanced parts can currently only be installed with Python's pip tool.

## Jupyter — the IDE

Most modern data scientists use Python. An important part of their work is EDA (exploratory data analysis): a manual, interactive process in which they retrieve data, explore its features, search for correlations, plot graphics to visualize and understand how the data is shaped, and prototype predictive models.
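A minimal sketch of those first EDA steps, using pandas (the dataset and column names here are invented for illustration):

```python
import pandas as pd

# Hypothetical dataset: one row per student.
df = pd.DataFrame({
    "hours_studied": [2, 5, 1, 8, 3, 7],
    "absences":      [9, 2, 10, 1, 6, 2],
    "final_grade":   [55, 80, 50, 95, 60, 90],
})

# Typical first EDA steps: summary statistics and correlations.
print(df.describe())
print(df.corr())

# In a notebook you would also plot the data, e.g.:
# df.plot.scatter(x="hours_studied", y="final_grade")
```

Running this interactively, cell by cell, inspecting each output before writing the next step, is exactly the workflow Jupyter is built for.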

Jupyter is a web application that is perfect for this task. Jupyter works with notebooks: documents that mix rich text, including beautifully rendered math formulas (thanks to MathJax), with blocks of code and code output, including graphics.

Notebook files have the .ipynb extension, which stands for IPython Notebook, Jupyter's former name.
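Under the hood, an .ipynb file is just JSON. A heavily stripped-down skeleton looks roughly like this (real files carry additional metadata fields per cell):

```
{
  "cells": [
    {
      "cell_type": "markdown",
      "source": ["# My analysis"]
    },
    {
      "cell_type": "code",
      "source": ["print('hello')"],
      "outputs": []
    }
  ],
  "metadata": {},
  "nbformat": 4,
  "nbformat_minor": 5
}
```

This is why notebooks are easy to version, share, and render outside Jupyter itself.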

### Setting up and running Jupyter

First, install essential packages for Jupyter (using sudo):

$ sudo dnf install python3-notebook mathjax sscg

You might want to install additional, optional Python modules commonly used by data scientists:

$ sudo dnf install python3-seaborn python3-lxml python3-basemap python3-scikit-image python3-scikit-learn python3-sympy python3-dask+dataframe python3-nltk

Set a password to access the notebook:

$ mkdir -p $HOME/.jupyter
$ jupyter notebook password

Now, type a password for yourself. This will create the file $HOME/.jupyter/jupyter_notebook_config.json with your encrypted password.

Next, prepare for SSL by generating a self-signed HTTPS certificate for Jupyter's web server:

$ cd $HOME/.jupyter; sscg

Finish configuring Jupyter by editing your $HOME/.jupyter/jupyter_notebook_config.json file. Make it look like this:

```
{
  "NotebookApp": {
    "password": "sha1:abf58...87b",
    "ip": "*",
    "allow_origin": "*",
    "allow_remote_access": true,
    "open_browser": false,
    "websocket_compression_options": {},
    "certfile": "/home/aviram/.jupyter/service.pem",
    "keyfile": "/home/aviram/.jupyter/service-key.pem",
    "notebook_dir": "/home/aviram/Notebooks"
  }
}
```

Change the folder names (/home/aviram above) to match your own home folder. The password entry was already there after you created your password, and certfile and keyfile point to the crypto-related files generated by sscg.

Create a folder for your notebook files, as configured in the notebook_dir setting above:

$ mkdir $HOME/Notebooks

Now you are all set. Just run Jupyter Notebook from anywhere on your system by typing:

$ jupyter notebook

### Imbalanced Learn

imbalanced-learn provides ways to under-sample and over-sample data. It is useful in fraud-detection scenarios, where the known fraud data is very small compared to the non-fraud data. In these cases, data augmentation of the known fraud data is needed to make it more relevant for training predictors. Install it with pip:

$ pip3 install imbalanced-learn --user
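To make the idea concrete, here is a toy sketch of random over-sampling in plain Python. It illustrates the concept only and does not use imbalanced-learn's actual API:

```python
import random

random.seed(42)

# Hypothetical imbalanced dataset: 95 legitimate transactions, 5 frauds.
legit = [("legit", i) for i in range(95)]
fraud = [("fraud", i) for i in range(5)]

# Random over-sampling: draw fraud rows with replacement
# until both classes have the same size.
fraud_oversampled = random.choices(fraud, k=len(legit))

balanced = legit + fraud_oversampled
print(len(balanced))  # 190 rows, half of them fraud
```

imbalanced-learn packages this pattern (and smarter variants such as SMOTE) behind a scikit-learn-compatible interface.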

### Keras

Keras is a library for deep learning and neural networks. Install it with pip:

$ sudo dnf install python3-h5py
$ pip3 install keras --user
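Keras models are built as stacks of layers. As a rough plain-NumPy illustration of what a single dense layer computes (this is not Keras code; the weight values are made up):

```python
import numpy as np

# A dense layer computes: output = activation(x @ W + b)
def dense(x, W, b):
    z = x @ W + b
    return np.maximum(z, 0.0)  # ReLU activation

# Toy weights mapping 3 inputs to 2 units.
W = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [0.0,  2.0]])
b = np.array([0.1, -0.1])

x = np.array([1.0, 2.0, 3.0])
print(dense(x, W, b))
```

Keras's job is to chain many such layers, compute their gradients, and train the weights for you.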

### TensorFlow

TensorFlow is a popular framework for building neural networks. Install it with pip:

$ pip3 install tensorflow --user

Photo courtesy of FolsomNatural on Flickr (CC BY-SA 2.0).