Static Analyzers in Python

Author: Adrian Tam

Static analyzers are tools that check your code without actually running it. The most basic form of static analysis is the syntax highlighting in your favorite editor. If you need to compile your code (say, in C++), your compiler, such as Clang from the LLVM project, may also provide some static analysis to warn you about potential issues (e.g., mistakenly writing the assignment “=” where the equality test “==” was intended in C++). In Python, we have some tools to identify potential errors or point out violations of coding standards.

After finishing this tutorial, you will be familiar with some of these tools. Specifically,

  • What can the tools Pylint, Flake8, and mypy do?
  • What are coding style violations?
  • How can we use type hints to help analyzers identify potential bugs?

Let’s get started.


Overview

This tutorial is in three parts; they are:

  • Introduction to Pylint
  • Introduction to Flake8
  • Introduction to mypy

Pylint

Lint was the name of a static analyzer for C created a long time ago. Pylint borrowed its name and is one of the most widely used static analyzers for Python. It is available as a Python package, and we can install it with pip:

$ pip install pylint

Then we have the command pylint available in our system.

Pylint can check a single script or an entire directory. For example, if we have the following script saved as lenet5-notworking.py:

import numpy as np
import h5py
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, AveragePooling2D, Dropout, Flatten
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.callbacks import EarlyStopping

# Load MNIST digits
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

# Reshape data to (n_samples, height, width, n_channel)
X_train = np.expand_dims(X_train, axis=3).astype("float32")
X_test = np.expand_dims(X_test, axis=3).astype("float32")

# One-hot encode the output
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# LeNet5 model
def createmodel(activation):
    model = Sequential([
        Conv2D(6, (5,5), input_shape=(28,28,1), padding="same", activation=activation),
        AveragePooling2D((2,2), strides=2),
        Conv2D(16, (5,5), activation=activation),
        AveragePooling2D((2,2), strides=2),
        Conv2D(120, (5,5), activation=activation),
        Flatten(),
        Dense(84, activation=activation),
        Dense(10, activation="softmax")
    ])
    return model

# Train the model
model = createmodel(tanh)
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
earlystopping = EarlyStopping(monitor="val_loss", patience=4, restore_best_weights=True)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, batch_size=32, callbacks=[earlystopping])

# Evaluate the model
print(model.evaluate(X_test, y_test, verbose=0))
model.save("lenet5.h5")

We can ask Pylint to tell us how good our code is before even running it:

$ pylint lenet5-notworking.py

The output is as follows:

************* Module lenet5-notworking
lenet5-notworking.py:39:0: C0301: Line too long (115/100) (line-too-long)
lenet5-notworking.py:1:0: C0103: Module name "lenet5-notworking" doesn't conform to snake_case naming style (invalid-name)
lenet5-notworking.py:1:0: C0114: Missing module docstring (missing-module-docstring)
lenet5-notworking.py:4:0: E0611: No name 'datasets' in module 'LazyLoader' (no-name-in-module)
lenet5-notworking.py:5:0: E0611: No name 'models' in module 'LazyLoader' (no-name-in-module)
lenet5-notworking.py:6:0: E0611: No name 'layers' in module 'LazyLoader' (no-name-in-module)
lenet5-notworking.py:7:0: E0611: No name 'utils' in module 'LazyLoader' (no-name-in-module)
lenet5-notworking.py:8:0: E0611: No name 'callbacks' in module 'LazyLoader' (no-name-in-module)
lenet5-notworking.py:18:25: E0601: Using variable 'y_train' before assignment (used-before-assignment)
lenet5-notworking.py:19:24: E0601: Using variable 'y_test' before assignment (used-before-assignment)
lenet5-notworking.py:23:4: W0621: Redefining name 'model' from outer scope (line 36) (redefined-outer-name)
lenet5-notworking.py:22:0: C0116: Missing function or method docstring (missing-function-docstring)
lenet5-notworking.py:36:20: E0602: Undefined variable 'tanh' (undefined-variable)
lenet5-notworking.py:2:0: W0611: Unused import h5py (unused-import)
lenet5-notworking.py:3:0: W0611: Unused tensorflow imported as tf (unused-import)
lenet5-notworking.py:6:0: W0611: Unused Dropout imported from tensorflow.keras.layers (unused-import)

-------------------------------------
Your code has been rated at -11.82/10

If you provide the root directory of a module to Pylint, all components of the module will be checked by Pylint. In that case, you will see the path of different files at the beginning of each line.

There are several things to note here. First, the complaints from Pylint fall into different categories. The most common ones are conventions (i.e., matters of style), warnings (i.e., the code may run but behave differently from what you intended), and errors (i.e., the code may fail to run and throw exceptions). Each complaint is identified by a code such as E0601, where the first letter indicates the category.
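To make the role of the first letter concrete, here is a minimal sketch that tallies Pylint complaints by category. The function name count_by_category and the regular expression are illustrative assumptions, written against the output format shown above; they are not part of Pylint itself.

```python
import re
from collections import Counter

# Only the three categories discussed in the text are named here;
# any other first letter is counted as "other".
CATEGORY_NAMES = {"C": "convention", "W": "warning", "E": "error"}

# Matches lines such as "file.py:18:25: E0601: message (symbolic-name)"
MESSAGE_RE = re.compile(r"^[^:]+:\d+:\d+: (?P<code>[A-Z]\d{4}): ")


def count_by_category(report: str) -> Counter:
    """Count Pylint messages in a text report, grouped by category."""
    counts: Counter = Counter()
    for line in report.splitlines():
        match = MESSAGE_RE.match(line)
        if match:
            letter = match.group("code")[0]
            counts[CATEGORY_NAMES.get(letter, "other")] += 1
    return counts


sample = """\
lenet5-notworking.py:39:0: C0301: Line too long (115/100) (line-too-long)
lenet5-notworking.py:18:25: E0601: Using variable 'y_train' before assignment (used-before-assignment)
lenet5-notworking.py:23:4: W0621: Redefining name 'model' from outer scope (line 36) (redefined-outer-name)
lenet5-notworking.py:2:0: W0611: Unused import h5py (unused-import)
"""
print(count_by_category(sample))
```

A tally like this can help decide which category to tackle first, for example fixing the errors before worrying about conventions.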

Pylint may give false positives. In the example above, Pylint flagged the import from tensorflow.keras.datasets as an error. This is caused by an optimization in the TensorFlow package: not everything is scanned and loaded by Python when we import TensorFlow; instead, a LazyLoader is created to load only the necessary parts of this large package. This saves significant time when starting the program, but it also confuses Pylint into believing that we are importing something that doesn’t exist.
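The idea of lazy loading can be sketched with the standard library alone. The class below is a hypothetical stand-in named LazyModule, not TensorFlow’s actual LazyLoader; it only illustrates why a tool that inspects the placeholder without running it cannot see the names that appear later.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Hypothetical placeholder that imports the real module on first use."""

    def __init__(self, name: str):
        super().__init__(name)
        self._real = None  # the real module is not imported yet

    def __getattr__(self, attr: str):
        # Called only when normal lookup fails, i.e., for names that the
        # placeholder itself does not define.
        if self._real is None:
            self._real = importlib.import_module(self.__name__)
        return getattr(self._real, attr)


lazy_json = LazyModule("json")
print("dumps" in vars(lazy_json))   # the placeholder has no such name
print(lazy_json.dumps([1, 2]))      # the real module is loaded on demand
```

The attribute dumps only becomes reachable at run time, which is exactly the kind of information a static analyzer does not have.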

Furthermore, one of the key features of Pylint is to help us align our code with the PEP 8 coding style. When we define a function without a docstring, for instance, Pylint will complain that we didn’t follow the coding convention even if the code is not doing anything wrong.

But the most important use of Pylint is to help us identify potential issues. For example, we misspelled y_train as Y_train with an uppercase Y. Pylint tells us that we are using a variable before assigning any value to it. It does not tell us directly what went wrong, but it definitely points us to the right spot to proofread our code. Similarly, when we define the variable model on line 23, Pylint tells us that there is a variable of the same name in the outer scope. Hence the reference to model later on may not be what we were thinking. Likewise, an unused import may be a sign that we misspelled the module’s name where we meant to use it.
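The misspelling described above only fails once Python reaches the offending line, which is why catching it before running is valuable. The snippet below is a reduced sketch of the same slip; it uses plain lists instead of the MNIST labels, and the try/except exists only so that the example can run to completion.

```python
Y_train = [5, 0, 4]  # labels stored under an uppercase name

try:
    # Same slip as in the script: the lowercase name is read before it exists
    y_train = list(y_train)
except NameError as err:
    message = str(err)
    print(f"Failed at run time: {message}")
```

In a long training script, such a line might sit after minutes of data loading, so a warning from the analyzer saves that wait.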

All these are hints provided by Pylint. We still have to use our judgement to correct our code (or ignore Pylint’s complaints).

If you know which complaints Pylint should stop raising, you can ask it to ignore them. For example, we know the import statements are fine, so we can invoke Pylint with:

$ pylint -d E0611 lenet5-notworking.py

Now, all errors of code E0611 will be ignored by Pylint. You can disable multiple codes with a comma-separated list, e.g.,

$ pylint -d E0611,C0301 lenet5-notworking.py

If you want to disable some issues on only a specific line or a specific part of the code, you can add special comments to your code, as follows:

...
from tensorflow.keras.datasets import mnist  # pylint: disable=no-name-in-module
from tensorflow.keras.models import Sequential # pylint: disable=E0611
from tensorflow.keras.layers import Conv2D, Dense, AveragePooling2D, Dropout, Flatten
from tensorflow.keras.utils import to_categorical

The magic keyword pylint: introduces Pylint-specific instructions. The code E0611 and the name no-name-in-module refer to the same check. In the example above, Pylint will complain about the last two import statements but not the first two because of those special comments.
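The example above mutes a check on single lines. For a part of the code, the following sketch assumes Pylint’s documented behavior that a disable comment on a line of its own applies to the rest of the enclosing block, and that a matching enable comment switches the check back on. The function legacy_sum is a made-up example.

```python
def legacy_sum(values):
    """Add up the values using a deliberately non-conforming variable name."""
    # pylint: disable=invalid-name
    # From here on, invalid-name is muted inside this function.
    T = 0
    for value in values:
        T += value
    # pylint: enable=invalid-name
    return T


print(legacy_sum([1, 2, 3]))
```

Keeping the muted region as small as possible helps, since a check that is disabled for a whole file can no longer point out genuine mistakes elsewhere in it.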

Flake8

The tool Flake8 is in fact a wrapper over PyFlakes, McCabe, and pycodestyle. When you install flake8 with:

$ pip install flake8

you will install all these dependencies.

Similar to Pylint, we have the command flake8 after installing this package, and we can pass in a script or a directory for analysis. But the focus of Flake8 is inclined toward coding style. Hence we would see the following output for the same code as above:

$ flake8 lenet5-notworking.py
lenet5-notworking.py:2:1: F401 'h5py' imported but unused
lenet5-notworking.py:3:1: F401 'tensorflow as tf' imported but unused
lenet5-notworking.py:6:1: F401 'tensorflow.keras.layers.Dropout' imported but unused
lenet5-notworking.py:6:80: E501 line too long (85 > 79 characters)
lenet5-notworking.py:18:26: F821 undefined name 'y_train'
lenet5-notworking.py:19:25: F821 undefined name 'y_test'
lenet5-notworking.py:22:1: E302 expected 2 blank lines, found 1
lenet5-notworking.py:24:21: E231 missing whitespace after ','
lenet5-notworking.py:24:41: E231 missing whitespace after ','
lenet5-notworking.py:24:44: E231 missing whitespace after ','
lenet5-notworking.py:24:80: E501 line too long (87 > 79 characters)
lenet5-notworking.py:25:28: E231 missing whitespace after ','
lenet5-notworking.py:26:22: E231 missing whitespace after ','
lenet5-notworking.py:27:28: E231 missing whitespace after ','
lenet5-notworking.py:28:23: E231 missing whitespace after ','
lenet5-notworking.py:36:1: E305 expected 2 blank lines after class or function definition, found 1
lenet5-notworking.py:36:21: F821 undefined name 'tanh'
lenet5-notworking.py:37:80: E501 line too long (86 > 79 characters)
lenet5-notworking.py:38:80: E501 line too long (88 > 79 characters)
lenet5-notworking.py:39:80: E501 line too long (115 > 79 characters)

The error codes beginning with the letter E are from pycodestyle, and those beginning with the letter F are from PyFlakes. We can see it complains about coding style issues such as the missing space after the comma in (5,5). We can also see it can identify the use of variables before assignment. But it does not catch some code smells, such as the function createmodel() that reuses the variable model that was already defined in the outer scope.
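To give a feel for how a check such as F401 can work without executing anything, here is a rough sketch based on the standard ast module. It is not how PyFlakes is actually implemented, and the helper name unused_imports is an assumption for illustration; it also ignores scopes, so it is far less precise than a real analyzer.

```python
import ast


def unused_imports(source: str) -> list[str]:
    """Return imported names that never appear elsewhere in the source."""
    tree = ast.parse(source)  # parses the text only; nothing is imported or run
    imported = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the top-level name "a"
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)


sample = """\
import numpy as np
import h5py
from collections import OrderedDict, Counter

data = np.zeros(3)
counts = Counter("abc")
"""
print(unused_imports(sample))
```

Note that the sample mentions numpy and h5py without either package being needed to analyze it, because the source is only parsed.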

Similar to Pylint, we can also ask Flake8 to ignore some complaints. For example,

$ flake8 --ignore E501,E231 lenet5-notworking.py

Complaints with those codes will no longer be printed in the output:

lenet5-notworking.py:2:1: F401 'h5py' imported but unused
lenet5-notworking.py:3:1: F401 'tensorflow as tf' imported but unused
lenet5-notworking.py:6:1: F401 'tensorflow.keras.layers.Dropout' imported but unused
lenet5-notworking.py:18:26: F821 undefined name 'y_train'
lenet5-notworking.py:19:25: F821 undefined name 'y_test'
lenet5-notworking.py:22:1: E302 expected 2 blank lines, found 1
lenet5-notworking.py:36:1: E305 expected 2 blank lines after class or function definition, found 1
lenet5-notworking.py:36:21: F821 undefined name 'tanh'

We can also use magic comments to disable some complaints, e.g.,

...
import tensorflow as tf  # noqa: F401
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential

Flake8 will look for the comment # noqa: to skip some complaints on those particular lines.

Mypy

Python is not a statically typed language, so, unlike in C or Java, you do not need to declare the types of functions or variables before use. But Python now has a type hint notation, so we can specify what type a function or variable is intended to be, without the language enforcing compliance as a statically typed language would.
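A short sketch shows what “without enforcing” means in practice. The function double is a made-up example: the hints are stored on the function but are not checked when it is called.

```python
def double(value: int) -> int:
    """Return twice the value; the hints say int, but nothing enforces it."""
    return value * 2


print(double(21))
print(double("ab"))            # a string is accepted at run time as well
print(double.__annotations__)  # the hints are kept as plain metadata
```

The second call is exactly the kind of misuse that goes unnoticed at run time but that a type checker can report from the hints alone.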

One of the biggest benefits of using type hints in Python is that they provide additional information for static analyzers to check. Mypy is a tool that understands type hints. Even without type hints, mypy can still raise complaints similar to those of Pylint and Flake8.

We can install Mypy from PyPI:

$ pip install mypy

Then the example above can be provided to the mypy command:

$ mypy lenet5-notworking.py
lenet5-notworking.py:2: error: Skipping analyzing "h5py": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:2: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
lenet5-notworking.py:3: error: Skipping analyzing "tensorflow": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:4: error: Skipping analyzing "tensorflow.keras.datasets": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:5: error: Skipping analyzing "tensorflow.keras.models": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:6: error: Skipping analyzing "tensorflow.keras.layers": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:7: error: Skipping analyzing "tensorflow.keras.utils": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:8: error: Skipping analyzing "tensorflow.keras.callbacks": module is installed, but missing library stubs or py.typed marker
lenet5-notworking.py:18: error: Cannot determine type of "y_train"
lenet5-notworking.py:19: error: Cannot determine type of "y_test"
lenet5-notworking.py:36: error: Name "tanh" is not defined
Found 10 errors in 1 file (checked 1 source file)

We see errors similar to those from Pylint above, although sometimes less precise (e.g., the issue with the variable y_train). However, we also see one characteristic of mypy: it expects all the libraries we use to come with a stub so that type checking can be done. This is because type hints are optional. If the code from a library does not provide type hints, the code still works, but mypy cannot verify how we use it. Some libraries have typing stubs available that enable mypy to check them better.

Let’s consider another example:

import h5py

def dumphdf5(filename: str) -> int:
    """Open a HDF5 file and print all the dataset and attributes stored

    Args:
        filename: The HDF5 filename

    Returns:
        Number of dataset found in the HDF5 file
    """
    count: int = 0

    def recur_dump(obj) -> None:
        print(f"{obj.name} ({type(obj).__name__})")
        if obj.attrs.keys():
            print("\tAttribs:")
            for key in obj.attrs.keys():
                print(f"\t\t{key}: {obj.attrs[key]}")
        if isinstance(obj, h5py.Group):
            # Group has key-value pairs
            for key, value in obj.items():
                recur_dump(value)
        elif isinstance(obj, h5py.Dataset):
            count += 1
            print(obj[()])

    with h5py.File(filename) as obj:
        recur_dump(obj)
        print(f"{count} dataset found")

with open("my_model.h5") as fp:
    dumphdf5(fp)

This program is supposed to load a HDF5 file (such as a Keras model) and print every attribute and data stored in it. We used the h5py module (which does not have a typing stub, and hence mypy cannot identify the types it used), but we added type hints to the function we defined, dumphdf5(). This function expects the filename of a HDF5 file and prints everything stored inside. At the end, the number of datasets stored will be returned.

When we save this script into dumphdf5.py and pass it into mypy, we will see the following:

$ mypy dumphdf5.py
dumphdf5.py:1: error: Skipping analyzing "h5py": module is installed, but missing library stubs or py.typed marker
dumphdf5.py:1: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
dumphdf5.py:3: error: Missing return statement
dumphdf5.py:33: error: Argument 1 to "dumphdf5" has incompatible type "TextIO"; expected "str"
Found 3 errors in 1 file (checked 1 source file)

We misused our function so that an opened file object is passed into dumphdf5() instead of just the filename (as a string). Mypy can identify this error. We also declared that the function should return an integer, but we didn’t have the return statement in the function.

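However, there is one more error in this code that mypy didn’t identify. Namely, the variable count used in the inner function recur_dump() should be declared nonlocal because it is defined in the enclosing function, outside the inner function’s own scope. Without that declaration, the statement count += 1 makes count a local variable of recur_dump(), so the line fails with an UnboundLocalError when it runs. This error can be caught by Pylint and Flake8, but mypy missed it.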
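The scoping problem with count can be reproduced in a few lines. The two functions below are simplified stand-ins for dumphdf5(), with made-up names, that differ only in the nonlocal declaration.

```python
def count_without_nonlocal():
    count = 0

    def bump():
        count += 1  # assignment makes count local to bump(), so this read fails

    bump()
    return count


def count_with_nonlocal():
    count = 0

    def bump():
        nonlocal count  # refer to the variable of the enclosing function
        count += 1

    bump()
    return count


try:
    count_without_nonlocal()
except UnboundLocalError as err:
    failure = str(err)
    print(f"Without nonlocal: {failure}")

print(f"With nonlocal: {count_with_nonlocal()}")
```

Defining the first function succeeds without complaint; the failure only appears when bump() is actually called.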

The following is the complete, corrected code with no more errors. Note that we added the magic comment “# type: ignore” on the first line to mute the typing stubs warning from mypy:

import h5py  # type: ignore


def dumphdf5(filename: str) -> int:
    """Open a HDF5 file and print all the dataset and attributes stored

    Args:
        filename: The HDF5 filename

    Returns:
        Number of dataset found in the HDF5 file
    """
    count: int = 0

    def recur_dump(obj) -> None:
        nonlocal count
        print(f"{obj.name} ({type(obj).__name__})")
        if obj.attrs.keys():
            print("\tAttribs:")
            for key in obj.attrs.keys():
                print(f"\t\t{key}: {obj.attrs[key]}")
        if isinstance(obj, h5py.Group):
            # Group has key-value pairs
            for key, value in obj.items():
                recur_dump(value)
        elif isinstance(obj, h5py.Dataset):
            count += 1
            print(obj[()])

    with h5py.File(filename) as obj:
        recur_dump(obj)
        print(f"{count} dataset found")
    return count


dumphdf5("my_model.h5")

In conclusion, the three tools introduced above complement each other. You may consider running all of them to look for possible bugs in your code or to improve its coding style. Each tool allows some configuration, either from the command line or from a config file, to suit your needs (e.g., how long must a line be before it deserves a warning?). Using a static analyzer is also a way to help yourself develop better programming skills.
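Running all three tools in one go can be sketched with a small helper. The function run_analyzers and its return format are assumptions for illustration; it simply calls each command with the file path, as in the invocations shown earlier, and skips any tool that is not installed.

```python
import shutil
import subprocess


def run_analyzers(path: str, tools=("pylint", "flake8", "mypy")) -> dict:
    """Run each analyzer on path; map tool name to exit code (None if missing)."""
    results = {}
    for tool in tools:
        if shutil.which(tool) is None:
            results[tool] = None  # command not found on this system
            continue
        completed = subprocess.run(
            [tool, path], capture_output=True, text=True, check=False
        )
        print(completed.stdout, end="")
        results[tool] = completed.returncode
    return results


# Example call (assumes the file exists in the current directory):
# run_analyzers("lenet5-notworking.py")
```

A nonzero exit code generally means the tool found something to report, so a helper like this can also serve as a simple gate before committing code.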

Summary

In this tutorial, you’ve seen how some common static analyzers can help you write better Python code. Specifically, you learned:

  • The strengths and weaknesses of three tools: Pylint, Flake8, and mypy
  • How to customize the behavior of these tools
  • How to understand the complaints made by these analyzers

The post Static Analyzers in Python appeared first on Machine Learning Mastery.