JDU

Class Warfare!

This is a bit of a high-level run-through of "Classes" in Python, a bit about object-oriented programming, and a general walkthrough of some cool stuff you can do with them.

What is a Class?

A class can be likened to a "template": it’s a structure which defines what information an instance of that template holds, how it behaves, and how it operates.

Classes can "inherit" from one another or be "composed" together to create ever more complex structures of logic and information. We’ll get to that in a bit, but first let’s go over some of the basic bits of a class.

class Animal:
    def __init__(self, name=None):
        self.name = name

This is a really basic class. We define it much like a function, but instead of def we use the class keyword.

You will have noticed that funky-lookin' __init__ method (a method is a function defined on a class).

This is called a "constructor" (strictly speaking it’s an initializer, since the instance already exists by the time it runs). Python calls it automatically when you create an instance of a class.

my_animal = Animal(name="Marmoset")

When we create an instance of our class (known as instantiation), Python treats the class Animal as a callable (i.e. it acts like a function) and, in the background, runs the __init__ method for us.

And even though our __init__() doesn’t specify a return, when you instantiate a class this way Python returns the newly created instance of the class.

So calling Animal(name="Marmoset") is akin to doing something like this:

def create_animal(name=None):
    # Ask Python for a new, uninitialized instance of Animal
    new_animal = Animal.__new__(Animal)
    # Run the initializer on it; Python passes new_animal in as `self`
    new_animal.__init__(name=name)
    return new_animal

my_animal = create_animal(name="Marmoset")

But instead of you having to write all that out by hand, Python does the magic for you so that all you have to do is:

my_animal = Animal(name="Penguin")

my_animal is now an instance of the type Animal, and we can inspect this using the type() function:

print(type(my_animal))
# <class '__main__.Animal'>

So what is it that we’re actually doing in the constructor (i.e. __init__)? You may have noticed that __init__ accepts an argument self, something we aren’t passing to it. This is part of the Python magic as well: when you call Animal(), Python creates the instance of the class and passes that specific instance to the constructor as the self argument, like we showed in the create_animal function above.

When you implement methods on a class you’ll, in general, always have self as the first argument in the method signature. That’s because when you call a method on an instance, Python wraps your function in a "bound method" that passes the instance itself in as the first argument.

And you CAN implement your own methods:

class Animal:
    def __init__(self, name=None):
        self.name = name

    def make_noise(self):
        print("<silence>")

my_animal = Animal(name="Marmoset")

my_animal.make_noise() # Calling our custom method
# <silence>
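To make the self-passing a bit less magical, here’s a minimal sketch. The class mirrors the one above, and the describe method is just a made-up name for illustration. Calling a method on an instance, and calling the same function on the class while passing the instance in by hand, should give you the same result:

```python
class Animal:
    def __init__(self, name=None):
        self.name = name

    def describe(self):  # hypothetical method, only here for the demo
        return f"I am a {self.name}"


my_animal = Animal(name="Marmoset")

# The usual way: Python fills in `self` for us.
via_instance = my_animal.describe()

# The long way: look the function up on the class and pass the instance ourselves.
via_class = Animal.describe(my_animal)

print(via_instance == via_class)
# True
```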

Our animal doesn’t make a sound, but then we never gave it one, so let’s add sound to our __init__.

class Animal:
    def __init__(self, name=None, sound=None):
        self.name = name
        self.sound = sound

    def make_noise(self):
        if self.sound:
            print(self.sound)
        else:
            print('<silence>')

my_animal = Animal(name="Cow", sound="Mooo")

my_animal.make_noise()
# Mooo

Cool, we added a new property to our class, and we assigned it through the constructor. But we can also change that value after the fact:

my_animal.sound = "Squawk"

my_animal.make_noise()
# Squawk

So not only does a class act as a way to encapsulate business logic that has commonality, it also acts as a sort of dict that stores information and data that we can access from within the class (through the self argument) or from outside by accessing the properties using dot-notation (.) directly (my_animal.sound).
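The "sort of dict" comparison is fairly literal for ordinary classes. As a small sketch (the legs attribute is made up for the demo), the built-in vars() shows the dict that sits behind an instance:

```python
class Animal:
    def __init__(self, name=None, sound=None):
        self.name = name
        self.sound = sound


my_animal = Animal(name="Cow", sound="Mooo")

# vars() returns the instance's underlying attribute dict
print(vars(my_animal))
# {'name': 'Cow', 'sound': 'Mooo'}

# Attributes added from outside land in the same dict
my_animal.legs = 4
print(vars(my_animal))
# {'name': 'Cow', 'sound': 'Mooo', 'legs': 4}
```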

Now I like my Animal, but I want to add another property called collar to hold the colour of its collar, and well, not all animals have collars, only pets really do. However, I don’t want to have to create a whole new class for "Pets" and have to maintain the same common properties and methods in two places.

This is where "sub-classing" comes in. Using sub-classing, a new class can inherit the properties and methods of a parent class.

class Pet(Animal): # <- we inherit Animal
    def __init__(self, collar=None, **kwargs):
        super().__init__(**kwargs) # self is passed along for us
        self.collar = collar

my_pet = Pet(name="Dog", sound="Woof", collar="Red")

my_pet.make_noise()
# Woof

We’re getting a bit crazy now! Our new class Pet inherits from Animal, so every instance of Pet is effectively an Animal too: it inherits the properties and methods of the Animal class.

To save ourselves some typing, we can use the special function super() to get the parent of our current class and call functions on the parent class.

In this case, with super().__init__(**kwargs), we’re unpacking the other keyword arguments and passing them to the __init__ method of Animal (Python passes self along for us). This saves us having to type out self.name = name again, and ensures we trigger any additional logic defined in the Animal class’s __init__.

We could go on to create another sub-class, a sub-class of Pet called Dog for dog-specific features and behaviour, and we could go even further and create new classes that inherit from Dog representing individual breeds, customizing and tweaking the behaviour as we go further down the inheritance structure.
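Here’s a rough sketch of what that Dog sub-class might look like. The default "Woof" sound and the fetch method are invented for the example; the point is only to show each level adding a little behaviour on top of its parent:

```python
class Animal:
    def __init__(self, name=None, sound=None):
        self.name = name
        self.sound = sound

    def make_noise(self):
        print(self.sound if self.sound else "<silence>")


class Pet(Animal):
    def __init__(self, collar=None, **kwargs):
        super().__init__(**kwargs)
        self.collar = collar


class Dog(Pet):
    def __init__(self, **kwargs):
        # Assumption for this sketch: dogs woof unless told otherwise
        kwargs.setdefault("sound", "Woof")
        super().__init__(**kwargs)

    def fetch(self):  # hypothetical dog-only behaviour
        return f"{self.name} fetches the stick"


rex = Dog(name="Rex", collar="Red")
rex.make_noise()
# Woof
print(rex.fetch())
# Rex fetches the stick
print(isinstance(rex, Animal), isinstance(rex, Pet))
# True True
```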

This is a style of object-oriented programming (OOP for short). The idea is that you encapsulate common functionality into classes (objects) and inherit and extend those for specific use cases.

Composition

Another approach to using classes is Composition: using classes composed together in order to abstract or simplify code interfaces and logic. Let’s jump into an example.

Let’s say I have an application, and I need to track the state of some stuff. What I might do is create a new class called ApplicationState.

class ApplicationState:
    def __init__(self):
        pass

I could create sub-classes of ApplicationState, but that runs counter to my need. I need a "centralized" point to control things, not a distributed, customized set of functionality spread around. But I don’t want one big monolithic monster class either; that would be hard to maintain. That’s when we reach for Composition. So let’s say that my application is a library manager, keeping track of books I own. The first thing we need is a class representing a book.

class Book:
    def __init__(self, title=None, author=None):
        self.title = title
        self.author = author

Cool, so we could then just add a property books to our application state that’s a list containing a bunch of Book instances right?

Sure, but we can go even further than that. Let’s create a Books class instead.

class Books:
    def __init__(self):
        self.books = []

So we create this new class Books, and one of its internal properties is an empty list that will hold our Book instances.

But how do we get the books "in there"? Well, we implement a new method called load_books and we call it from our __init__.

import json

class Books:
    def __init__(self):
        self.books = []
        self.load_books()

    def load_books(self):
        """ Loads up books from our library.json file"""
        with open("library.json", "r") as f:
            raw_books = json.load(f)

        # Each entry is expected to be a dict like {"title": ..., "author": ...}
        self.books = [Book(**x) for x in raw_books]

What about if we want to save our library? Let’s implement a to_dict method on our Book class to help us out with converting individual books back to dicts.

class Book:

    ...

    def to_dict(self):
        return {
            "title": self.title,
            "author": self.author,
        }

Now let’s add a save_books method to our Books class:

class Books:

    ...

    def save_books(self):
        """ flush our library to disk """
        raw_books = [x.to_dict() for x in self.books]

        with open("library.json", "w") as f:
            json.dump(raw_books, f) # json.dump needs the file to write to

Ok so now we can create an instance of our Books class and do stuff with it.

my_library = Books()

for book in my_library.books:
    print(book.title, book.author)

my_library.save_books()
# writes to library.json

Because we’ve wrapped our list of Book up into an encapsulating class (we composed it) we can implement methods on Books to help us manage that list.

class Books:

    ...

    def find_by_author(self, author_name):
        """ Find all books by a specific author """
        return [x for x in self.books if x.author == author_name]

    def delete_book(self, title_to_delete):
        """ Delete a book from our library """
        self.books = [x for x in self.books if x.title != title_to_delete]
        self.save_books() # call our method to persist the changes

    def has_book(self, title_to_find):
        """ check if we already have a book """
        return any(x.title == title_to_find for x in self.books)

    def update_author(self, title_to_edit, new_author):
        """ Update a title's author name """
        for book in self.books:
            if book.title == title_to_edit:
                book.author = new_author

        self.save_books()

The Books class lets us centralize re-usable functionality within the class to make it easier to manage, maintain and access.

Now instead of our code being littered with variations of the same logic all over the place in different areas of our code-base, it’s neatly set up in one place, next to the data it operates on, and if we need to update it, we only have to look at where we defined the class.

Now, coming back to our ApplicationState, we can wrap our Books class inside of it and add any additional stuff our application needs that might be functionally different.

class ApplicationState:
    def __init__(self):
        self.library = Books() # Instantiate our library
        self.app_title = "My Library"
        self.book_search = BookSearch() # Some other class we've encapsulated data and logic into

Now instead of having to pass around individual instances of these specific classes, we just pass around our ApplicationState instance and access its internals through dot-notation.

app_state = ApplicationState()
app_state.library.save_books()

jeffs_books = app_state.library.find_by_author("Jeff Uren")

We can also abstract those lower-level functions, or wrap more code around them to help us handle specific events, for instance:

class ApplicationState:

    ...

    def on_shutdown(self):
        """ if the application shuts down make sure we save our library before we lose it """
        self.library.save_books()
        sys.exit(0) # needs `import sys` at the top of the file; 0 signals a clean exit

    def do_we_have_book(self, title):
        return self.library.has_book(title)

So now I can have a really simple top-level program that wraps all this:

if __name__ == "__main__":
    app_state = ApplicationState()

    author_name = input("Find Books By Author: ")
    found_books = app_state.library.find_by_author(author_name)

    for book in found_books:
        print(book.title, book.author)

My main application / script doesn’t have to be cluttered with all the lower-level, complicated logic and programming.

This is Composition.

Static Classes

Sometimes you want to wrap up a bunch of functionality together and access it in a convenient way, but you don’t actually need an instance of a class, you just want to group some common functionality together thematically.

You can define classes which you never actually create an instance of:

class Utils:
    @staticmethod
    def remove_jeff(some_str):
        return some_str.replace("Jeff", "")

    @staticmethod
    def add_jeff(some_str):
        return some_str + " Jeff"

Utils.remove_jeff("This string has Jeff in it")
Utils.add_jeff("This string needs someone in it!")

Notice how we don’t have an __init__, and the methods we defined don’t have self as the first argument? Although you can still instantiate this class using Utils(), you don’t need to in order to use its methods. This is an easy way to group common functionality together in a class to help you organize things.

The @staticmethod decorator tells Python that the method should be available through Utils.<method_name> without having to instantiate the class. A static method isn’t passed self (or the class), so it works like a plain function that just happens to live in the class’s namespace.
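As a quick sketch of that behaviour, using a cut-down copy of the Utils class from above, a static method gives the same answer whether you call it on the class or on an instance, because no self is involved either way:

```python
class Utils:
    @staticmethod
    def add_jeff(some_str):
        return some_str + " Jeff"


# Called on the class, no instance needed
from_class = Utils.add_jeff("Hello")

# Still works on an instance; no `self` is passed in either case
from_instance = Utils().add_jeff("Hello")

print(from_class, "|", from_instance)
# Hello Jeff | Hello Jeff
```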

Getters and Setters

When you create a class with properties in it:

class MyClass:
    def __init__(self, some_prop):
        self.some_prop = some_prop

some_prop is now editable from outside and inside of the class. But what if you don’t want that, or you want to control how it’s set / retrieved? That’s when we get into getters and setters. These allow you to layer over top of the properties on a class, or hide properties of a class from the user so they can’t make changes to them without you allowing it.

Let’s say we have a property we want to control, or that when it changes, some other logic should fire as well.

class Person:
    def __init__(self, age):
        self._age = age # See how we're using `_` here to "hide" the property
        self.birth_year = 2022 - age

    @property
    def age(self):
        print("getting their age")
        return self._age

    @age.setter
    def age(self, new_age):
        print("Updating age")
        self.birth_year = 2022 - new_age
        self._age = new_age

jeff = Person(39)

print(jeff.birth_year)
# 1983

jeff.age = 87
# Updating age
print(jeff.birth_year)
# 1935

When we do jeff.age = 87 we’re not editing the _age property underneath directly. Because age is defined with @property, Python routes the assignment to our @age.setter decorated method, calling it with 87 as new_age.

Likewise when we read jeff.age we’re not accessing a property directly. Python is doing some magic to let you treat a method on the class instance as if it were a plain attribute: reading jeff.age runs our @property decorated method and hands back its return value.
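The same mechanism covers the "control how it’s set" and "hide it from the user" cases mentioned earlier. The sketch below is one way you might do it, not the only one: the name property and the negative-age rule are assumptions made up for the example. A property with no setter is read-only, and a setter can refuse values it doesn’t like:

```python
class Person:
    def __init__(self, name, age):
        self._name = name
        self.age = age  # goes through the setter below, so it gets validated

    @property
    def name(self):
        # No setter is defined, so `name` is read-only from outside
        return self._name

    @property
    def age(self):
        return self._age

    @age.setter
    def age(self, new_age):
        if new_age < 0:
            raise ValueError("age cannot be negative")
        self._age = new_age


jeff = Person("Jeff", 39)

try:
    jeff.age = -5
except ValueError as err:
    print(err)
# age cannot be negative

try:
    jeff.name = "Geoff"
except AttributeError:
    print("name is read-only")
# name is read-only

print(jeff.name, jeff.age)
# Jeff 39
```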

Accessing all instances of a class

Want to see something weird?

Properties declared directly under the class heading (outside of any method) belong to the class itself and are shared by all instances of that class.

class Person:
    all_the_people = []

    def __init__(self):
        self.all_the_people.append(self)

Looks weird, doesn’t it? But it means that no matter where you are in your code, you can do something relatively cool: you can access all instances of a class from any given instance.

list_of_peeps = [
    Person(),
    Person(),
]

person_dict = {
    "1": Person(),
}

some_other_person = Person()

for person in some_other_person.all_the_people:
    print(person)

# <__main__.Person object at xxxxxxx>
# <__main__.Person object at xxxxxxx>
# <__main__.Person object at xxxxxxx>
# <__main__.Person object at xxxxxxx>

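So you can, for instance, have a constant in your class that’s used in a computation in one of the methods of your class instances. When you update that property on the class itself, or change a mutable value in place like the append() above, the change is visible to all instances of the class throughout your code. Be careful though: assigning through a single instance (some_person.some_constant = ...) creates a new property on that one instance and leaves all the others untouched.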
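Here’s a small sketch of the difference between updating a shared value on the class and assigning it through one instance. Circle and PI are made-up names for the demo:

```python
class Circle:
    PI = 3.14  # class attribute, shared by every instance

    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return self.PI * self.radius ** 2


a = Circle(1)
b = Circle(1)

# Assigning on the class changes what every instance sees
Circle.PI = 3.14159
print(a.PI, b.PI)
# 3.14159 3.14159

# Assigning through an instance creates a new attribute on that instance only
a.PI = 3
print(a.PI, b.PI, Circle.PI)
# 3 3.14159 3.14159
```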

Other magic methods

__repr__

When you do print(some_var), ever wonder where the output of print comes from and who decides what it looks like? And when you print a class instance, why is it that horrible <__main__.ClassName object at xxxxx> message that tells you nothing about the thing you’re printing?

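That’s where __repr__ comes in. You can use it to customize the output of print() when an instance of your class is passed to it (print() actually asks for __str__ first, but falls back to __repr__ when a class doesn’t define one).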

class Person:
    def __init__(self, name=None):
        self.name = name

    def __repr__(self):
        return f"<Person name={self.name}>"

We’re overriding the default implementation of __repr__ on our class here. We just have to return a string from the method, which is what print will output to the console, log or wherever you’re sending this.

p = Person(name="Bob")

print(p)
# <Person name=Bob>
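A small sketch of where else __repr__ shows up, reusing the same Person class: containers such as lists use it for each element, and str() falls back to it when no __str__ is defined.

```python
class Person:
    def __init__(self, name=None):
        self.name = name

    def __repr__(self):
        return f"<Person name={self.name}>"


people = [Person(name="Bob"), Person(name="Alice")]

# Containers use __repr__ for each element
print(people)
# [<Person name=Bob>, <Person name=Alice>]

# str() falls back to __repr__ when __str__ isn't defined
print(str(people[0]) == repr(people[0]))
# True
```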

Operator Overloading

You can actually overload (i.e. override) what a class instance does when you use it with a comparison operator (e.g. ==, >=, <=, etc.) so that you can have custom comparison logic for your class instances.

I’ll give you an example using our earlier Book class.

class Book:
    def __init__(self, title=None, author=None):
        self.author = author
        self.title = title

There’s our book class, but let’s say we have two instances of the Book class for two different books:

book_1 = Book(title="Hitchhikers Guide to the Galaxy", author="Douglas Adams")

book_2 = Book(title="Zen and the Art of Motorcycle Maintenance", author="Robert M. Pirsig")

Now let’s say we want to check if these two books are the same book, how would you go about that?

same_book = False

if book_1.title == book_2.title and book_1.author == book_2.author:
    same_book = True

That’s not particularly intuitive, and we don’t really want to have to write that over and over all over our code. But if we could do the below, that would be much cleaner:

same_book = False

if book_1 == book_2:
    same_book = True

But by default that checks whether the two variables point at the very same instance, not that the properties inside the instances are equal. But we can fix that!

class Book:

    ...

    def __eq__(self, other):
        if not isinstance(other, Book):
            return NotImplemented  # let Python handle comparisons with non-books
        return self.title == other.title and self.author == other.author

Now that we’ve overloaded __eq__, we can compare two books using the == operator:

same_book = book_1 == book_2

You can do this with all sorts of different operators, for instance > is __gt__ and >= is __ge__.
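If you want the full set of comparisons without writing every method by hand, the standard library’s functools.total_ordering can fill in the rest from __eq__ plus one ordering method. The sketch below assumes we want books ordered by title and then author, which is a choice made for this example rather than anything the Book class dictates:

```python
from functools import total_ordering


@total_ordering
class Book:
    def __init__(self, title=None, author=None):
        self.title = title
        self.author = author

    def __eq__(self, other):
        if not isinstance(other, Book):
            return NotImplemented
        return (self.title, self.author) == (other.title, other.author)

    def __lt__(self, other):
        if not isinstance(other, Book):
            return NotImplemented
        # Assumed ordering for this sketch: by title, then by author
        return (self.title, self.author) < (other.title, other.author)


book_1 = Book(title="Hitchhikers Guide to the Galaxy", author="Douglas Adams")
book_2 = Book(title="Zen and the Art of Motorcycle Maintenance", author="Robert M. Pirsig")

print(book_1 < book_2)   # "H" sorts before "Z"
# True
print(book_1 >= book_2)  # filled in by total_ordering
# False

# sorted() works too, since it only needs __lt__
shelf = sorted([book_2, book_1])
print(shelf[0].title)
# Hitchhikers Guide to the Galaxy
```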

Let’s add one more, but this time to the Books class that holds our list of Books:

class Books:
    def __init__(self):
        self.books = []
        self.load_books()

    ...

    def __iter__(self):
        for book in self.books:
            yield book

    def append(self, new_book):
        self.books.append(new_book)
        self.save_books()

We’ve implemented one special method called __iter__, and we’ve added a new method called append(). This allows us to make instances of the Books class act a bit like a list.

library = Books()

# we don't have to reach into library.books property anymore!
for book in library:
    print(book)

new_book = Book(title="Hitchhikers Guide to the Galaxy", author="Douglas Adams")

library.append(new_book)

This simplifies your interactions with instances of the class, and makes it so you can use it without having to know the deep internals of the class itself.
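You can keep going in the same direction. As a sketch, here is an in-memory stand-in for Books (no library.json involved, purely to keep the example self-contained) with two more special methods, __len__ and __contains__, so that len() and the in operator work on it as well:

```python
class Book:
    def __init__(self, title=None, author=None):
        self.title = title
        self.author = author


class Books:
    def __init__(self):
        self.books = []  # in-memory only for this sketch

    def __iter__(self):
        return iter(self.books)

    def __len__(self):
        return len(self.books)

    def __contains__(self, title):
        # Lets us write: "Some Title" in library
        return any(book.title == title for book in self.books)

    def append(self, new_book):
        self.books.append(new_book)


library = Books()
library.append(Book(title="Dune", author="Frank Herbert"))

print(len(library))
# 1
print("Dune" in library)
# True
print("Emma" in library)
# False
```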

__slots__

Classes have a special property named __slots__ that can help your code run more efficiently under the hood and enforce some constraints on your classes.

The __slots__ property allows you to set a list of properties which should be allowed on instances of your class.

class Book:
    __slots__ = ["title", "author"]

    def __init__(self, title=None, author=None):
        self.title = title
        self.author = author

Specifying __slots__ is more memory efficient, as the class instantiation logic knows that it only needs to reserve enough space in memory for two properties. This is important because, if you don’t use __slots__, the properties for each individual instance of a class are stored in a Python dict behind the scenes, and dicts reserve a larger amount of space in memory for themselves in order to allow you to add more data to them over time without shifting things around in memory.

When a property of a class is stored in a slot instead of a dict, retrieving the value from the slot is typically a bit faster as well.

On a small scale this might not seem that big of a deal, but if you’re playing with a large data-set where each record is loaded into an instance of your class, the performance gain and reduction in memory usage can be pretty dramatic, even over using a plain old dict and no class at all.
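To see the "enforce some constraints" part in action, here’s a short sketch using the slotted Book class. The publisher attribute is just an example of something that isn’t listed in __slots__:

```python
class Book:
    __slots__ = ["title", "author"]

    def __init__(self, title=None, author=None):
        self.title = title
        self.author = author


book = Book(title="Dune", author="Frank Herbert")

# No per-instance dict is created for a slotted class
print(hasattr(book, "__dict__"))
# False

# Attributes outside __slots__ are rejected
try:
    book.publisher = "Chilton"
except AttributeError:
    print("can't add publisher")
# can't add publisher
```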

Classes in Data Engineering

This is all well and good, but what use are classes in Data Engineering?

There are loads of use cases. You can use classes to represent individual data items. You can use a class to wrap an iterator over a file of records, where it loads up each record and does some processing on it before returning it to your larger code. You can use a class to encapsulate and simplify interacting with an API.
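As a rough sketch of the "wrap an iterator on a file of records" idea: the RecordReader name, the JSON-lines format and the title clean-up step are all assumptions made for this example, and a StringIO stands in for a real open file.

```python
import io
import json


class RecordReader:
    """Iterate over a JSON-lines stream, yielding one cleaned-up record at a time."""

    def __init__(self, stream):
        self.stream = stream

    def __iter__(self):
        for line in self.stream:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Example processing step: tidy up the title's whitespace and case
            record["title"] = record["title"].strip().title()
            yield record


# A StringIO stands in for an open file here
fake_file = io.StringIO('{"title": "  dune "}\n\n{"title": "emma"}\n')

records = list(RecordReader(fake_file))
print(records)
# [{'title': 'Dune'}, {'title': 'Emma'}]
```

The calling code only ever sees clean records; how they’re read and tidied up stays inside the class.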

Our Airflow instance uses KubernetesPodOperator, which is a class that abstracts away a whole lot of complexity, so that you can spin up tasks in a Kubernetes pod without having to worry about all those complicated gubbins.

You can use classes to represent different types of records within a large non-standard record set, or you can use them to group special properties together to help you realize some complex functionality or logic.

For example, you can implement a new PublicationId type which wraps around the various types of publication IDs (DOI, pmc, pmcid, dimensions id, rf_id, etc.) and implements complex __eq__ logic to decide if two publications are one and the same.
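A very stripped-down sketch of what such a type might look like follows. Everything here is hypothetical: it only knows about two of the ID kinds, and the "equal if any shared ID matches" rule and the normalisation steps are assumptions for illustration, not a real matching policy.

```python
class PublicationId:
    """Hypothetical wrapper around the various IDs a publication may carry."""

    def __init__(self, doi=None, pmcid=None):
        # Normalise so that trivially different spellings compare equal
        self.doi = doi.strip().lower() if doi else None
        self.pmcid = pmcid.strip().upper() if pmcid else None

    def __eq__(self, other):
        if not isinstance(other, PublicationId):
            return NotImplemented
        # Assumed rule: same publication if any ID that both sides have matches
        if self.doi and other.doi and self.doi == other.doi:
            return True
        if self.pmcid and other.pmcid and self.pmcid == other.pmcid:
            return True
        return False

    # Defining __eq__ without __hash__ makes instances unhashable, which is
    # left as-is here because this "any match" rule has no consistent hash.


a = PublicationId(doi="10.1000/XYZ123", pmcid="pmc42")
b = PublicationId(doi=" 10.1000/xyz123 ")
c = PublicationId(pmcid="PMC99")

print(a == b)
# True
print(a == c)
# False
```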

Hopefully this one’s been useful for y’all!