The idea of this post is to clarify that the notebook we use at the research stage to build the model is not what we use as production code. So what do we use, and why?
A pipeline is a set of data processing steps connected in series, where typically, the output of one element is the input of the next one. The elements of a pipeline can be executed in parallel or in time-sliced fashion.
At the risk of repeating myself, we deploy the entire pipeline, not just the model. So, in essence, we have to write production code for the entire pipeline. Notebooks (Python Jupyter notebooks) are not used to deploy the model; we deploy it using Python scripts. There are, of course, several ways to write machine learning production code (procedural, object-oriented, etc.). The scripts contain the feature creation, transformation, selection, model training, and model scoring steps, which we can call and run one after the other in a predefined order.
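To make the idea of steps run in a predefined order concrete, here is a minimal sketch of such a script. Everything in it is hypothetical: the column names, the toy data, and the stand-in "model" (a single slope) exist only to show how the output of one step becomes the input of the next.

```python
from statistics import mean

# Hypothetical toy pipeline: each step takes the previous step's output.

def create_features(rows):
    """Feature creation: add a derived column."""
    return [{**r, "area": r["width"] * r["height"]} for r in rows]

def transform(rows):
    """Transformation: scale the derived feature relative to its maximum."""
    largest = max(r["area"] for r in rows)
    return [{**r, "area_scaled": r["area"] / largest} for r in rows]

def select_features(rows):
    """Selection: keep only the columns the model needs."""
    return [{"x": r["area_scaled"], "y": r["price"]} for r in rows]

def train(rows):
    """Training: a stand-in 'model' that learns the average price per unit of x."""
    return {"slope": mean(r["y"] / r["x"] for r in rows)}

def score(model, rows):
    """Scoring: apply the trained model to prepared rows."""
    return [model["slope"] * r["x"] for r in rows]

def run_pipeline(raw_rows):
    """Run every step in a fixed order, feeding each output to the next step."""
    rows = raw_rows
    for step in (create_features, transform, select_features):
        rows = step(rows)
    model = train(rows)
    return model, score(model, rows)

raw = [
    {"width": 2, "height": 3, "price": 60.0},
    {"width": 4, "height": 3, "price": 120.0},
]
model, predictions = run_pipeline(raw)
```

Because the order lives in one place (`run_pipeline`), there is no hidden state and no way to run the steps out of sequence, which is the main thing a notebook cannot guarantee.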
While preparing the production code, we typically include a YAML file that holds the configuration: all the bits of information that are critical and shared across the functions and the train and score scripts.
Reproducibility is an issue with notebooks. Because of the hidden state and the potential for arbitrary execution order, generating a result in a notebook isn’t always as simple as clicking “Run All.” So for the shared settings we use something very simple and declarative like YAML. Testing and collaborating are also simpler and easier through YAML.
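As a rough sketch of what such a shared file might look like, the snippet below embeds a small YAML document and reads it with `yaml.safe_load` from PyYAML (a third-party package, so this assumes it is installed). The keys, paths, and values are invented for illustration and are not taken from any real project.

```python
# Hypothetical settings shared by the train and score scripts.
# In a real project this text would live in its own file, e.g. config.yml.
CONFIG_TEXT = """
data:
  train_path: data/train.csv
  target: price
features:
  - width
  - height
model:
  alpha: 0.001
  random_state: 42
"""

def load_config(text):
    """Parse YAML text into plain Python dicts, lists, and scalars."""
    import yaml  # PyYAML, third-party: pip install pyyaml
    return yaml.safe_load(text)

def train_settings(config):
    """What a training script would read from the shared config."""
    return config["model"]["alpha"], config["features"]

def score_settings(config):
    """What a scoring script would read from the same config."""
    return config["data"]["target"], config["features"]
```

Both scripts read the feature list from the same place, so changing it once changes it everywhere.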
What is YAML? Why is it used? How did it squeeze its way into data science?
YAML (“YAML Ain’t Markup Language”) is a human-readable data-serialization language. Let’s break this down. First, the recursive acronym: a recursive acronym is an acronym that refers to itself, so “YAML Ain’t Markup Language” works the same way as Wine, which stands for “Wine Is Not an Emulator”. With that sorted, we are left with two more questions: what is a markup language, and what is data serialization?
Each programming language has its own syntax and style. Markup languages are peculiar in this respect: they use tags (much like price tags on garments) that signify what they hold, so that the content can be displayed, for example on a web page. For instance, <table> signifies a table to display. Why tags? Because they are more human-readable than typical programming syntax.
What is serialization?
This definition comes from the Microsoft docs: serialization is the process of converting an object into a stream of bytes to store the object or transmit it to memory, a database, or a file. Its main purpose is to save the state of an object in order to be able to recreate it when needed. The reverse process is called deserialization.
If you have a complicated data structure, its representation in memory might ordinarily be scattered throughout memory. (Think of a binary tree, for instance.)
In contrast, when you want to write it to disk, you probably want to have a representation as a (hopefully short) sequence of contiguous bytes. That’s what serialization does for you.
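The binary tree mentioned above makes a handy illustration. The sketch below is one possible approach, not the only one: it uses a hypothetical `Node` class and JSON as the wire format, flattens the tree into one contiguous sequence of bytes, and then rebuilds it.

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A binary tree node; in memory the nodes are separate, linked objects."""
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def to_dict(node):
    """Turn the linked nodes into nested dicts that JSON can represent."""
    if node is None:
        return None
    return {
        "value": node.value,
        "left": to_dict(node.left),
        "right": to_dict(node.right),
    }

def from_dict(data):
    """Rebuild the linked nodes from nested dicts."""
    if data is None:
        return None
    return Node(data["value"], from_dict(data["left"]), from_dict(data["right"]))

def serialize(node) -> bytes:
    """Serialization: object graph -> one contiguous byte sequence."""
    return json.dumps(to_dict(node)).encode("utf-8")

def deserialize(payload: bytes):
    """Deserialization: byte sequence -> equivalent object graph."""
    return from_dict(json.loads(payload.decode("utf-8")))

tree = Node(2, Node(1), Node(3))
payload = serialize(tree)        # bytes that can be written to disk or sent
restored = deserialize(payload)  # a new, equal tree
```

The restored tree is equal to the original but is a different object: only the state travelled through the bytes, not the memory addresses.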
The trouble I have is: aren’t all variables (be it primitives like int or composite objects) already represented by a sequence of bytes?
Yes, they are. The problem here is the layout of those bytes. A simple int can be 2, 4 or 8 bytes long. It can be big-endian or little-endian. It can be unsigned, signed with ones’ or two’s complement, or even in some super exotic bit coding like negabinary.
If you just dump the raw bytes of the int from memory and call it “serialized”, you have to attach pretty much the entire computer, operating system and your program for it to be deserializable. Or at least a precise description of them.
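Python's standard `struct` module makes the layout problem easy to see. This small sketch packs the same integer with different sizes and byte orders, then reads one of the results back with the wrong assumption.

```python
import struct

value = 1

# Same number, different sizes and byte orders.
little_4 = struct.pack("<i", value)  # 4 bytes, little-endian
big_4 = struct.pack(">i", value)     # 4 bytes, big-endian
little_2 = struct.pack("<h", value)  # 2 bytes, little-endian
little_8 = struct.pack("<q", value)  # 8 bytes, little-endian

# Reading little-endian bytes as if they were big-endian gives a wrong number.
misread = struct.unpack(">i", little_4)[0]

print(little_4, big_4, len(little_2), len(little_8), misread)
```

Here `misread` comes out as 16777216 instead of 1, which is why the byte layout has to be agreed on (or described) before raw bytes mean anything.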
So what makes serialization such a deep topic? To serialize a variable, can’t we just take these bytes in memory, and write them to a file? What intricacies have I missed?
Serialization of a simple object is pretty much writing it down according to some rules. Those rules are plenty and not always obvious. For example, an xs:integer in XML is written in base 10. Not base 16, not base 9, but 10. It’s not a hidden assumption, it’s an actual rule. And such rules are what make serialization a serialization, because there are pretty much no rules about the bit layout of your program in memory.
What is a configuration file?
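A tiny sketch of "writing it down according to a rule": the two hypothetical helpers below serialize an integer as base-10 ASCII digits, in the spirit of the xs:integer example. The result no longer depends on how the machine lays the number out in memory.

```python
def serialize_int(value: int) -> bytes:
    """Write the integer as base-10 ASCII digits, with a leading '-' if negative."""
    return str(value).encode("ascii")

def deserialize_int(payload: bytes) -> int:
    """Read base-10 ASCII digits back into an integer."""
    return int(payload.decode("ascii"), 10)

payload = serialize_int(-255)
restored = deserialize_int(payload)
```

The same bytes come out on any platform, because the rule (base 10, ASCII) is part of the format rather than an accident of the hardware.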
Configuration:
An arrangement of parts or elements in a particular form, figure, or combination.
Configuration file:
A file that contains configuration information for a particular program. When the program is executed, it consults the configuration file to see what parameters are in effect.
The vast majority of the computer programs we use, be they office suites, web browsers, or even video games, are configured through a system of menu interfaces. It has almost become the default way we use our machines. But some programs require you to take a step beyond that, and you actually have to edit a text file in order to get them to run as you wish. These text files are, unsurprisingly enough, called “config files”.
Config files are essentially structured files that contain the information a program needs to operate successfully. Rather than being hard-coded in the program, these settings are user-configurable, and are typically stored in a plain text file.
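To show what "consulting a config file instead of hard-coding" can look like, here is a small sketch using Python's standard `configparser`. It assumes an INI-style file purely because that format ships with the standard library; the section name, keys, and values are hypothetical, and the same pattern applies to a YAML file.

```python
import configparser

# Built-in defaults, used when the user's file does not mention a setting.
DEFAULTS = {"theme": "light", "autosave_minutes": "10"}

# Stand-in for the contents of a user-editable plain text file.
USER_CONFIG = """
[editor]
theme = dark
"""

def load_settings(config_text):
    """Start from the defaults, then let the config file override them."""
    parser = configparser.ConfigParser()
    parser.read_dict({"editor": DEFAULTS})
    parser.read_string(config_text)
    section = parser["editor"]
    return {
        "theme": section["theme"],
        "autosave_minutes": section.getint("autosave_minutes"),
    }

settings = load_settings(USER_CONFIG)
```

The user changed only the theme, so the autosave interval keeps its default; nothing in the program had to be edited or redeployed.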
Why YAML:
YAML is less verbose than XML while still allowing developers to describe precisely what they want. It also allows for greater flexibility in how you store your data.
YAML files are usually used to configure something. For example, Docker Compose uses a YAML file to configure Docker containers.
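The verbosity difference is easy to check on a small example. The sketch below writes the same three hypothetical settings as XML (built with the standard `xml.etree.ElementTree`) and as YAML-style `key: value` lines (assembled by hand here, which is valid YAML for simple scalars like these), then compares their lengths.

```python
import xml.etree.ElementTree as ET

# Hypothetical settings, kept as strings so both formats store the same text.
settings = {"model": "ridge", "alpha": "0.001", "target": "price"}

# XML: every value needs an opening and a closing tag.
root = ET.Element("config")
for key, value in settings.items():
    ET.SubElement(root, key).text = value
xml_text = ET.tostring(root, encoding="unicode")

# YAML: one "key: value" line per setting.
yaml_text = "\n".join(f"{key}: {value}" for key, value in settings.items())

print(xml_text)
print(yaml_text)
print(len(xml_text), len(yaml_text))
```

The gap grows with nesting, since XML repeats each tag name twice while YAML expresses structure through indentation.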
YAML positions itself as a “human-readable data-serialization” language. So the intent is clear – to make it easy to read (and write!) structured data.
References:
https://en.wikipedia.org/wiki/YAML
https://blog.stackpath.com/yaml/
https://circleci.com/blog/what-is-yaml-a-beginner-s-guide/
https://blog.codemagic.io/what-you-can-do-with-yaml/
https://en.wikipedia.org/wiki/Serialization
https://stackoverflow.com/questions/1726802/what-is-the-difference-between-yaml-and-json
https://www.webopedia.com/TERM/C/configuration_file.html
https://www.techopedia.com/definition/867/serialization-net
https://cs.stackexchange.com/questions/72102/understanding-serialization
https://dzone.com/articles/what-is-serialization-everything-about-java-serial
https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/serialization/