**Data Structures and Algorithms - Introduction**

## What They Are and Why They Are Important

Unfortunately, many software developers outside the technology industry either don't know data structures and algorithms, don't care to use them, or simply don't understand their importance to software development and the impact their efficiency has on the final product.

In practical terms, there are two choices when producing quality, scalable code: ad hoc practices, or structured development and best practices combined with agile methodologies.

Surprisingly, some large companies, specifically in the automotive, agriculture, transportation, and even engineering sectors, do not enforce such a work philosophy, and the final product always suffers the consequences of bad management and a lack of organization when it comes to technical diligence and quality. Who suffers the most? The user, which could be a large-scale client or, in the end, people like you and me who may or may not be the final consumers.

The path usually taken under the first approach (ad hoc) is to fall into denial, blame the client's requirements rather than one's own shortfalls, and repeat the cycle; needless to say, this is wrong. In my experience, bad professionals (professionals by title, not in fact) take advantage of two things: company size (the larger the company, the more bureaucracy one has to go through) and, especially, the lack of technical training among their managers, and they are able to keep on living off of and infecting the company's health for many years this way.

I know this sounds a little harsh, and you may be wondering what sort of experiences led me to think this way. I was surprised too, but after over 30 years in the field I've learned this may be more common than most of us realize or want to admit. I'm fond of a saying I found some time ago on Twitter: "facts don't care about your feelings." In other words, this is reality, and if it is a problem in the tech world, then we should face it head on. That said, this is not the focus of this article; rather, the goal here is to point out the important technical tools every software developer should carry, clarify a bit what they are, and give a head start on searching for and learning about them.

Software developers should understand and be comfortable using data structures. They are an important part of every professional's skill set because they support important decisions that lead to more efficient development: they help determine the run-time cost of memory allocation and storage, and how data manipulation can strongly influence the output over the lifetime of an implementation.

**What Are Data Structures?**

Data structures are models used to organize, insert, remove, and store data. In a world of fast-changing technology, it becomes ever clearer that gathering data is what may set competitors apart in an evolving market, and only those who know how to control and organize the data they receive will survive tough competition in the long run and come out on top as successful businesses.

**Why are they important?**

Considering the SDLC (Software Development Life Cycle) of every project, data structures can be used to reduce operations at run time and produce faster, more efficient results. To do this, we first determine which type of data structure best fits the code we are writing, and then perform what is called complexity analysis, usually divided into time complexity and storage (space) complexity analysis.

Nowadays, storage complexity calculations are still important, but storage is no longer as much of a concern because memory is very cheap compared to years ago, when it was very expensive. Time complexity, on the other hand, is critical and is the main concern when designing a maintainable and scalable system. It's important to understand that data structures go hand in hand with complexity analysis: the intent of using them is to make the system perform faster and more efficiently, and complexity analysis is the tool that lets us understand the advantages of one design over another and determine the best solution for the code. One problem may have many different approaches, and complexity analysis is the judge of which approach best fits the use case, as it is the metric for measuring code performance at run time.

**Complexity Analysis**

The holy grail of writing code is developing sustainable, scalable, and efficient solutions, but we all know there is no such thing as "perfect code." Developers do embed traces of their own personalities into their code: some are more aggressive, some more mathematical, some more verbose; every coder has a different style and way of approaching a problem. Complexity analysis, or algorithm analysis, takes all of that away and narrows down what the code should look like if one is looking for an optimal solution. There are four notations used for this purpose:

1- Θ (Theta) notation;

2- Ω (Big Omega) notation;

3- O (Big O) notation;

4- ω (little omega) notation.

These notations are used to represent an algorithm's performance bounds. O notation is the most widely used and denotes the worst-case scenario (i.e., the slowest outcome), while Ω notation denotes the best-case scenario (i.e., the fastest outcome). When analyzing an algorithm's performance, the goal is to get as close as possible to constant time, since that is the fastest your code can execute on the scale used.
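To make these notations concrete, here is a small Python sketch (the function names are my own, for illustration) that counts the comparisons performed by a linear search versus a binary search over the same sorted input; the counts grow as O(n) and O(log n) respectively:

```python
# Count comparisons made by linear vs. binary search on a sorted list,
# to illustrate O(n) versus O(log n) growth.

def linear_search(items, target):
    comparisons = 0
    for i, value in enumerate(items):
        comparisons += 1
        if value == target:
            return i, comparisons
    return -1, comparisons

def binary_search(items, target):
    comparisons = 0
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if items[mid] == target:
            return mid, comparisons
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, comparisons

data = list(range(1024))                   # sorted input, n = 1024
_, linear_ops = linear_search(data, 1023)  # worst case: the last element
_, binary_ops = binary_search(data, 1023)
print(linear_ops, binary_ops)              # 1024 vs. 11 comparisons
```

For the worst-case target, the linear search touches all 1024 elements while the binary search halves the range each step, which is the practical difference between O(n) and O(log n).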

Searching resources online, you will be able to find graphs showing the gradation of all the different time complexities.


*Disclaimer: My opinions are based upon my own experience, gathered over many years of working in the industry, and are not tied to any particular company.*


Ideally, the code's performance should fall into the green area of such a graph, but most of the time it is acceptable for it to fall into the orange one at no major cost to the final outcome.

If you want to know more about complexity analysis and how to use it in your development, click here.

**Categories of Data Structures**

Data structures are basically divided into two main categories:

1 – Inbuilt Data Structures

2 – Abstract Data Types

__Inbuilt Data Structures__

These are data structures embedded into most programming languages, and they are used to back other types of data structures.

Examples of inbuilt data structures are:

- Arrays

- Structures (data type definition)

__Abstract Data Types__

Abstract data types (ADTs) are defined by their behavior rather than by how they are implemented. Such behavior may be the interactions between operations that use the data and/or the possible values the data can take.

ADTs are concerned only with what operations are performed, not with how those operations are implemented. This gives the developer the flexibility to try many different answers to a particular problem and then determine which one is most effective within the constraints of the requirements.

Because ADTs do not specify how the data is handled in the background, this process is called abstraction.
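As an illustration of this abstraction, here is a hypothetical sketch in Python: a stack ADT declared purely by its operations, with one possible implementation hidden behind it (the `StackADT` and `ArrayStack` names are mine, for illustration):

```python
from abc import ABC, abstractmethod

# An ADT specifies *what* operations exist, not *how* they are implemented.
class StackADT(ABC):
    @abstractmethod
    def push(self, item): ...
    @abstractmethod
    def pop(self): ...
    @abstractmethod
    def is_empty(self): ...

# One possible backing: a Python list (a dynamic array).
# Callers only ever see the ADT's operations.
class ArrayStack(StackADT):
    def __init__(self):
        self._items = []
    def push(self, item):
        self._items.append(item)
    def pop(self):
        return self._items.pop()
    def is_empty(self):
        return not self._items

s = ArrayStack()
s.push(1)
s.push(2)
print(s.pop())  # 2, because the last item pushed is the first popped
```

A linked-list-backed class implementing the same three methods could be swapped in without changing any calling code; that interchangeability is the point of an ADT.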

Most common ADTs are:

- Linked Lists;

- Stacks;

- Queues;

- Trees;

- Heaps;

- Hash tables;

- Graphs

*Linked Lists*

A linked list is a linear data structure in which each element is an object (a node) arranged in a sequence; each node contains data and a pointer referring to the next node. There are three types of linked lists: singly linked lists, doubly linked lists, and circular linked lists.

The choice of which one to use depends on the application being developed and the requirements at hand. One thing to bear in mind when choosing a linked list is the time complexity associated with each of them: by choosing the wrong variant for a given operation, one may degrade the code's performance from O(1) to O(n) in the worst case.
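A minimal singly linked list might be sketched in Python like this (the class and method names are my own, for illustration); note the constant-time insertion at the head and the sequential traversal:

```python
class Node:
    """A node holds one piece of data and a pointer to the next node."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

class SinglyLinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, data):
        """Insert at the head: O(1), no shifting, no pre-allocation."""
        self.head = Node(data, self.head)

    def to_list(self):
        """Traverse sequentially from the head: O(n)."""
        out, node = [], self.head
        while node:
            out.append(node.data)
            node = node.next
        return out

lst = SinglyLinkedList()
for x in (3, 2, 1):
    lst.push_front(x)
print(lst.to_list())  # [1, 2, 3]
```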

Advantages:

- linked lists make it easy to insert and remove items;

- they are a dynamic data structure and can grow or shrink at run time;

- there is no need for memory pre-allocation;

- insertion and deletion at a known node take constant time, with no shifting overhead;

- they make it easy to implement linear structures such as stacks and queues.

Disadvantages:

- they use more memory than arrays due to the need to store pointers;

- reverse traversal is difficult in singly linked lists; doubly linked lists make it possible, but at the cost of extra memory for back pointers;

- due to the sequential-access nature of linked lists, nodes must be read in order from the beginning of the list;

- because nodes are stored non-contiguously (not in sequential memory blocks), random access to individual elements degrades time complexity and makes poor use of the CPU cache.

*Stacks*

A stack is an abstract data type that stores elements in a specific format called LIFO (Last In, First Out). Stacks don't dictate how the data is organized internally, but rather what operations can be performed on it; in that sense, stacks can be backed by other data structures, e.g., arrays or linked lists.

The characteristic of LIFO is that the last element added to the stack is always the first element removed from it. This means there is no random access to its elements: accessing a specific element without removing the ones above it, as we can in an array, is not possible. Imagine a deck of playing cards: if you want to remove the fifth card from the top of the deck, you first need to remove the four cards above it.
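The deck-of-cards behavior can be sketched with a plain Python list used as a stack:

```python
# A stack backed by a Python list; append/pop both operate on the top (LIFO).
deck = []
for card in ["ace", "king", "queen", "jack", "ten"]:
    deck.append(card)          # push: "ten" ends up on top

# To reach the fifth card from the top ("ace"), the four above it
# must be popped first; there is no random access.
removed = [deck.pop() for _ in range(4)]
print(removed)     # ['ten', 'jack', 'queen', 'king']
print(deck.pop())  # 'ace'
```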

Advantages:

- elements are organized in LIFO format, so the most recently added element can be accessed promptly;

- push and pop take constant time, O(1) (amortized, in the case of a dynamic array backing), whether the stack is backed by a linked list or an array;

- useful for keeping track of states or the order in which events occurred;

- in many languages, local variables are stored on the call stack automatically at function calls and destroyed when the function returns, with no need for garbage collection.

Disadvantages:

- no random access to elements;

- a fixed-size array backing can overflow, and resizing a dynamic array backing occasionally costs O(n);

- storage overhead increases when backed by linked lists, due to the extra pointers.

*Queues*

A queue is an abstract data type very similar to a stack, but instead of handling data in LIFO order, it handles data in FIFO (First In, First Out) order. Let's use the same deck of playing cards to illustrate: with the stack, we had to take four cards off the top to reach and remove the fifth one. In a queue, however, elements leave from the opposite end from where they entered, so to remove the fifth card from the top we would have to work through the deck starting from the bottom. If the deck contains 64 cards, we would have to take out 59 cards before reaching the fifth element from the top, since coming from the bottom we need to traverse 64 - k elements (where 'k' is the position of the element to be removed, counted from the top).
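A queue can be sketched with Python's `collections.deque`, which supports constant-time operations at both ends:

```python
from collections import deque

# A queue: elements leave in the order they arrived (FIFO).
line = deque()
for customer in ["first", "second", "third"]:
    line.append(customer)   # enqueue at the back: O(1)

print(line.popleft())  # 'first': dequeue from the front, O(1)
print(line.popleft())  # 'second'
```

Using a plain Python list here would make each `pop(0)` O(n), since every remaining element shifts left; `deque` exists precisely to avoid that.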

Advantages:

- organizes data in a sequential FIFO format;

- it can be backed by other data structures;

- it can handle multiple data types;

- time complexity is constant (O(1)) for both insertion (at the back) and deletion (at the front) when implemented with a linked list, but deletion is linear time (O(n)) with a naive array implementation (a circular buffer restores O(1)).

Disadvantages:

- adding elements in the middle is impractical and requires cumbersome implementations;

- searching for elements degrades time complexity.

*Trees*

A tree is one of the most widely used data structures; it models a hierarchical structure. It is built from a root value and sub-trees called children, represented as a set of linked nodes. When searching for a specific element, the tree is traversed through its children until the element is found.

Arguably, trees can be considered by some as abstract data types and by others as data structures, because trees do dictate how the data is organized. One can implement them with tree and tree-node classes, which is commonly done, but one can certainly back certain types of trees with an array.

The terms and components used with trees:

- Root: the first node in the tree; every tree must have a root node;

- Parent node: the immediate predecessor of a node;

- Child node: a node that descends directly from another node, which in turn is its parent;

- Siblings: nodes that share the same parent;

- Leaf node: a node that has no children of its own, also known as a terminal node;

- Internal node: a node that has at least one child;

- External node: another name for a leaf node, a node with no children;

- Degree: the number of children of a node; the highest degree among all nodes is called the degree of the tree;

- Level: each step from top to bottom is called a level; nodes reached by the same number of steps from the root sit on the same level;

- Edges: the links that connect one node to another;

- Height: the number of edges on the longest path from a particular node down to a leaf; the height of the root is called the height of the tree;

- Depth: the number of edges from the root to a specific node is called the depth of that node;

- Path: the sequence of nodes and edges leading from one node to another.

Trees can be of different types:

- binary trees;

- multi-way search trees;

- AVL trees.
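To make the binary-search-tree case concrete, here is a minimal sketch in Python (the `TreeNode`, `insert`, and `contains` names are my own, for illustration); each comparison discards roughly half of a balanced tree, which is where the O(log n) figure below comes from:

```python
class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None    # subtree of values smaller than this node's
        self.right = None   # subtree of values greater than this node's

def insert(root, value):
    """Insert into a binary search tree; O(log n) on average when balanced."""
    if root is None:
        return TreeNode(value)
    if value < root.value:
        root.left = insert(root.left, value)
    elif value > root.value:
        root.right = insert(root.right, value)
    return root

def contains(root, value):
    """Walk down the tree, discarding one subtree per comparison."""
    while root is not None:
        if value == root.value:
            return True
        root = root.left if value < root.value else root.right
    return False

root = None
for v in (8, 3, 10, 1, 6):
    root = insert(root, v)
print(contains(root, 6), contains(root, 7))  # True False
```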

Advantages:

- trees provide moderate access/search capabilities, and time complexity may be better than a linked list's when using a binary search tree, for instance;

- insertion and deletion are faster than in arrays;

- specifically for balanced binary search trees, search takes O(log n) instead of O(n);

- traversal of a sorted (search) tree makes it easy to find elements in order.

Disadvantages:

- trees can quickly grow unbalanced, degrading time complexity;

- they are unstable, meaning that changes in the data can have a large impact on the optimal arrangement of the tree structure;

- their performance is hard to predict, since it depends on the shape the incoming data gives the tree;

- it takes O(log n) time to retrieve an element from a balanced search tree even when its location is known; the same operation can be done in constant time, O(1), with other data structures (arrays with known indexes, for example).

*Heaps*

Heaps are specialized tree-based data structures that follow a specific rule for data arrangement, divided into two main formats:

- Max heap: the value of each parent node is greater than or equal to the values of its children;

- Min heap: the value of each parent node is less than or equal to the values of its children.
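Python's standard `heapq` module implements a binary min heap on top of a plain list; here is a short sketch of the min-heap rule in action:

```python
import heapq

# heapq maintains the min-heap rule: every parent <= its children,
# so the smallest element is always at index 0.
heap = []
for value in (7, 2, 9, 1, 5):
    heapq.heappush(heap, value)  # O(log n) per insertion

print(heap[0])              # 1: the minimum, read in O(1)
print(heapq.heappop(heap))  # 1: removal re-establishes the rule in O(log n)
print(heap[0])              # 2: the next-smallest has bubbled to the root
```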

(Note that the heap data structure is not the same thing as the "heap" memory area used for dynamic allocation in languages such as Java and C#; the two only share a name.)

Advantages:

- searching for the max value (in a max heap) or the min value (in a min heap) is easy: it sits at the root and is read in constant time;

- insertion, and removal of the root, take O(log n) time;

- heaps can be backed compactly by an array, with no pointer overhead;

- heaps support efficient algorithms such as heapsort, priority queues, and graph algorithms like Dijkstra's shortest path.

Disadvantages:

- apart from the root, searching for an arbitrary element takes linear time, since a heap is only partially ordered;

- storing data in a heap can degrade time complexity if the design does not actually need the heap rule;

- the heap rule is weaker than full sorted order, so traversing the data in sorted order is not directly available.

*Hash Tables*

A hash table is a data structure that stores data in two parts: a key and a value.

Keys are mapped through a hash function and converted to an integer, which is then used to locate a slot in an array by serving as that element's index. Values stored in hash tables can be looked up in average constant time, O(1), by applying the same hash function to the key to regenerate the address.

The process of mapping these converted keys to locations in the array (i.e., their indexes) is known as hashing.

The value of a hash table is determined by its hash function. A good hash function maps keys as evenly as possible, meaning the generated hash values should be spread uniformly across the output range. By the same token, the hash function must be deterministic: for a given input value, it must always return the same hash value.
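As a sketch of the idea (Python's built-in `dict` is the real, production-quality implementation), here is a hypothetical minimal hash table that maps keys to buckets with a hash function and chains colliding entries:

```python
# A minimal hash table using chaining to resolve collisions.
class HashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        # hash() is deterministic within a run; modulo maps it to a bucket.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:             # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))  # collision? just chain in the bucket

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = HashTable()
table.put("apples", 3)
table.put("pears", 5)
print(table.get("apples"))  # 3
```

With few collisions, `put` and `get` inspect only one short bucket, which is where the average O(1) behavior comes from.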

Advantages:

- searching is, on average, more efficient than in trees;

- in concurrent settings, access can be synchronized efficiently (for example, per bucket);

- insertion and deletion can be done in average constant time, O(1).

Disadvantages:

- collisions: a hash function may map different keys to the same address. Two records cannot be stored in the same location, so collision-resolution techniques (such as chaining or open addressing) are needed;

- a poorly designed hash function leads to many collisions;

- hash tables are poor at keeping data sorted, since keys are scattered across memory locations.

*Graphs*

Graphs are non-linear data structures comprised of nodes and edges. Nodes can be referred to as vertices, and edges can be referred to as the lines or arcs that connect two nodes in the graph (see the illustration).

For instance, in the graph the set of vertices is V = {2, 0, 1, 4, 3} and the set of edges is E = {(2,0), (0,1), (1,3), (3,4), (0,4), (1,4)}.

Graphs can be represented by an array of vertices and a two-dimensional array of edges. Basic operations on graphs are: add vertex, add edge, and display vertex. There are some important terms to learn when working with graphs:

- Vertex: each node of the graph is represented by a vertex; in the illustration above, the labelled circles represent vertices;

- Edge: an edge represents the connection between two vertices, i.e., the line between them. In the illustration shown, the line connecting circle 1 to circle 3 is an edge, the line connecting circle 4 to circle 0 is an edge, etc.;

- Adjacency: two nodes or vertices are adjacent if they are connected to each other through an edge. In the illustration, 1 is adjacent to 0, 3, and 4;

- Path: a sequence of edges between two vertices. For instance, in the illustration, 1-3-4-2 represents a path from 1 to 2.
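The example graph above can be sketched as an adjacency list, a common representation in which each vertex maps to the set of vertices it touches; adding a vertex or an edge is O(1):

```python
# Adjacency-list representation of the example graph:
# vertices {2, 0, 1, 4, 3} and edges (2,0), (0,1), (1,3), (3,4), (0,4), (1,4).
from collections import defaultdict

graph = defaultdict(set)

def add_edge(u, v):
    """Undirected edge: record the adjacency in both directions, O(1)."""
    graph[u].add(v)
    graph[v].add(u)

for u, v in [(2, 0), (0, 1), (1, 3), (3, 4), (0, 4), (1, 4)]:
    add_edge(u, v)

print(sorted(graph[1]))  # [0, 3, 4]: vertex 1 is adjacent to 0, 3 and 4
```

Note that the printed adjacency matches the Adjacency entry above.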

Advantages:

- easy representation of relationships between different types of data;

- time complexity to add nodes and edges is constant, i.e., O(1), with an adjacency-list representation;

- graphs can make large amounts of connected data quite simple to work with;

- they can be a very effective visual tool;

- graphs can represent trends or comparisons very easily.

Disadvantages:

- they can be very expensive in terms of memory usage, and their irregular memory-access patterns are slow on modern hardware: an access that misses the cache usually needs somewhere between 100 and 200 cycles;

- some representations cannot be extended to accommodate queries about the whole set of vertices or the whole set of edges;

- depending upon the implementation, a graph may not be efficient.

Perhaps this was a bit of a long article, but if you managed to read all the way to this point, it shows you are interested in learning more about the subject and willing to improve your skills to become a more efficient developer, so I will include some links to extra resources for practicing and learning these important skills. Obviously, there is a lot more to learn about data structures than this article describes, but if you put in the required time and effort, it's well worth it.

The bottom line is: if you know data structures and are comfortable working with them and selecting the best one for your solution, you are on the right path to producing cleaner and more scalable code, and believe me, this is a rare skill among those who develop software and call themselves software engineers.

Check out these links; they can help you with the practical aspects of data structures and with the learning process (some of them may require a subscription).

https://leetcode.com/

https://www.hackerrank.com/

https://ocw.mit.edu/index.htm

https://www.topcoder.com/challenges

https://www.w3schools.com/

https://www.algoexpert.io/product

https://www.bitdegree.org/learn/

https://www.codecademy.com/

https://www.edx.org/course/subject/computer-science

https://www.khanacademy.org/computing/computer-programming

https://www.codewars.com/

https://code.org/

https://www.udemy.com/

https://dash.generalassemb.ly/

https://www.freecodecamp.org/

https://www.codeconquest.com/

https://www.skillshare.com
