Understanding the Foundations of Web 3.0 Website Design

June 7, 2022

You can barely go five minutes these days without hearing about Web 3.0. But what actually is it?

When it comes to designing an effective and pleasing website for the modern era, it’s essential to have a
thorough understanding of what’s coming. New tech is revolutionizing the way we interact online.

This article will explain what’s changing and how you can take advantage of it to create well-designed
websites that work well today and in the near future.

So let’s kick things off with one simple question.

What is Web 3.0?

Unfortunately, there's no one universally agreed definition of Web 3.0. There are a number of traits that
we can say for sure are included, though. It's perhaps easiest if we start by taking a look at what came before.

We could say that Web 1.0 was characterized by static web pages and a focus on individual web users
consuming content written by a relatively small number of creators. Over time, this developed into what
we can call Web 2.0, which was all about sharing. Typical examples of this are social networks like
Twitter or Facebook, or the use of a cloud collaboration tool for business.

If there’s one word that sums up Web 3.0, on the other hand, it’s probably decentralization. How data is
managed will be key. The idea is that there will be fewer huge industry behemoths like Google keeping
control of the data environment.

In addition, we can expect to see the advent of the semantic web. That is, where web pages are tagged and
structured in such a way as to be directly readable to computers. Machine learning and AI-focused tools
will also become an ever more commonplace part of the online experience and web design trends.


So what does all this mean for those looking to implement effective website design?

Best design principles for Web 3.0

As ever with design, it’s vital to consider your project from multiple perspectives. First, it’s crucial
to think about the user experience, particularly if you’re working on a site for a specialist area such
as decentralized finance (DeFi). Using customer analytics, it’s easy to see that retaining customers is much
less expensive than attracting new ones. The design of a site makes a vital contribution to doing this.

And of course, as well as nailing the visual elements and content of a website, you need to look at the
underlying technology and how best to work with it.

Design for decentralized apps platforms

There are several differences between designing for Web 2.0 and Web 3.0 when it comes to optimizing UX.
This is particularly the case when you’re working with apps built on the blockchain ecosystem. The
precise nature of the differences will vary depending on the type of project you’re designing for, but
we’ll use DeFi as an example to demonstrate the kinds of issues to be aware of.

Minimize jargon

Not all content is equal. In the world of content marketing, vast amounts of time and money are wasted
fixing content problems after the fact. It's a salutary reminder of how important it is to get things
right the first time.

A page stuffed with technical jargon can scare people away almost as soon as they’ve arrived. So keep the
content simple and to the point.

Show enough information but not too much

When it comes to something like DeFi, you’ll be designing for a split crowd: both experts and newcomers.
That means you’ll have to strike a careful balance. On the one hand, it’s crucial to design a site so
that it makes all the necessary information easily accessible. On the other hand, it’s easy to overwhelm
newbies with too much information.

One approach that can be useful here is to design in different layers of complexity. You can aim for a visual balance that directs users to the content they will be more comfortable
using. Allow users to toggle between different settings to access the level of information they feel is
right for them.



Design to encourage user education

While it’s ideal that users feel comfortable with the level of information they’re exposed to, it’s also
important to subtly encourage them to expand their knowledge. In the DeFi space, inexperienced users
will be meeting some concepts for the first time.

An effective way of doing this is to design the site so users are gradually exposed to more detailed
information. Lead them through concepts such as blockchain step by step, framing them in such a way that
they resonate with already familiar experiences. Furthermore, if you have a product, sticking to your
branding or complementing your product and packaging design will show uniformity and will make your business memorable.
Highlight benefits such as security and freedom from censorship.

Be transparent about security and transactions

At the same time, it’s crucial to acknowledge that hacks do sometimes happen. It’s all too easy for a
user to simply assume that the new wondertech will solve all traditional problems, but it’s important
not to get complacent.

Just as a good sales manager might let quoting software free their team up from the more mundane tasks of
their role but wouldn't want them to become totally dependent on it, so it is with decentralized apps.
Remind readers to use their own common sense.

Make it easy for them to understand the transaction process. Make sure to show all relevant information
clearly: the breakdown of the transaction, whether it’s pending or finalized, the value in fiat
currency, and the gas fees and time to complete.

Emphasize irreversibility

This is a big one. For anyone negotiating blockchain tech for the first time, it can be a real stumbling
block. You can’t mention too often that transactions on the blockchain cannot be reversed. Most people
have been conditioned to expect that they will always be able to reset a password or reverse a bank
payment. It takes a serious mental shift to unlearn that.

Build in design features that reduce the chance of disaster. Multiple-step confirmation processes for
transactions are a great idea, for example.



Design visuals

From enterprise VoIP to big name retailers, every business relies on powerful branding. And although
branding is about so much more than logos, the design visuals are perhaps the most tangible aspect. Web
3.0 has a different visual feel than the previous iterations of the web.

Focus on using more illustrations and visual content than before. You can use design elements in the background like shapes and curves to encourage a
particular navigation path. Rather than depending on words to do most of the talking, explain as much as
you can via images.

More generally, you should consider integrating AR and VR elements into your design. This makes for a
more satisfying user experience and is one example of how developments in the underlying technology can
be instrumental in forming design choices.

Designing for Web 3.0 – working with the tech

In a world where MLOps open source software and AI tools are becoming ever more common, it's no surprise
that these developments have spilled over into design.

Machine learning is quietly revolutionizing all kinds of businesses in a thousand different ways. It’s
being used by the big online retailers throughout the sales cycle, from storage logistics to aftersales
customer service. It’s also being used to implement innovative research tools such as voice analytics
(What is voice analytics? It's the analysis not only of the content but also of the tone of voice on
customer sales calls, used to garner improved insights.)

ADI (artificial design intelligence) technology, which uses machine learning to build websites by itself,
is on the table. However, it’s at a very early stage and nowhere near the point of replacing human
designers. Instead, it can help streamline the design process.



The advent of the semantic web should also be uppermost in your mind when designing a site. That’s
because it will have a significant impact on how search results are decided. There will be a move away
from focusing on keywords and toward more contextual search answers.

Contextual search considers the context within which the search is made rather than simply answering the
search question. This means, for example, that the results served could vary by time of day, user
behavior, or even whether it’s sunny or raining when the search is made. This obviously poses a new kind
of challenge for SEO.

For starters, make sure your site is optimized for smart voice search features. Also pay attention to
using structured data. This will help make your site more friendly to machine learning algorithms and
should ensure that it is ready for the next stage in the life of the web.
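To make the structured-data point concrete, here's a minimal sketch in Python of a schema.org "Article" description. The field values are illustrative (taken from this article's own byline); on a live page the resulting JSON-LD would be embedded in a `<script type="application/ld+json">` tag rather than printed:

```python
import json

# A minimal schema.org Article snippet, built as a Python dict for
# illustration. Search engines and other machine-learning-driven
# consumers read this kind of markup directly.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Understanding the Foundations of Web 3.0 Website Design",
    "datePublished": "2022-06-07",
    "author": {"@type": "Person", "name": "Grace Lau"},
}

print(json.dumps(article, indent=2))
```

Even this tiny snippet gives a crawler unambiguous, machine-readable facts about the page instead of forcing it to infer them from prose.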

Putting it all together

These are exciting times! We’re heading toward a world where the web is going to be more personalized to
each individual. There’s still little widespread understanding of how much of an impact this will have.

We’ve covered a few of the principles it’s vital to bear in mind when designing for Web 3.0. The rest is
up to your imagination. It’s time to unleash your creativity!

Grace Lau – Director of Growth Content, Dialpad

Grace Lau is the Director of Growth Content at Dialpad, an AI-powered cloud call center platform for
small businesses that enables better and easier team collaboration. She has over 10 years of
experience in content writing and strategy. Currently, she is responsible for leading branded and
editorial content strategies, partnering with SEO and Ops teams to build and nurture content. Grace Lau
also published articles for domains such as UpCity and Soundstripe. Here is her LinkedIn.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

Regularisation Methods to Perform Regression

June 3, 2022

In the article where we discussed How to Fit Lines Using the Least Square Method, we used a simple residual error, which is nothing but the difference between the observed value and the predicted value, to fit the line. We try to minimise this error to get the optimal line.

When we have many data points this method works well, but when we have few data points it may lead to overfitting. To avoid overfitting we can use a regularisation method, also called a shrinkage method.



In this article we will be discussing Ridge Regression, Lasso Regression and Elastic Net Regression.

Ridge Regression:

Consider the above image, where we have just two points. In that image the sum of squared errors is zero. But when we compare it with the original data we can see that the line passing through the two points is not the optimal line.

In the least squares method we find the sum of squared errors and then try to determine the coefficient values that minimise that error. Ridge regression follows the same approach, except that it adds a penalty that shrinks the coefficients of less important variables. To understand this, consider the below equation:

[(Observed value) – (Predicted value)]² + [lambda * slope * slope]

The first part of the equation is the same as in the least squares method. The second part tries to shrink the coefficients of the variables so that the entire equation becomes minimal. This is the part that decides whether a variable is important or not, and it is called the shrinkage penalty. Lambda controls the amount of penalty applied to the variables; without it, the penalty term would drive all the coefficients to zero and damage the model.

Again consider the second image. Using the least squares method, the model overfits, resulting in this image.


In this image we can see that the line passes perfectly through the two points. A slight change in the value of x causes a drastic change in the value of y, which means the line has a high slope. This model has low bias and high variance.

Now consider the below image. Here we can see that the line does not overfit, and the slope of the line is less than that of the previous line. Here the model has higher bias but lower variance.


As the value of lambda increases the slope of the line decreases. The slope of the line does not go to zero even at a high value. Only when the lambda is at infinity does the slope become zero.


So basically, as lambda increases, the value on the Y axis becomes less dependent on the X axis. Hence the model gains some bias but its variance is reduced.
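This shrinkage is easy to verify numerically. The sketch below (with made-up data) solves ridge regression in closed form with NumPy; as lambda grows, the fitted coefficients shrink toward zero:

```python
import numpy as np

def ridge(X, y, lam):
    # Closed-form ridge solution: (X'X + lam*I)^(-1) X'y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=10)

for lam in [0.0, 1.0, 100.0]:
    # The overall size of the coefficient vector decreases as lambda grows.
    print(lam, np.round(ridge(X, y, lam), 3))
```

With lambda = 0 this is ordinary least squares; at very large lambda every coefficient is pushed close to (but never exactly) zero, matching the behaviour described above.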

Lasso Regression:

LASSO stands for Least Absolute Shrinkage and Selection Operator. In Ridge regression the useless features are penalised but never fully discarded; in Lasso regression these variables are dropped entirely. Lasso regression is a combination of Ridge regression and the subset selection method [in subset selection we use various feature selection techniques to filter out unwanted features].

Data sets must satisfy the following assumptions, which are similar to those of simple regression:

  • All the predictor variables must be independent of each other.
  • There must be some kind of conditional dependence between the predictor and the predicted variables.
  • All the independent variables must be standardised.

The equation for Lasso regression is given as:

[(Observed value) – (Predicted value)]² + [lambda * modulus(slope)]

{ modulus(slope) => |slope|, which implies that whether the slope is positive or negative, its modulus is always positive. }

Looking at the above equation, we can observe that when lambda is zero, the overall equation reduces to the sum of squared errors. As lambda increases, more coefficients are set to zero and the useless features are eliminated, thus increasing the bias.

As lambda increases, the slope decreases, just as we saw in Ridge regression. But in this case we can see a kink at zero, and as lambda increases the kink becomes sharper. This means that with the increase in lambda, the bias increases too.
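The zeroing-out behaviour comes from the soft-thresholding operation at the heart of Lasso. Here is a small sketch (the coefficient values are made up, and the closed form is exact only in the idealised orthonormal-design case):

```python
import numpy as np

def soft_threshold(z, lam):
    # Shrinks each coefficient toward zero by lam, and sets it to
    # exactly zero once its magnitude drops below lam -- this is how
    # Lasso eliminates features entirely, unlike Ridge.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

coefs = np.array([3.0, -0.5, 0.2])  # hypothetical least-squares coefficients
for lam in [0.0, 0.3, 1.0]:
    # At lam = 0.3 the smallest coefficient is already driven to exactly 0.
    print(lam, soft_threshold(coefs, lam))
```

Ridge shrinkage, by contrast, multiplies coefficients toward zero but never lands exactly on it.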


Lasso regression solves the disadvantages of Ridge regression and subset methods. It is considered a good regression model, but it also has some disadvantages.

Lasso fails when the number of observations n is less than the number of variables, because it can select at most n variables before it saturates. If some variables are highly correlated, Lasso selects only one of them and drops the others, which can cause issues. And when there is very high correlation between predictors, Lasso is outperformed by Ridge regression. All of these disadvantages are solved by ElasticNet regression.

ElasticNet Regression:

ElasticNet automatically selects variables, performs continuous shrinkage, and can select from a group of highly correlated variables. Mathematically it is defined as:

[(Observed value) – (Predicted value)]² + [lambda1 * slope * slope] + [lambda2 * modulus(slope)]

ElasticNet has two parameters instead of just one. These parameters control the shrinkage and together are termed the ElasticNet penalty.

The above equation is further simplified as:

[(Observed value) – (Predicted value)]² + lambda * { [alpha * slope * slope] + [(1 – alpha) * modulus(slope)] }

where alpha takes a value between 0 and 1.

If alpha is equal to one, ElasticNet behaves like Ridge regression, and if alpha is equal to zero, it behaves like Lasso regression. Between zero and one, it combines the effects of Ridge and Lasso regression.
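The blending role of alpha can be checked directly. This sketch follows the article's own convention (alpha on the squared term); note that some libraries parameterise the mix the opposite way (for example scikit-learn's `l1_ratio` weights the absolute-value term):

```python
def elastic_net_penalty(slope, lam, alpha):
    # alpha = 1 -> pure Ridge (squared) penalty;
    # alpha = 0 -> pure Lasso (absolute-value) penalty.
    return lam * (alpha * slope ** 2 + (1 - alpha) * abs(slope))

print(elastic_net_penalty(2.0, 1.0, 1.0))   # 4.0, the Ridge penalty 2^2
print(elastic_net_penalty(2.0, 1.0, 0.0))   # 2.0, the Lasso penalty |2|
print(elastic_net_penalty(2.0, 1.0, 0.5))   # 3.0, an even mixture of both
```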


Blockchain Cryptography

May 30, 2022


The written word was one of the greatest inventions in human history. The ability to write information down raised the need to conceal it, and that need is the reason we study cryptography today.

The earliest evidence of cryptography is found in 'The Histories', a book written by Herodotus, the father of history. In this book, narrating the conflict between the Greeks and the Persians, he describes how steganography was used by Demaratus (a Greek who had been expelled from Greece) to warn Greece about the Persian attack.

Demaratus used a wooden folding tablet to deliver his message secretly. He wrote the message on the wood and then covered it with a layer of wax, hiding the message. This way of hiding messages is called steganography.

Along with steganography, cryptography also developed. In cryptography the message is not hidden; instead, the meaning of the message is changed using encryption. The message is scrambled using some protocol that both sender and receiver agree on. The sender uses the protocol to encrypt the message, and the receiver uses it to decrypt the message.

Further, Cryptography is divided into two categories: 1) Transposition 2) Substitution.

In transposition, letters are rearranged to hide the message; an early example is the Spartan scytale.

In substitution, letters are replaced by other letters using some protocol on which both sender and receiver agree. The Caesar cipher is a classic example. Early evidence of substitution is also found in the Kama Sutra, a text written in the fourth century A.D. by the Brahmin scholar Vatsyayana.

Further substitution is divided into Code and Cipher. In Code the entire word is replaced by another word whereas in Cipher, the letter is replaced by another letter.


In this article we will discuss how cryptography is used to achieve secure transactions in blockchain.

Blockchain Cryptography:


It is important to secure user information and transaction data in order to encourage users to join a blockchain. Digital encryption technology is a key element of blockchain technology, and this is what gave rise to blockchain cryptography. The blockchain acts as a decentralised database, storing all user transaction information on the chain itself, so the demands on its security are necessarily very high.

Since blockchain is based on a peer-to-peer distributed network model, there is no central node and the nodes do not need to trust each other. This is why blockchain needs to protect user transaction information sent over unsecured channels while also maintaining transactional integrity.

Before getting deep into blockchain cryptography let’s try to understand different cryptographic algorithms.

Symmetric-Key Cryptography

In symmetric key cryptography the same key is used for encryption and decryption of messages. Some famous algorithms include Advanced Encryption Standard (AES) and the Data Encryption Standard (DES).

Asymmetric-Key Cryptography

In asymmetric-key cryptography, different keys are used for encryption and decryption. The key used for encryption is called the public key, and the one used for decryption is called the private key.

Some algorithms that use these keys are Rivest Shamir Adleman (RSA), Diffie-Hellman key agreement, Elliptic Curve Cryptography (ECC) and Digital Signature Algorithm (DSA).

Hash Functions

In hashing, data is passed through a hashing function, which outputs a unique fixed-length hash. Some hashing algorithms are MD5 and the SHA family. In blockchains such as Bitcoin, the SHA-256 algorithm is used.

Asymmetric-key cryptography, or public-key cryptography, is an important element of blockchain. It is used in wallets and transactions. When a user opens a wallet account, they generate a public and a private key. The wallet address, which is just a combination of numbers and letters, is generated from the public key. The private key is used to prove ownership of the wallet.

A transaction on the blockchain is a message that is broadcast to the network. The message says, "An amount of coin from my wallet is transferred to wallet Y." Once the message is confirmed, the transaction is immutably written into the ledger and the balances are updated.

Public-key cryptography is also used in digital signatures. Digital signatures are used to verify that the information put on the blockchain is correct.

Apart from public-key cryptography, cryptographic hashing is another important technology used in blockchain. This technology is responsible for immutability in blockchain. Cryptographic hashing provides the following advantages:

There is always going to be a unique hash for the given content. It doesn’t matter how many times you pass the same content from the hashing function, the hash for the given content will be the same. This is called deterministic property.

If even the tiniest piece of the content is changed, the hash function will generate a new hash that is totally different from the previous one. This property is called the avalanche effect.

It is infeasible to determine the input data or content from its hash. This property is called irreversibility.

It is infeasible to find two different contents with the same hash. This property is called collision resistance.
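The first two properties are easy to see with Python's standard `hashlib` module:

```python
import hashlib

def sha256(data: bytes) -> str:
    # Hex digest of the SHA-256 hash used by Bitcoin's blockchain.
    return hashlib.sha256(data).hexdigest()

# Deterministic: the same content always yields the same hash.
assert sha256(b"blockchain") == sha256(b"blockchain")

# Avalanche effect: changing a single character gives a completely
# different 64-character hash.
h1, h2 = sha256(b"blockchain"), sha256(b"Blockchain")
print(h1)
print(h2)
```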

Hash functions also play a major role in linking the blocks to one another and maintaining data integrity in the blockchain. Suppose there are 99 blocks in the blockchain and a 100th block is added: the hash of the 100th block covers the hash of the 99th block, which in turn covers the hash of the 98th, and so on. By traversing the hashes backwards, every block from 100 down to 1 is linked by cryptographic hashing.

This makes the blockchain immutable. If someone in the blockchain tries to change even 1 bit of data in the block, this change will alter the hash of the block and all the blocks after it. Miners and nodes on the blockchain would immediately notice the resulting hashes don’t match their version of the chain and reject the change.
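A toy sketch of this linking (the block contents are invented) shows why a single edit breaks every later hash:

```python
import hashlib

def block_hash(prev_hash: str, data: str) -> str:
    # A block's hash covers its own data plus the previous block's
    # hash -- this is the link in the chain.
    return hashlib.sha256((prev_hash + data).encode()).hexdigest()

def build_chain(blocks):
    prev, hashes = "0" * 64, []
    for data in blocks:
        prev = block_hash(prev, data)
        hashes.append(prev)
    return hashes

original = build_chain(["genesis", "Alice pays Bob 5", "Bob pays Carol 2"])
tampered = build_chain(["genesis", "Alice pays Bob 500", "Bob pays Carol 2"])

# The edit in block 2 changes its hash and every hash after it.
print(original[1] != tampered[1], original[2] != tampered[2])   # True True
```

Nodes comparing the tampered hashes against their own copy of the chain would reject the altered blocks immediately.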


Three technologies, namely hashing, public-key cryptography and digital signatures, are used to secure the blockchain. Public-key encryption serves as the premise for blockchain wallets and transactions, cryptographic hash functions offer the trait of immutability, and digital signatures ensure the credibility of the information.


A Guide to the Future of Passwords: What Data Security Will Look Like Soon

May 27, 2022

Remember the last time you forgot a password and had to reset it? Annoying, right? Maybe you’re finding it doesn’t happen as much as it used to though. Data security experts have been working on improving the way applications authenticate users for many years. Today, all kinds of businesses from retail banks to ad exchanges are implementing a range of security solutions that go beyond a one-step password process.

In this guide, we’ll examine a few of those. We’ll also take a look beyond, at what the future of passwords holds in store. But why is all this change necessary? Well, here’s the thing: passwords pose a few problems.

The problems with passwords today

We’ve reached a point in the evolution of the web where passwords are becoming more of a problem than a solution. Most obviously, they represent a pain point in the user experience. They can be difficult to remember, particularly when we have to use so many to access different sites.

But that’s not the biggest problem. The most significant issue with passwords is that they are now the weakest link in the digital security chain. According to Verizon’s 2021 DBIR report, over 80% of security breaches are due to weak or compromised passwords.

All this results in the following conundrum. Passwords that are short enough to remember can easily be hacked. And passwords that are long enough to be secure are difficult to remember. This leads to people reusing the same password across multiple sites, which in itself increases users’ vulnerability to cybercriminals.

Evidently, this poses a tricky challenge for modern businesses. You could be using the best customer support software, but all it takes is one data breach and you'll lose customer trust. So how can companies make sure they're keeping their customers' details safe? Well, if it were an easy problem, it would have been solved already. But even today we're seeing signs of a budding revolution in how we access data and keep it secure.



The future of passwords has already begun

What’s becoming obvious is that data security is developing along two parallel paths: business-oriented, and consumer-oriented. That is to say that many businesses already have enhanced solutions in place that are too tech-intensive and cost-prohibitive to be used by private individuals. There’s a crossover, of course, particularly in the B2C space.

Today, most businesses are well aware of the need to focus on data security and web app security testing. Here are a few alternatives to passwords alone that are already widely used:

Single sign-on (SSO)

This authentication method allows users to sign in securely once to multiple independent but related sites. When a user signs into a site, it sends a token to a centralized SSO system requesting authentication for that user. The system then sends a positive authentication token back to that site. When the user then moves to another application, the token is passed to the new site. This means the user only has to sign in once.

You may have logged into sites with Facebook or Google. This is essentially the same principle.
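The token flow described above can be sketched in a few lines. This is a deliberately simplified toy (all names are hypothetical, and a real SSO system would use signed, expiring tokens such as SAML assertions or JWTs):

```python
import secrets

class SSOProvider:
    """Toy sketch of centralized SSO: sign in once, reuse the token."""

    def __init__(self):
        self._sessions = {}  # token -> user

    def sign_in(self, user: str) -> str:
        # Issued once, after the user authenticates with the provider.
        token = secrets.token_hex(16)
        self._sessions[token] = user
        return token

    def validate(self, token: str):
        # Any participating site presents the token here instead of
        # asking the user to sign in again.
        return self._sessions.get(token)

sso = SSOProvider()
token = sso.sign_in("alice")
print(sso.validate(token))   # the same token works for every site
```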

Overall, SSO is a popular choice for several reasons. First, it makes the user experience very smooth. And from the administrator’s perspective, it makes everything more straightforward. Changing password complexity requirements across the whole network is much easier, for instance. And when someone moves on from the company, their access to multiple applications can be removed in one fell swoop.

The initial sign-in can use a number of methods, from a basic password to multi-factor authentication.

Multi-factor authentication (MFA)

Multi-factor authentication has actually been around a while. The basic concept involves using a multiple-step process to prove you are who you claim to be. Have you noticed that it’s becoming less common to be asked for your mother’s maiden name as a security check? That was an early version of MFA, but it was too easy for criminals to hack.



In a world where business text messaging is commonplace, it was simple to find a way to improve this process. Nowadays, most companies favor using a code sent to your device by SMS message to confirm your identity.

However, nothing stands still for long in the world of data security. It seems that using SMS messaging for MFA authentication may be coming to an end. There’s been an upsurge in criminals hacking the SMS step of this process by porting phone numbers to new SIM cards and getting hold of MFA codes that way. If you’ve noticed more and more companies requiring you to use dedicated authentication apps, this is why.
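Those dedicated authenticator apps typically implement TOTP (RFC 6238), which builds on the counter-based HOTP (RFC 4226). A minimal sketch using only the standard library:

```python
import base64
import hmac
import struct
import time

def hotp(key: bytes, counter: int, digits: int = 6) -> str:
    # HMAC-based one-time password (RFC 4226).
    digest = hmac.new(key, struct.pack(">Q", counter), "sha1").digest()
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def totp(secret_b32: str, interval: int = 30) -> str:
    # Time-based variant (RFC 6238): the counter is just the number of
    # 30-second windows since the Unix epoch, so server and app agree
    # without any message being sent over SMS.
    key = base64.b32decode(secret_b32)
    return hotp(key, int(time.time()) // interval)

# RFC 4226's published test key yields its documented first code:
print(hotp(b"12345678901234567890", 0))   # 755224
```

Because the code is derived locally from a shared secret and the clock, there is no SMS message for a SIM-swapping attacker to intercept.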

Password managers

I don’t know about you, but password managers from the likes of Google and Apple have made my life about 1,000 times easier. Long gone are the days when I had to reset some password or other with irritating regularity. They’re a neat way of meeting the challenge of storing multiple long, complex (and therefore secure) passwords without the user having to remember them.

Of course, if you take an interest in the digital security space, you might see a problem with this. There’s no doubt that the convenience of password managers is their greatest selling point. But they can also foster consumer dependence on the big tech ecosystems. After all, there’s no need for the big tech giants to conduct RFM analysis if they know users will keep coming back because they can’t live without a password manager.

What’s more, password managers don’t actually solve all the problems associated with passwords. That’s because they are essentially just big, encrypted vaults full of passwords that can be accessed with…a password. So while they do add a layer of security, they’re not game-changers.


Passphrases

One simple alternative that keeps to the basic principle of using a password but is more secure is the passphrase. Passphrases are just longer passwords made up of phrases that are much easier to remember but are also difficult to hack.

For example, the passphrase “Batmanatemysandwichlastwednesday” is much less intimidating to memorize than “GL4%!d9Ip;4^5H”, right? But because the combination of words used in the passphrase is unique, it’s very difficult for cybercriminals to guess.

However, many sites still impose 12-character limits on passwords. This is one of the UX design mistakes it’s vital to avoid. Increasing the character limit to encourage passphrase use would be an easy, low-cost way of improving security.
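Generating a passphrase like this is straightforward. Here is a sketch using a tiny, made-up wordlist; real generators draw from diceware-style lists of several thousand words, which is where the strength comes from:

```python
import secrets

# Hypothetical mini wordlist, for illustration only.
WORDS = ["batman", "sandwich", "wednesday", "purple", "river",
         "cloud", "guitar", "marble", "rocket", "lantern"]

def passphrase(n_words: int = 4) -> str:
    # secrets.choice is cryptographically secure, unlike random.choice.
    return "-".join(secrets.choice(WORDS) for _ in range(n_words))

print(passphrase())   # e.g. "river-batman-cloud-sandwich"
```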



Biometric authentication

This is where the futuristic element really ramps up. For many years, the data security industry has been investing heavily in biometric research. In fact, the Biometrics Research Group estimates that the global biometrics market will be worth nearly $78bn by 2026.

It’s easy to see why. Being able to deliver secure authentication without any need for a password at all has long been the dream. Now, we’re seeing it become a reality with applications using facial and fingerprint recognition. Nevertheless, the tech isn’t quite perfect yet. If an individual’s appearance changes—say, because of injury or because they’re wearing a face mask—the authentication process will fail, and a password or PIN will have to be used as a fallback.

But in the medium term, the biggest barrier to biometrics taking over completely is more likely to be consumer resistance than tech constraints. Concerns around the death of privacy online are not unusual.

The Dazzle Club is a collection of artists in London, UK, who paint their faces in jarring, asymmetric patterns to outsmart facial recognition tech. They meet once a month just to wander around the city for an hour, protesting public surveillance. For now, this is not typical of consumers’ reaction to biometric tech more generally, which remains broadly positive.

Nevertheless, it's possible that this kind of mistrust could become a problem. Not all consumers are being won over by the convenience of these systems. Businesses in the biometric space should be using robust hybrid business communication processes to engage their customers. Allaying any fears about how the tech will be implemented is key.

The further future of passwords

Many experts say that the ultimate aim is to get rid of passwords altogether. What might that look like?

Given the inherent security flaws of user-generated passwords, the next stage in data security will involve different forms of user identification. We’ve already mentioned biometric authentication via facial or fingerprint recognition technology, which will become more reliable with time. But there are other intriguing possibilities.



Retail giant Amazon is already testing out software that measures your typing speed and the pressure you place on your keypad as a way of identifying users uniquely. This is one example of identification via user behavior, a genuinely innovative approach. Some of these systems will detect physical behavior like typing style or posture. Others will use behavior patterns such as how you search for information.

These systems will only challenge you if they detect any behavior that doesn’t fit with your profile in some way. At that point, you may be asked for a password or some other identifying input. If this kind of tech can be perfected, it would be liberating for users. It would also give the cybercriminals a real headache.

Or rather—it would for a while, at least. The truth is that data security will always be a continuous arms race. As data security technology advances, malign actors will try to develop ever more ingenious ways of getting around it.

Passwords may not be perfect, but they are cheap to implement, easy to reset and everybody’s familiar with them. The way we use them will evolve in the future, but it is likely to be some time before we leave them behind for good. Furthermore, you can also learn more about antiviruses, web security and technology on Cover Junction.


Jenna Bunnell – Senior Manager, Content Marketing, Dialpad

Jenna Bunnell is the Senior Manager for Content Marketing at Dialpad, an AI-incorporated cloud-hosted unified communications system that provides valuable call details for business owners and sales representatives. She is driven and passionate about communicating a brand’s design sensibility and visualizing how content can be presented in creative and comprehensive ways. Check out her LinkedIn profile. Jenna Bunnell has also written content for MacSecurity and Shift4Shop.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

5 Upcoming Cybersecurity Risks and Concerns of Web 3.0 and The Metaverse

May 21, 2022

Web 3.0 and the metaverse are forging a new virtual digital world. Both promise an immersive virtual experience that will challenge online connectivity as we know it. Whilst this digital future promises countless benefits, it also raises concerns, particularly around security.

Huge investments are being made to paint this digital future. Facebook co-founder Mark Zuckerberg has rebranded the company to Meta Platforms. Like previous waves of web innovation, Zuckerberg is banking on the future.


Image Source

Internet technology evolves rapidly. Digital innovations spawn from ideas into reality in an instant. The World Wide Web is currently experiencing one of these innovations. As we speak, the WWW is in a transitional phase, shifting from Web 2.0 to Web 3.0.

Web 3.0 adopts artificial intelligence and the metaverse. Both innovations encourage a completely immersive experience where exploration is cutting-edge and simplified.

Digital tools such as chatbot use cases will be fully realized in a virtual reality environment. Web 3.0 and the metaverse want us to experience the reality of the outside world from the inside.

With digital advancement comes digital risks. In this post, we will explore our digital future with Web 3.0 and the metaverse, and if we can secure it.

What are Web 3.0 and the metaverse?

Technology shapes the internet of the future. Web 3.0 and the metaverse will shape how we’ll consume content in the future. Let’s take a deeper look into both innovations.

Web 3.0

The web has become the most valuable information resource in the world. Users can interact and share across a vast number of online applications. Apps of today are built on the web rather than inside a desktop computer. Immersion is an endless landscape on today’s internet.

Web 3.0’s ambition is to power the next wave of internet applications and services. Today’s version of the internet is Web 2.0. Innovations such as social media apps, blogs, and content-sharing websites have powered Web 2.0’s development.

Web 3.0 aims to streamline app-building platforms for online developers. This is a performance gain that popular apps and others will be able to benefit from moving forwards.


This high performance propels user-created content with communication being its key focus.

Screen Shot 2022-05-21 at 03.31.29

Image Source

Under Web 2.0, companies provide applications and services in a centralized manner, meaning companies such as Instagram and Microsoft have complete control over their users’ content.

User preferences are then collected and used in marketing such as a podcast marketing strategy or individual online ads.


Web 3.0 aims to stop this involvement by introducing a fully decentralized and democratized internet. It will produce a semantic web where all data is connected.

This will be utilized through underlying blockchain technology. It will enable users to interact with online services governed by peer-to-peer networks rather than a single entity server.

This means there will be no centralized ownership of content, and users will have full control of their digital identity.

The metaverse

The metaverse is a fully immersive successor to the internet. It’s a combination of virtual reality (VR), augmented reality (AR), mixed reality (MR), gaming, cryptocurrencies, and social media.

It’s a 3D reality where the user’s digital world is immersive. In the metaverse, a company’s remote office phone system will take on the form of VR headsets and meetings between digital avatars. Simply put, the metaverse will transform the internet from 2D to 3D.

Examples of this innovation would be a virtual seat at a sports game or trying on clothes in a digital store.

Transactions in the metaverse will occur through a cryptocurrency blockchain. Blockchain is a technology that permanently records transactions, typically in a decentralized ledger.

It’s the difference between a bank keeping track of your account versus a network of computers.
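The bank-versus-network contrast can be made concrete with a toy hash-chained ledger. This is a minimal Python sketch (the block layout and transaction strings are invented for illustration), showing why tampering with an old record is detectable: every block commits to the hash of the block before it.

```python
import hashlib
import json

def block_hash(block):
    # Hash the block's contents deterministically.
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def add_block(chain, transactions):
    # Each new block commits to the previous block's hash, so altering
    # any earlier record invalidates the link to every later block.
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})
    return chain

chain = []
add_block(chain, ["alice -> bob: 5"])
add_block(chain, ["bob -> carol: 2"])

# Tampering with an old block breaks the link to its successor.
chain[0]["transactions"] = ["alice -> mallory: 500"]
assert chain[1]["prev_hash"] != block_hash(chain[0])
```

In a real blockchain, many independent nodes each hold a copy of this chain, so a forger would have to rewrite not one ledger but most of them at once.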

Screen Shot 2022-05-21 at 03.32.25

Image Source

Cryptocurrency in the metaverse allows complete transparency of transactions. It’s a public ledger of historic transactions. Public blockchains like Bitcoin and Ethereum also promote transparency.

This transparency is in direct contrast to traditional banking books. It will blur the line between revenue operations and sales operations, with both adopting a transparent, holistic approach where they can progress together.

This is why non-fungible tokens (NFTs) are being used in the metaverse. They are viewed as an asset as their uniqueness can be proved via the ledger.

5 cybersecurity risks and concerns

As the ecosystem of Web 3.0 and the metaverse is still in development, cybersecurity risks go beyond today’s scope of view. Let’s take a look at 5 risks and concerns facing the future of the internet.


Privacy concerns

A recent survey revealed that 74% of Americans are more concerned about their online privacy than ever before. With the technological advancement of Web 3.0, the risks are unknown, and questions have arisen about future digital security.

Total immersion brings more issues to its users. With its vast digital landscape, how will regulators respond to illegal content being hosted? Who will have jurisdiction to apply corresponding laws? Who will process identifiable information?

In a fully immersive existence, safety in interactions will need to be monitored. The opportunity for someone to masquerade as someone else in a 3D environment will be a concern in many interactions.

One of the main privacy issues with the metaverse is the sheer amount of personal data available. The data collected on individual participants will be far more intimate and in-depth.

Screen Shot 2022-05-21 at 03.33.10

Image Source

Traditional social media and services such as whois domain lookup will be transported into the metaverse. Companies will be able to track biometric data such as facial expressions and vocal inflections in real time. A user’s entire behavior has the potential to be stolen.

This means user behavior can be monitored and used for personalized advertising campaigns. A user’s privacy on Web 3.0 will be non-existent if this data is stolen.


Policing cybercrime

As we looked at earlier, Web 3.0 decentralizes the internet. It allows users to control their data and identity by stepping away from the “if it’s free, you’re the product” model. Web 3.0 gives data back to the entities who own it.

Whilst this model benefits the user, the lack of central data access makes it more difficult to police cybercrime. This will be especially problematic when it comes to online harassment, hate speech, and child abuse images. In a decentralized web, who will enforce worldwide hosted content?

Without central control and access to data, policing cybercrime becomes far harder. One advantage of a centralized web is that governments can make large corporations enforce the law.

Fighting cybercrime will rely on users taking more responsibility for their data and online interactions. Services such as a Myraah Web 3.0 locker secure a user’s data in a private locker, allowing them to take control of their online security.

Cryptocurrency wallets

Cryptocurrency wallets store digital assets such as non-fungible tokens (NFTs) and cryptocurrencies. Carrying your wallet in the metaverse will be an essential act. Your digital wallet will include your avatars, avatar clothing, and avatar animations.

Your crypto wallet will also be linkable to real-world identities. You will be able to buy music, movies, and apps and it will be associated with your reputation scores. So your actions in the metaverse will affect your real-world reality.

Screen Shot 2022-05-21 at 03.33.38

Image Source

Most platforms in the metaverse will need a crypto wallet. The problem is that criminals can impersonate someone in the metaverse and gain access to their wallet with ease.

Many users aren’t tech-savvy enough to add security features to their wallets such as two-factor authentication.

The problem grows in the metaverse, as it isn’t yet monitored with rules and regulations. It could lead to the death of privacy in the digital landscape as we know it. It’s up to users to secure their cryptocurrency, and security applications can easily be ignored.

Risks of decentralization

Decentralization is key to ensuring the internet remains a public resource that is available to all users. The open-source nature of Web 3.0 means that contributors can collaborate from day one.

With this transparency come serious security vulnerabilities in integrated data. A single account will contain all of a user’s personal data, protected by a single password.

Imagine if a cybercriminal gets hold of this password. They will be able to access and control a user’s entire life. People who spend more time in the metaverse molding their online personalities will have a lot more to lose.

Then there’s also the fact that large tech companies are leading this change. Their involvement raises concerns about users’ anonymity and information.

Technical limitations

The tech powering Web 3.0 and the metaverse is still in its infancy. Centralized platforms are powered by mature, high-performance technologies. Decentralized networks are new, and because of this they face latency issues.

Web 2.0 can operate the best small business phone systems and other remote capabilities with ease. Web 3.0 runs on a decentralized network that is facing a gradual transition rather than a rapid one. So it is behind in terms of connectivity.

These latency issues will cause adoption rates of the new internet to decrease. Not everyone will favor its benefits over the speed of Web 2.0. Technical limitations will leave people behind and more open to security risks than ever before.

Final thoughts

Web 3.0 and the metaverse are still in their infancy. We don’t fully understand their possibilities, or their negatives. A decentralized digital landscape sounds great on paper, but will the general public want to go along for the journey?

The metaverse will give users the chance to experience realities they might never otherwise encounter. It’s a revolutionary technology that will change our online experience as we know it.

But does that mean users will be lost to augmented reality? Will users prefer their online lives to their real ones? Will they stay online whilst the outside world passes them by?

Only time will tell. It’s a digital future, so let’s hope it’s a good one.

Jenna Bunnell – Senior Manager, Content Marketing, Dialpad

Jenna Bunnell is the Senior Manager for Content Marketing at Dialpad, an AI-incorporated cloud-hosted unified communications system which features a Dialpad conference call that provides valuable call details for business owners and sales representatives. She is driven and passionate about communicating a brand’s design sensibility and visualizing how content can be presented in creative and comprehensive ways. She has also written for sites such as LandingCube and CrankWheel. Check out her LinkedIn profile.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

Consensus – The Way BlockChain Network Makes A Decision

May 18, 2022

What is consensus?

By consensus, we mean that a general agreement has been reached. Suppose you and your friends decide to go on a vacation, and you settle on Goa. If there is no disagreement on the proposed place, we can say that a consensus has been reached.

In blockchain reaching consensus means that at least 51% of the nodes on the network agree on the next global state of the network.

But the question is: how do nodes on the blockchain agree on a decision if some of them are likely to be malicious or to fail? This problem is called the Byzantine Generals’ Problem, and solutions to it are provided by Byzantine fault tolerance.

Byzantine Generals’ problem:

img1 (1)

The Byzantine Generals’ Problem is a generalised version of the Two Generals’ Problem. The problem goes like this:

Each general is located with their army at a different position around the city they intend to attack. The generals need to decide whether to attack or to retreat. It doesn’t matter whether they attack or retreat, as long as they all agree on the same decision, and once the decision is made it cannot be changed. These are the requirements that need to be fulfilled.

The communication challenge is that one general can send a message to another only by courier, and the message may be delayed, destroyed or lost. There might also be a traitor among the generals who sends fraudulent messages to confuse the others, which could cause the plan to fail.

In such uncertain conditions it becomes difficult for the generals to carry out the operation (attack or retreat). To avoid complete failure, the majority of generals need to agree on and execute the same action.

If we apply the same idea to blockchain, each general can be seen as a node in the network. To agree on the next global state of the network, nodes need to reach consensus, and the only way to achieve this is by having ⅔ or more honest nodes on the network. This means the system is prone to failures and attacks if more than one-third of the nodes decide to act dishonestly.

Byzantine Fault Tolerance (BFT):

A BFT system resists the failures described in the Byzantine Generals’ Problem: it keeps operating even if some nodes fail or act dishonestly.

There are various ways for blockchain to achieve BFT and this brings us to the Consensus Algorithm.

The consensus goal of a blockchain is that all nodes maintain the same distributed ledger. In Web 2 architecture, thanks to the central server, the other nodes only need to stay aligned with that server, so consensus is hardly a problem. In a distributed network, however, each node is both a host and a server, and it needs to exchange information with other nodes to reach consensus. To avoid damage caused by malicious nodes, we need an excellent consensus protocol.

There are three types of blockchain: private, public and consortium. Each has different application scenarios, and each calls for a different consensus protocol according to need.

Consensus Protocol:

Here we will look at some consensus protocols that can effectively address the Byzantine Generals’ Problem.

PoW (Proof of Work):

img2 (1)

PoW is the most widely adopted consensus protocol; it was popularized by Bitcoin, which launched in 2009. Apart from Bitcoin, it is also used by Ethereum, Litecoin, Dogecoin and others. In PoW, nodes on the network use computational power to win the right to add new blocks to the blockchain. Once a block is added to the network, that node receives a predefined reward plus transaction fees.

To add a new block to the blockchain, a node needs to solve a cryptographic puzzle. The PoW puzzle is very difficult to solve: nodes adjust a nonce in order to solve it, which in turn requires a great deal of computational power. The node that first solves the puzzle wins the right to add the new block to the blockchain.
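The nonce-adjusting puzzle can be sketched in a few lines of Python. This is a simplified illustration, not Bitcoin’s actual implementation: real networks use a numeric difficulty target rather than counting leading zero hex digits, and the block data here is invented.

```python
import hashlib

def mine(block_data: str, difficulty: int) -> int:
    """Adjust the nonce until the block's SHA-256 hash starts with
    `difficulty` zero hex digits -- a stand-in for the PoW puzzle."""
    nonce = 0
    target = "0" * difficulty
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1  # adjust the nonce and try again

# Finding the nonce takes many hashes; verifying it takes just one.
nonce = mine("block with some transactions", difficulty=4)
```

Each extra zero digit multiplies the expected work by 16, which is why honest miners accumulate so much hashing effort and why rewriting a long chain is so expensive.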

A malicious attacker may be capable of overthrowing one block in the chain, but as valid blocks are added the accumulated workload grows, so overthrowing a long chain requires a huge amount of computational power.

PoS (Proof of Stake):

img3 (1)

In PoS, the node that creates the next block is chosen according to the stake it holds in the blockchain, unlike in PoW where it is decided by computational power. If someone holds 10% of the stake (cryptocurrency), there is a 10% chance that they will mine the next block. Nodes still have to solve a puzzle of sorts, but they do not need to adjust a nonce; the “solution” is the amount of stake they hold. Hence, PoS is an energy-saving consensus protocol. It was first used by PPCoin.
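The stake-weighted selection can be illustrated with a short Python sketch. The validator names and stake amounts are invented, and real PoS protocols layer randomness beacons and other safeguards on top of this basic weighted draw.

```python
import random

# Hypothetical validators and the stake each holds (10%, 40%, 25%, 25%).
stakes = {"alice": 10, "bob": 40, "carol": 25, "dave": 25}

def pick_block_creator(stakes, rng=random):
    # A node's chance of creating the next block is proportional to its
    # stake: alice holds 10% of the coins, so she wins ~10% of rounds.
    nodes = list(stakes)
    return rng.choices(nodes, weights=[stakes[n] for n in nodes], k=1)[0]

random.seed(0)  # fixed seed so the run is repeatable
counts = {n: 0 for n in stakes}
for _ in range(10_000):
    counts[pick_block_creator(stakes)] += 1
# counts is now roughly proportional to the stakes above
```

No hashing race is involved, which is where the energy saving comes from: selection replaces competition.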

DPoS (Delegated Proof of Stake):

img4 (1)

In DPoS, stakeholders vote to select the block creators. Stakeholders thereby give the right to create blocks to delegates they support instead of creating blocks themselves, greatly reducing computational power consumption. It is based on a democratic system: if delegates fail to generate blocks in their turn, they are dismissed, and the stakeholders select new nodes to replace them. DPoS is a low-cost, high-efficiency consensus protocol. BitShares and EOS use DPoS to reach consensus.

Practical Byzantine Fault Tolerance:

img5 (1)

PBFT is a consensus protocol based on replication between known parties that can tolerate failure of up to ⅓ of the parties. It is an algorithm for handling the Byzantine faults that arise when the Byzantine Generals’ Problem prevents consensus. In PBFT, the primary node forwards the message sent by the client to the other replicas (three of them, in a four-node example). Even if one node crashes, the message passes through all five phases to reach consensus among the remaining nodes. Finally, the nodes reply to the client to complete a round of consensus. PBFT ensures network fault tolerance and allows thousands of operations per second with only a negligible increase in waiting time. Tendermint, Hyperledger Fabric and many others use this protocol to achieve consensus.
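The ⅓ bound has a simple closed form: a PBFT network of n replicas tolerates f faulty ones provided n ≥ 3f + 1. A one-function Python sketch makes the relationship concrete:

```python
def max_faulty(n: int) -> int:
    # PBFT needs n >= 3f + 1 replicas to tolerate f faulty ones,
    # so a network of n replicas tolerates f = (n - 1) // 3 failures.
    return (n - 1) // 3

for n in (4, 7, 10, 100):
    print(n, "replicas tolerate", max_faulty(n), "faulty")
```

This is why four replicas is the smallest useful PBFT deployment: it is the first size that survives even a single Byzantine node.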


All of the above protocols suit different application scenarios. Despite their usefulness, each also has its downsides. For example, if someone on a PoW blockchain accumulates more computing power than the combined power of all honest mining entities, they can disrupt the consensus; this is referred to as a 51% attack. In PoS, “nothing at stake” is a major drawback: block generators can vote on multiple chains because they have nothing to lose.

This behaviour prevents PoS from achieving consensus. DPoS tends towards centralisation: someone with a major stake can vote for themselves and become a validator. PBFT is hard to implement as it requires a large amount of computation.

Apart from these consensus protocols there are other consensus protocols like Ripple Protocol, Proof of Importance, Proof of Authority, Delayed Proof of Work, Delegated Byzantine Fault Tolerance, Federated Byzantine Agreement, Proof of Elapsed Time, Proof of Capacity, Proof of Burn etc.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

Exploring Crypto Psychology

May 13, 2022

As we gradually proceed towards a cashless society, the payment system around us is rapidly transforming into a digital economy, with the majority of money being transacted electronically through various online payment applications and only a small percentage of global money existing as physical currency.

Despite the fact that society is on the cusp of becoming an advanced economy, only a minority of people are aware of the differences between cryptocurrencies and fiat currencies.

The term ‘fiat currency’ refers to money issued by the government of a country. It is not a tangible thing or commodity, but rather legal tender backed by the government that issues it. Cryptocurrency is a digital form of payment secured by encryption technology. Traditional fiat currencies are vastly different from cryptocurrency. One can, however, buy and sell a cryptocurrency or crypto asset just like any other commodity.


Image Source

Under the fiat currency system, the central bank can print any amount of money it wants, since money creation is under its direct control. This results in the concentration of fiat currency in the hands of a select few, which further leads to the development of discrimination over time. Instead of being scrutinised by the public, fiat currency has effectively become private property.


Fiat currency has evolved from citizens’ right to freely dictate and possess money into a reward system in which you are paid for your time and labour. Many people are forced to put in more time and effort to obtain more fiat currency, which is quickly spent, trapping them in a never-ending cycle of working for money.

But how did this happen?

Fear and Greed have taken over.


Image Source

Fear is a primal emotion. It once kept our ancestors alive. Fear is not a terrible emotion in itself, but when it is used for control, manipulation, or intimidation, it can harm people’s mental health. Where there are secrets, fear grows, giving birth to mistrust.

Greed is an insatiable desire for more than is required or justified, not for the greater good, but for one’s own selfish gain, at the expense of others and society as a whole.

Fear and greed have been shown to lead to cognitive dissonance: feelings of confusion or anxiety caused by a psychological conflict between beliefs and attitudes.

These are some of the outcomes of fiat currency psychology.

Let’s now take a look at the psychology of cryptocurrencies, which are built on blockchain technology.

So, how do people perceive blockchain technology?


Image Source


When it comes to cryptocurrency, transparency is a key characteristic of blockchain technology, and its distinguishing aspect. Transparency here goes hand in hand with immutability: all transactions on the chain are irreversible, meaning no one can change or delete data after it has been authenticated and embedded in the network. Transparency tends to boost productivity while also improving a sense of belongingness, the feeling that one belongs in a certain environment and can succeed there.

Transparency reduces cognitive dissonance, and blockchain is structured in such a way that the user is accountable for any activity that takes place on the blockchain, which cannot be erased or changed and is visible to many witnesses.


Blockchain technology’s decentralised aspect also targets accountability. Greed despises being held accountable. The distributed ledger is a feature of blockchain technology that ensures transactions on the blockchain are duplicated across various server sites, hence increasing accountability.


Image Source


This refers to the eradication of third parties in transactions of crypto assets or currency. Before an account can be validated in a centralised economy that uses fiat currency, one must present a government-issued ID card, a social security number, a driver’s licence, and other documents, without which one is unable to conduct business. This has left a large number of people around the world unbanked, breeding poverty and a sense of scarcity.

In contrast, blockchain technology does not require all of these identities or banks to possess a crypto asset. All you need is a smartphone and an internet connection, and voila! You can trade, buy, and sell any crypto asset you want.

PayPal Cryptocurrency

Image Source

Knowledge leads to freedom, and freedom leads to power.


According to Peter Diamandis, in 2017 half of the world’s population had internet access. Between 2022 and 2025, nearly the entire world will come online: that is four billion additional consumers who will browse the internet and expect on-demand, free digital services on the blockchain.

Technology is well known for producing abundance: abundance of information, improved access to healthcare, and longevity. The cryptocurrency revolution is designed to create economic and financial abundance. Crypto assets are assets that everyone, regardless of nationality, educational background, gender, or age, can access, with the potential to redistribute wealth across the world’s population.

“Blockchain is a new way of looking at value and a new way of creating a transaction between parties where you don’t need a third party intermediary and can track things, and really have trust” – Eric Pulier


By enhancing transparency, linking parties, and rewarding individuals for their contributions to transactions, blockchain technology has the potential to make society more trustworthy and empowered.

Blockchain technology has the ability to revolutionise customer connections for marketing and technology professionals. Companies will be in the best position to benefit from what we believe will be widespread adoption if they move quickly on this far-reaching technology.

One way to enable and create awareness of blockchain technology is to gradually desensitise society to the scarcity mentality, allowing people to choose how they live and work. Its goal is to keep money in constant circulation, making it more accessible. People will develop an abundance mindset, along with freedom and power.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

Gradient Descent

May 13, 2022

In a previous article, How to Fit Lines Using the Least Square Method?, we discussed how the Least Square Method can be used to fit lines. In this article we are going to discuss another optimisation method called Gradient Descent.

Gradient Descent is an effective tool for adjusting a model’s parameters with the aim of minimising the cost function. Here the cost function is the sum of squared residuals: the sum, over all data points, of the squared difference between the observed value and the predicted value.
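The cost function described here translates directly into code. This is a minimal Python sketch with made-up data points:

```python
def cost(intercept, slope, xs, ys):
    """Sum of squared residuals: the squared gap between each observed
    value and the value the candidate line predicts."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3]
ys = [2, 4, 6]                    # generated by y = 2x
print(cost(0.0, 2.0, xs, ys))     # 0.0 -- this line fits the points exactly
print(cost(1.0, 2.0, xs, ys))     # 3.0 -- shifting the line up adds error
```

Gradient descent’s job, described next, is to search for the intercept and slope that make this number as small as possible.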

How does it work?

Before we start, let me introduce you to my friend Raj. Raj is blind and lives in a 2-dimensional world. He was happily living in a valley until one day someone took him to the mountains, and now he has lost that person. Raj wants to get all the way back down to the valley, and he only has one stick to navigate with. So how will he descend towards the valley?

img1 (1)

Despite living in a 2-dimensional world, Raj knows gradient descent. He first puts his stick on the right and finds that the ground on the right is slightly higher than the ground where he is standing. Then he puts his stick on the left and finds that the ground on the left is slightly lower than the ground where he is standing. Since Raj wants to go down the hill, he moves one step towards the left. He then repeats the same process until he reaches the valley.

img2 (1)

Now let’s assume that Raj has the power to manipulate his size. Using this power, Raj increases his size and heads down the hill, thinking that big steps might help him reach the valley quicker.

img3 (1)

But after descending to a particular height he started going up again, then back down, and he never reached the valley.

At this point he knows something is going wrong, since he hasn’t reached the valley yet. He realises the problem is his large steps, which aren’t getting him any closer to the valley. He also knows that small steps may take much longer to get there.

img3 (1)

So he comes up with an idea: he starts with a large step, and with every step towards the valley he decreases the step size. In this way he finally reaches the valley.

img5 (1)

Does this example have anything to do with adjusting the model’s parameter?

In this article we have already seen that when fitting lines our ultimate goal is to minimise the cost function. In the least square method we used an analytical approach to find the optimal values of the model’s parameters. In gradient descent we will use an iterative, numerical approach.


We know that in linear regression we use straight lines to describe the relationship between two variables. We do this by finding the values of the intercept and slope that minimise the cost function. Assume we already have the optimal value of the slope. Now let’s see how we can use gradient descent to find the optimal value of the intercept. To get started we can choose any value for the intercept at random; let’s say the initial intercept is zero.

We have drawn a line passing through zero with some slope say s1. Now let’s calculate the sum of squared residuals for intercept I1.


Similarly we can do this for various values of intercept.


After finding the sum of squared residuals for various values of the intercept, we get a graph that looks like this:


Now we have to find the global minimum, because that is where the sum of squared residuals is smallest. Before you get lost, let me remind you that the minimum sum of squared residuals gives the line that best fits.


Now let’s use gradient descent. The story I told about Raj applies here: this is what Raj was processing in his mind while finding the valley. Raj used a stick to find the slope; we will use a derivative. Don’t be scared by the word derivative. The diagram below shows what a derivative actually is.


We take the derivative at every point on the curve to decide which way to move. If the slope is negative we step to the right; if it’s positive we step to the left. Either way, we move downhill.

The amount by which we move towards the minimum is called the step, like Raj’s stride when he was moving down the valley. Here we can see that as we move from one end of the curve to the other, the slope changes drastically. This means the step should depend on the slope: step size = slope × learning rate. The learning rate is decided by us, and it should be chosen carefully, since a large learning rate produces large steps that may overshoot, while a small learning rate may take forever to reach the minimum.


We always allow only a limited number of steps to find the minimum. If the steps run out before the minimum is found, the algorithm stops anyway; this upper limit prevents it from running forever. If the algorithm finds the minimum before the steps are exhausted, it stops and returns the optimal line.

Once we reach the minimum, we take the corresponding intercept and put it into the equation to get the line that best describes the relation between the two variables.
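The intercept-only procedure above can be sketched in Python. The data, learning rate, stopping tolerance, and step budget are illustrative choices, not prescribed values:

```python
def gradient_descent_intercept(xs, ys, slope, lr=0.01, max_steps=10_000, tol=1e-9):
    """Find the intercept that minimises the sum of squared residuals,
    holding the slope fixed, as in the walkthrough above."""
    intercept = 0.0  # the random starting guess
    for _ in range(max_steps):
        # Derivative of sum((y - (intercept + slope*x))**2) w.r.t. intercept.
        grad = sum(-2 * (y - (intercept + slope * x)) for x, y in zip(xs, ys))
        step = lr * grad
        if abs(step) < tol:   # minimum found before the steps ran out
            break
        intercept -= step     # move opposite to the derivative's sign
    return intercept

xs = [1, 2, 3]
ys = [3, 5, 7]                # generated by y = 2x + 1
b = gradient_descent_intercept(xs, ys, slope=2.0)   # converges near 1.0
```

Note how the loop encodes both stopping rules from the text: a tolerance on the step size and a cap on the number of steps.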


In the above example we only adjusted the intercept to minimise the sum of squared residuals, but in the real world we also need to find the slope that minimises it. To do this we repeat the same procedure using both intercept and slope: we take the partial derivative of the cost function with respect to the intercept and with respect to the slope, which gives the slope of the cost surface at every point.



We now know that gradient descent can optimise two parameters, and if there are more than two parameters we can still use it. If we were using least squares to solve for the optimal values, we would simply find where the slope of the curve equals zero; gradient descent instead finds the minimum by taking steps from an initial guess until it reaches the best value.
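With both parameters in play, the same loop simply uses the partial derivative with respect to each. A minimal Python sketch on made-up data (the learning rate and tolerances are again illustrative):

```python
def gradient_descent(xs, ys, lr=0.01, max_steps=100_000, tol=1e-9):
    """Optimise intercept and slope together, using the partial
    derivative of the cost with respect to each parameter."""
    intercept, slope = 0.0, 0.0
    for _ in range(max_steps):
        residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        d_intercept = sum(-2 * r for r in residuals)
        d_slope = sum(-2 * r * x for r, x in zip(residuals, xs))
        if abs(lr * d_intercept) < tol and abs(lr * d_slope) < tol:
            break
        intercept -= lr * d_intercept
        slope -= lr * d_slope
    return intercept, slope

xs = [1, 2, 3]
ys = [3, 5, 7]                       # generated by y = 2x + 1
b, s = gradient_descent(xs, ys)      # approaches intercept 1, slope 2
```

Extending to more parameters just means computing one more partial derivative per parameter and updating them all in each step.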

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]


April 30, 2022

What was the probability that Newton was hit by a coconut?

Remember when you were in school and you were being taught Newton’s laws in Physics? It always started with the story of Newton sitting under an apple tree and how when an apple fell on his head the idea of gravity came into his mind. While studying his theories, oh how we wished he sat under a coconut tree instead!


Now what would the probability of that have been? Well, considering that the UK’s climate is not suitable for growing coconuts, there was barely any chance of it happening. So yes, it was inevitable, and we are stuck with Newton’s laws forever. Of course, they have played a crucial role in shaping our understanding of the workings of nature.

How did it all start?


It all began when a French gambler, Chevalier de Méré, came across a problem: which of two games did he stand the higher chance of winning? In Game 1 he had to throw a fair die four times, and if he got a six he won. In Game 2 he had to throw two fair dice up to twenty-four times, and if he got double sixes he won.

To find the answer, Chevalier played many times, only to find that he had a greater chance of winning Game 1.

We will come to the problem again in a later part of the article and see how we can find the answer without carrying out an experiment several times.

What is probability?

If we want to measure weight we use kilograms, and for height we use centimetres, inches, or metres. But how do we measure probability? Is it even a real quantity? It seems that probability is not a physical quantity at all, but we can measure it indirectly using numbers between 0 and 1.

Philosophers and statisticians have made some famous suggestions about what those numbers between 0 and 1 mean:

Classical probability: This is the ratio of the number of outcomes favouring the event to the total number of possible outcomes. Here we assume that all outcomes are equally likely, for example when rolling dice or tossing coins.

Enumerative probability: Remember the problem where a bag contained 3 red balls and 4 white balls, and we had to calculate the probability of drawing a red ball or a white ball? Since it rests on the idea of a random choice from a physical set of objects, we can safely say it is an extension of classical probability.

Long-run frequency probability: If you toss a coin an infinite number of times, the proportion of heads (or tails) settles at 0.5, no matter how many times the experiment is carried out. This interpretation is based on the proportion of times an event occurs in an infinite sequence of identical experiments. Chevalier’s problem also falls into this category.

Subjective or ‘personal’ probability: This is a specific person’s judgement about a specific occasion, based on their current knowledge, and is roughly interpreted in terms of the betting odds that they would find reasonable. That means any numerical probability is essentially constructed according to what is known in the current situation.

Different experts prefer different alternatives to describe probability.

Probability is the result of randomness: for any random phenomenon, the probability of a particular outcome is the proportion of times that outcome would occur in a long run of observations.

Time to Know the Rules of Probability.

The probability of an event is a number between 0 and 1.

The complement rule says that the probability of an event not happening is one minus the probability of it happening.

The addition (OR) rule says to add the probabilities of mutually exclusive events to get the total probability.

The multiplication (AND) rule says to multiply probabilities to get the overall probability of a sequence of independent events occurring.

Now let’s get back to the problem and see why Chevalier had a greater probability of winning Game 1 than Game 2.

In Game 1 we throw the die four times and win if we get a six at least once.
So let’s answer a few questions. First: what is the probability of getting a six on one throw?
It is ⅙, using the classical definition of probability. So indeed the probability is between 0 and 1.

Now let’s ask the 2nd question: what is the probability of not getting a six? Using complement rule we get 1 – ⅙ which is ⅚.

The next question: what is the probability of not getting a six in four throws of the die? It is ⅚ × ⅚ × ⅚ × ⅚ ≈ 0.48.

And the final question: What is the probability of getting a six at least one time after we roll the dice four times? Well it is 1 – 0.48 = 0.52.

If we follow the same procedure to find the probability of winning Game 2 then we will get 0.49 as our answer.

So it is clear why Chevalier won more often at Game 1: its winning probability (0.52) is slightly better than even, while Game 2’s (0.49) is slightly worse.
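We can check both results in a few lines of Python, applying exactly the complement and multiplication rules from above:

```python
from fractions import Fraction

# Game 1: four throws of one fair die, win on at least one six.
# P(no six in one throw) = 5/6; multiply for 4 independent throws,
# then apply the complement rule.
p_game1 = 1 - Fraction(5, 6) ** 4

# Game 2: twenty-four throws of two fair dice, win on at least one
# double six. P(no double six in one throw) = 35/36.
p_game2 = 1 - Fraction(35, 36) ** 24

print(round(float(p_game1), 2))  # → 0.52
print(round(float(p_game2), 2))  # → 0.49
```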



Though probability was invented because a gambler had a problem, it has applications in almost every field. It is used extensively in academics, computation, the stock market, business, and more.
In machine learning we use probability theory for regression, classification, and minimising errors.
People say the stock market is equivalent to gambling, which is partially true if you invest without doing any research.
We also use probability to prepare for natural disasters, in weather forecasting, and to predict the annual growth of a country’s economy.
In short, probability is used to predict the future.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

How to Fit Lines Using the Least Squares Method?

April 20, 2022

What is Regression?

Regression is a technique under supervised learning that estimates a function giving the relationship between variables. If we have features X = (x1, x2, …, xN)ᵀ (independent variables) and a target variable Y (dependent variable), our task is to find the function Y = f(X) that relates the features to the target.

Here we try to understand how Y changes when X changes so that we can use this understanding further to predict Y for given X.

Let’s try to understand this with an example. Suppose you are conducting a case study on a set of land plots in Pune to see how the price of a plot changes with its size.

First we will collect the details of each plot as shown in the table below.


Now to understand the relationship between these two variables we draw the scatter plot.


Here we can see that as the size of a plot increases, the price also increases. The scatter plot above shows a linear relationship between plot size and plot price, which means larger plots will usually be priced higher.

Linear Regression.

Linear regression is a statistical method used for predictive analysis, and it has been studied in great detail; in this section we will explore it step by step. Linear regression makes predictions for continuous/real-valued variables such as sales, salary, stock prices, and housing prices.

In linear regression we use a linear approach to model the relationship between two variables by finding a function that is a close fit to the data. Mathematically we need to find the function y = f(x) that best describes the relationship between variables x and y.

The modelling assumptions are that x is an independent variable or predictor variable and y is a dependent variable or response variable. There is a linear relation between x and y.

When there is a single independent variable, the method is referred to as simple linear regression and when there are multiple independent variables then the method is called multiple linear regression.

The word ‘linear’ in linear regression does not refer to fitting a line but rather it refers to the linear algebraic equations for the unknown parameters.

Under linear regression we are going to study algorithms like Least Square, Gradient Descent and Regularisation.

Fitting a line using least squares:

Let’s try to understand this with an example.

Consider the hypothetical data below.


What do you think? Which line better describes the relationship between two variables?

In an attempt to find the best fit line which accurately describes the relationship between two variables, let’s start with a horizontal line whose equation is y = c


Consider the point (x1, y1); the distance between c and y1 is (c - y1). Similarly, the distance between c and y2 is (c - y2), so far making the total distance (c - y1) + (c - y2).

We can keep going: after adding the distance between c and y3, the total is (c - y1) + (c - y2) + (c - y3).


The distance between c and yn is (c - yn), which is negative. That’s not good, as it will subtract from the total and make the overall fit appear better than it really is. Likewise, yn+1 would reduce the total further.

To tackle this, mathematicians came up with a solution: square each distance term before adding them up.

After doing this our new equation looked like this-

(c - y1)² + (c - y2)² + (c - y3)² + (c - y4)² + (c - y5)² + (c - y6)² + (c - y7)² + (c - y8)² + (c - y9)² + (c - y10)² + …

This is our measure of how well the line fits the data. It’s called the “sum of squared residuals”.

The residuals are the distances between the real data and the line, and we are summing the squares of these values.

If we rotate the line anticlockwise, the sum of squared residuals decreases until the line reaches its optimal position; if we keep rotating past that point, the sum of squared residuals starts to increase again.

So the ultimate aim will be to find the optimal position where this sum of squared residual is minimum.

To do this, let’s start with the generic line equation y = a0 + a1x.

We want to find the optimal values of a0 and a1 that minimise the sum of squared residuals.

Mathematically this is given by S(a0, a1) = Σ ei² = Σ (yi − a1xi − a0)².

where (a1xi + a0) gives the value of the line at position xi and yi is the observed value at xi.

The above equation calculates the distance between the line and the observed values at each xi.

The reason this method is called least squares is that we want the values of a0 and a1 that give the line with the smallest sum of squared residuals.

The plot of sum of squared residual versus each rotation is given below.


From the graph we can see that as we rotate the line, the sum of squared residuals first decreases, and after a certain rotation it starts to increase with each further rotation.

To find the optimal rotation of the line we take the partial derivative of the function. The derivative gives the slope of the function at each point.

The slope is zero at the best point, which is where we get the least squares. Note that different rotations correspond to different values of the slope a1, and different vertical shifts to different values of the intercept a0.

This can be better understood if we add one more axis for the intercept, as shown in the diagram below.


If we fix the intercept at some point d, we can plot a curve of the sum of squared residuals for different slopes and see how its value changes. We can repeat this for various values of the intercept until we find the slope and intercept for which the sum of squared residuals is minimum.

Once we have the values of a0 and a1, we can plot the line that best describes the relationship between the two variables.



The important things to remember are:

1) We want to minimise the square of distance between observed values and the line.

2) We do this by taking the derivative and finding where it is equal to zero.

3) The final line minimises the sum of squares.
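The three steps above can be sketched in a few lines of Python; the toy data here is hypothetical, and the closed-form expressions come from setting both partial derivatives of the sum of squared residuals to zero:

```python
# Sketch: closed-form least squares fit of y = a0 + a1*x.
def least_squares_fit(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # a1 = covariance(x, y) / variance(x), from setting dS/da1 = 0
    a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    a0 = mean_y - a1 * mean_x  # from dS/da0 = 0: the line passes through the means
    return a0, a1

a0, a1 = least_squares_fit([1, 2, 3, 4], [2, 4, 6, 8])
print(a0, a1)  # → 0.0 2.0
```

Because the toy data lies exactly on y = 2x, the fitted intercept is 0 and the slope is 2.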

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

Decoding the NFT Obsession – Why Are People Buying Them Like Crazy?

April 9, 2022

If you were given $69 million, what would you do with it? Probably buy Kim Kardashian’s house at California Hidden Hills, or adopt thousands of villages, or maybe you could consider buying digital art sold by Beeple?

Yes, you got me right, ‘Everydays – The First 5000 Days’ digital artwork was sold for $69 million through NFT!

NFT stands for Non-Fungible Token that has lately been a buzzword in mainstream media and almost every other person is talking about it. However, as much as NFTs are becoming more and more popular, there still seems to be a lack of understanding of what they actually are amongst the vast majority.

In this blog I will explain what NFTs are, how they intersect with blockchain, and why people buy them.

What are NFTs?

To understand NFTs, we must understand the term ‘fungible’ – a popular term among economists. Fungibility describes goods that can be substituted or exchanged while still holding the same value.

Imagine we have both ordered a pair of Nike shoes, exactly identical in colour, size, and model number. If we asked a third friend to tell them apart, there is no way he could. They are identical and interchangeable.

Likewise, most currency is fungible. It is an interchangeable asset like gold or casino chips. Exchanging them in the same quantity and quality would not make a difference.

Non fungible in this scenario refers to goods that cannot be substituted. There’s a unique attribute or characteristic associated with the good that makes it unique from other goods.

Imagine you and I have both booked movie tickets. I have tickets to Avengers: Endgame and you have tickets to the Bollywood movie Himmatwala. If you asked me to exchange tickets, I wouldn’t trade them even for 100 bucks more. The point is that these cannot really be substituted; they are non-fungible.

Check out this phenomenal image adapted from Rhett Dashwood portraying the difference between fungible and non-fungible assets.


Hence, an NFT is a token, a digital asset that can be traded but is non-fungible. The token is a digital certificate stored on a distributed and secure database called a blockchain.

Intersection between NFT and Blockchain

NFTs were invented with the help of blockchain technology. You can think of an NFT as a chunk of data (an image, song, GIF, meme, etc.) that is authorised, identified, and approved as unique. The token contains specific information that distinguishes it from every other NFT and proves ownership of the underlying digital asset.


When a token or asset is considered as non-fungible, that means:

  • It cannot be replicated since each NFT differs from others. It is unique in nature.
  • Since NFTs are digital assets, they can be copied, downloaded and shared. However, the original NFT and the proof of its ownership is embedded on the blockchain. Nowhere else can a totally identical version of the NFT be found.
  • NFTs are verifiable, that means past data is stored on the blockchain and it authenticates the original creator and owner.

Let’s understand this by considering this famous painting by Vincent Van Gogh ‘The Starry Night’


You might have seen this painting on Google or someone’s house or it may have randomly appeared somewhere.

But the funny thing is, the original painting is sitting somewhere in New York right now, and it’s almost worth $800 million.

Imagine getting a printout of this image and trying to sell it for a million dollars. Obviously you cannot do it because you don’t have its certificate of ownership since that is the only way to certify that it is an original painting by Van Gogh.

But let’s focus on something that I am sure has got all of us wondering: why do people buy an image worth $69 million, when they could simply download it from Google? Let’s decode the psychology behind this.

i) It’s not real money like Rupees and Dollars


Image Source

Meet CryptoPunk #7610, which has been minted on the blockchain and has unique visual characteristics among 10,000 other punks. Visa Inc. bought this digital avatar for $150,000, roughly the price of a luxurious flat in a city. To be clear, though, dollars were not part of the transaction; Ethereum was. Same thing? Not really.

When trading NFTs, Bitcoin or Ether acts like a casino chip. Psychologically, it is easier to spend a casino chip than real money; this is why casino chips exist.

Casino chips and Bitcoin have many similarities. For instance, in a casino, the first thing we would do is exchange dollars for casino chips. The same procedure applies to NFT trading. One has to figure out a digital wallet on their browser and wire some ETH to start trading it. Paper currency won’t work.

When we gamble with casino chips or ETH, we create an abstraction layer between the physical asset (paper currency) or digital asset (NFT) and the value it represents. We are less afraid to lose casino chips than real money, which also explains the phrase “all in” we commonly hear in casinos.

ii) The Scarcity Effect

Products that are limited are often valuable and having these in your possession shows others that you are unique and interesting. People are likely to engage in behaviour that makes them part of an exclusive group.

Researchers have shown that people value something more merely because it is scarce. Digital currencies, and NFTs in particular, are driven by the principle of scarcity. It is the first time in history that a series of digital assets can be created and owned. Each NFT has its own traits, so potential buyers can easily determine its rarity. The rarer an NFT is, the better, and the more people will be drawn to it.


Image Source


iii) Ownership and Possession

The drive to own or possess something makes a person innately want to improve what they own and to own even more. Holding status and being respected are important to most people, so spending heavily on unique things as status symbols is common behaviour.

For some, NFT trading is just like owning the latest iPhone model. What people gain is a sense of identity. It’s a form of social currency, a way of belonging.


Obviously, status is not the only notion behind this behaviour of trading and collecting NFTs. Completing the set and owning the whole set of unique artforms is also a strong drive for collectors.

iv) Top-down perception

Buying NFTs is certainly an investment and may also give the buyer some bragging rights and cultural cachet. But that’s not all: psychologists have found that it also has something to do with our perception.

Two identical images may be differentiated pixel-by-pixel, but looking at one may feel very different from another. The reason behind this is that human perception is not fully determined by visual input (which is almost identical in this case). This also depends on our beliefs and knowledge about the image. This way our perceptual experience is the result of a two-way interaction between the visual input from the image and our knowledge about the image.

v) The Endowment Effect

According to this notion, people are more likely to retain a commodity they own than acquire the same commodity if they do not own it. Imagine you have a bottle of wine and one day you find out its value to be around $400.

Would you sell the bottle? Probably not unless you need the money urgently. However, would you buy a bottle of wine for the same amount? Probably not.

We tend to follow a similar pattern in NFT trading. When a person buys an NFT, they can set the resale price as high as they want. NFTs are usually highly priced because of emotional attachment and the endowment effect. When owners of limited collections apply the endowment effect and sell their NFTs at high prices, the floor price of the collection in turn rises.


Image Source


It is both hard and fascinating to imagine the thousands of dollars someone will pay for a simple GIF or piece of 3D art. More than a long-term investment, NFTs seem to be an obscure concept built around psychological gravitation.

In a digital era, digital notions and objects will have more significance than physical ones. Under the blockchain’s custody, the world can view a digital asset but not copy it, because of the unique hash that is generated. Artists are slowly gravitating towards the NFT marketplace, creating, experimenting, and buying into the movement. Some do it out of pure catharsis, others out of curiosity. But the outcome is powerful, since we now have a generation of fragmented creator space.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

How to Train ML Models With Limited Data

March 23, 2022

In the past few articles it was stressed many times that the human brain is very good at recognising things once it has seen them. For example, if I show a picture of an apple to a 3-year-old child and then ask him to classify pictures of apples, he can do the job with ease. He will not classify a tomato as an apple.

But when it comes to machine learning models, we have to train them with thousands of images of apples taken from various angles, and still there is a one or two percent chance that the model will classify a tomato as an apple.


We usually use a large dataset to train a machine learning model to achieve the desired accuracy. But there may be situations where we are short of data points. Consider a hypothetical situation where we are exploring the ocean for a rare species of shark. We want to build a classifier, installed in a submarine, that identifies different species in the ocean. Since the species is rare, we obviously won’t have many images of it, so we want our machine to learn everything possible from every image. We cannot afford to set even a few images aside.

But if we use all the images for training, we won’t have a testing set to check the accuracy of the model. We cannot use the training set as a testing set, because that would be cheating and we would not know the model’s true accuracy. This difficulty can be resolved by a technique called cross-validation (or rotation validation).


The core idea is that we split the dataset into a temporary training set and a temporary testing set, train the model on the former, and evaluate it on the latter. After noting the score, we split the data again into different temporary training and testing sets, retrain, and note the new score. After repeating this a sufficient number of times, we take the average of all the scores, which denotes the performance of our model.


As an example, assume we have 100 images of the rare shark species. We split the dataset into, say, 4 parts. In the first round we use the last part as the testing set and the first three parts for training; we train the model and note its accuracy on the testing set. We then select the third part as the testing set and use the remaining parts for training, and so on, until each part has been used as the testing set exactly once. The method we just described is called K-fold cross-validation.
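A minimal sketch of this splitting scheme in plain Python (a toy dataset of 8 items stands in for the 100 shark images; real projects would typically use a library routine such as scikit-learn’s KFold):

```python
# Sketch: K-fold cross-validation splits (k = 4) on a toy dataset.
def k_fold_splits(data, k):
    fold_size = len(data) // k
    for i in range(k):
        # fold i is held out for testing; everything else is for training
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, test

data = list(range(8))
for train, test in k_fold_splits(data, 4):
    print(train, test)  # each item appears in exactly one test fold
```

In practice we would train a model on each `train` list, score it on the matching `test` list, and average the four scores.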



There are two types of cross validations.

  • Exhaustive cross-validation
  • Non-exhaustive cross-validation.

Exhaustive cross-validation:

In this method the dataset is split as training and validation set in various ways and the model is trained and evaluated using each possible combination of the training and testing set.

Further Exhaustive cross-validation consists of Leave-p-out cross-validation and Leave-one-out cross-validation.

In leave-p-out cross-validation, the validation set consists of p observations and the remaining observations form the training set. Leave-one-out cross-validation is the special case where the validation set consists of only one data point, i.e. p = 1.

Both techniques are computationally expensive, but leave-one-out requires less time than leave-p-out: leave-p-out requires training and validating the model C(n, p) times, whereas leave-one-out requires doing so only n times, where n is the total number of observations.
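A small illustrative sketch of leave-p-out splitting in Python (toy data; it simply enumerates all C(n, p) possible validation sets):

```python
from itertools import combinations

# Sketch: leave-p-out cross-validation splits on a tiny toy dataset.
def leave_p_out(data, p):
    for val in combinations(data, p):
        train = [x for x in data if x not in val]
        yield train, list(val)

splits = list(leave_p_out([1, 2, 3, 4], 2))
print(len(splits))  # → 6, i.e. C(4, 2) splits
```

With p = 1 this reduces to leave-one-out, giving exactly n splits.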

Non-exhaustive cross-validation

Non-exhaustive cross validation methods do not compute all the ways of splitting the original sample. This technique is an approximation of Leave-p-out cross-validation.

Non-exhaustive cross-validation consists of K-fold cross-validation, holdout method and repeated random sub-sampling validation.

We have already gone through K-fold cross-validation. In the holdout method we randomly assign data points to the training and testing sets. The sizes of the two sets are arbitrary, but usually the training set is bigger than the testing set. In typical cross-validation the results of multiple runs of model testing are averaged together; in contrast, the holdout method, in isolation, involves a single run.

In repeated random sub-sampling validation, also known as Monte Carlo cross-validation, multiple random splits into training and testing sets are created. The model is trained and evaluated on each split, and the results are averaged over the splits. Unlike K-fold, the proportion of the training/validation split is independent of the number of repetitions.
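A rough sketch of Monte Carlo cross-validation splitting in Python (toy data; the test fraction, repeat count, and seed are arbitrary assumptions):

```python
import random

# Sketch: repeated random sub-sampling (Monte Carlo) splits.
def monte_carlo_splits(data, test_fraction, repeats, seed=0):
    rng = random.Random(seed)
    n_test = int(len(data) * test_fraction)
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)  # a fresh independent random split each time
        yield shuffled[n_test:], shuffled[:n_test]  # train, test

for train, test in monte_carlo_splits(list(range(10)), 0.3, 3):
    print(len(train), len(test))  # → 7 3 each time
```

Each repetition would train and score the model; the scores are then averaged.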


When our dataset is small, we should use a cross-validation technique to estimate the accuracy of our model. The techniques we discussed here are leave-p-out cross-validation, leave-one-out cross-validation, K-fold cross-validation, the holdout method, and repeated random sub-sampling validation. Other techniques include stratified K-fold cross-validation, time series cross-validation, and nested cross-validation.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

Optimising Your Training and Testing Set For ML Algorithms

In the last few articles we trained our models using a training set and checked whether they had learned anything using a testing set. But we never discussed what the training and testing sets should look like, why they matter, or how the training process works. In this article we will discuss training data and testing data.

So let’s go through them one by one.

Training Data:

Assume you are learning mathematics at school. School is a place where you learn, train yourself, and update or enhance your skills. In this case you are training to solve certain mathematical problems: you work through the solved examples in the textbook, go through different solution methods, and train yourself to recognise a problem and apply a suitable method. The solved examples are nothing but training data. The problem is the input and the right answer is the output. Here you don’t guess the answer; you try to find the right one. Every input problem has a valid right answer, and you are supervised while finding it.

The collection of all the samples we’re going to learn from, along with their labels (answers), is called a training set. A training set generally consists of many diverse examples. Using the training set, our model learns to guess the output: if the guessed output is right we move on to the next data point, otherwise we feed in the right output and continue. Below is a flow chart showing how models are trained.


From the training set we feed each sample to the model, and it tries to predict the right output, or label (I will use ‘output’ and ‘label’ interchangeably, as they mean the same thing). If the output is correct we go on to the next sample; if it is wrong we supply the correct output and then move on. As training progresses, internal variables that help the model predict the right label are updated with every right and wrong prediction.

Each time we run through a complete training set, we say that we have trained for one epoch. We usually run through many such epochs so the system sees every sample many times.

After training the model it’s time to check if our model is predicting labels accurately. To do this we need a testing dataset.

Testing Data:

Before we begin, here is a story. Once upon a time, the US Army wanted to use neural networks to automatically detect camouflaged enemy tanks, so researchers trained a neural network using standard supervised learning techniques. They used 200 photos: 100 for the training dataset and 100 for the testing dataset. In both datasets, 50% of the photos contained camouflaged tanks among trees and the other 50% contained trees with no tank. The researchers ran the neural network on the 100 held-out photos, and without further training it classified them all correctly. Success confirmed!

The researchers handed the finished work to the Pentagon, which soon returned it, complaining that in their own tests the neural network did no better than chance at discriminating the photos. It turned out that in the researchers’ dataset the photos of camouflaged tanks had been taken on cloudy days, while the photos of plain forest had been taken on sunny days. The neural network had learned to distinguish cloudy days from sunny days instead of camouflaged tanks from empty forest.



Hence, we have to be very careful while training machine learning models. As in the previous example, if our training dataset is not diverse, there is a risk of overfitting. To avoid this, we need some measure other than performance on the training set to predict how well our system will do once deployed.

It would be great if there were some algorithm or formula that told us how good our model is. But there isn’t, so we have to do it the traditional way, like scientists do: run experiments and see what actually happens in the real world. We must run experiments to see how well our systems perform.

To achieve this we need to give new, unseen data to our model and see how well it does on new, unseen data. This unseen data is nothing but our Testing set.

We never learn from test data. By now you know that the more data points in the training set, the higher the accuracy. You might think: let’s train our model on the entire dataset, then split it into training and testing sets and evaluate on the testing set. This will indeed give you close to 100% accuracy, but it is what we call cheating; if you deploy the model now, it will not give good results.

It is the same as mugging up all the solutions to a mathematics problem and getting good marks in final exams, and then failing in all the entrance exams.

If we take the example of our school, testing is like the final exams, and the marks tell us how much we have understood. Of course, if we already knew the questions and their solutions beforehand, we would perform well. But again, this would be cheating, just like training the model on the entire dataset.

For this reason we split the dataset into a training set and a testing set before training our model. So now we train our model on training data and check performance on unseen dataset which replicates the real world i.e testing set. We use the testing dataset only once after the training is over. We must always ensure our model never sees a testing set during training.

The problem of accidentally learning from the test data has its own name: data leakage, also called data contamination, or contaminated data. Always make sure the test data is kept separate and that it is only used once, when training has been completed.


We often split our original data collection into two pieces: a training set and a testing set. The training set typically consists of 75-80% of the original dataset and the testing set of the remaining 20-25%. During splitting, samples are chosen randomly for each set. Most machine-learning libraries offer routines to perform this splitting for us.
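As a sketch, this is how the split is usually done with scikit-learn's train_test_split routine (this assumes scikit-learn is installed; the arrays below are dummy data for illustration):

```python
# A 75/25 train/test split with scikit-learn (dummy data for illustration).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(80).reshape(40, 2)  # 40 samples, 2 features
y = np.arange(40) % 2             # dummy labels

# Samples are shuffled before splitting; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(len(X_train), len(X_test))  # 30 training samples, 10 testing samples
```

The testing arrays are then touched only once, after training has finished.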


Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

How To Do Features Selection For Your ML Models

March 10, 2022

Let’s assume that you have never seen a train in your entire life and now we are in the perfect position to play a game. Well the game goes by the name of ‘Find the train’. So I will show you one photo of the train and then I’ll show you different photos and you have to guess which image has the train. So let’s begin, here is the photo of the train.


So you can see that the long, snake-like vehicle in the image is a train. Below are two images: one has a train and the other doesn't.



You must have guessed it right in a snap. But how did you do it? Notice that when I told you to focus on snake-like structures, you immediately filtered out other things in the image, like trees and houses. You focused on just one thing, the snake-like structure, and registered that our train has a similar shape. You ignored the features of the image that made it beautiful but were not our concern. Our main aim was to spot the train, so you told your brain to register just that one thing and discard the other, useless features.

Alas, machines are not that smart. They need to be trained on millions of data points to perform the same task with 95% accuracy. Consider the dummy dataset below.


Column K is the output and the remaining columns are inputs. In a previous article we saw that too many features are a curse. We can also see from the table that many features are similar, and some may not be useful at all, so we should drop those features.

Keeping useless features is like feeding noise or garbage to our machine learning model. Removing them reduces training time, helps avoid overfitting, and improves accuracy if the right subset is chosen. This process of removing redundant, irrelevant, or otherwise useless features from a dataset is called feature selection.

In this article we will discuss a few approaches to feature selection. Feature selection methods fall into two families: supervised and unsupervised. Supervised methods are further divided into the intrinsic method, the filter method and the wrapper method.

In supervised methods, the target variable is used to decide which variables are redundant. Inputs are selected specifically to increase the accuracy of the model or to reduce its complexity; the outcome is used to quantify the importance of the input variables.

In unsupervised methods, by contrast, only the input variables are considered. Under the supervised family we will go through the filter method and the wrapper method, one by one.

Filter Method:

A dataset contains various redundant features that should be removed. In the filter method we take a single column as input, check whether there is any relationship between that feature and the target, and compute a score for the feature using a statistical measure. We repeat this for each column, so every feature gets a score. We then rank the features by score, set a threshold value, and remove the features whose score falls below the threshold.


These methods are often univariate and consider features independently or with respect to some dependent variable.

Statistical measures used in filter methods include Pearson's correlation, Spearman's correlation, ANOVA, Kendall's tau, Chi-Squared and Mutual Information. The tree diagram below shows which measure to use according to the types of the input and output variables.


This is how the tree diagram should be read: if the input variable is numerical and the output variable is also numerical, then the statistical measure to use is Pearson's correlation.

Pearson's correlation is a measure of the strength of association between two variables. It is used to quantify linear dependence between two continuous variables X and Y, and its value ranges from -1 to 1. The formula for Pearson's correlation is given below.


Here X and Y are the variables, and X̄ and Ȳ are their respective means:

r = Σ(Xi − X̄)(Yi − Ȳ) / √( Σ(Xi − X̄)² × Σ(Yi − Ȳ)² )

Consider the Boston housing dataset below.


Using Python you can calculate the correlations between the variables. Below are the correlations computed for the above dataset, displayed as a heatmap. Note that CAT. MEDV is the target variable and the others are input variables.


From the above map we can see that MEDV has a strong correlation with the target variable, i.e. 0.79, followed by RM. If we set the threshold at 0.45, then all variables except MEDV, RM, and LSTAT are to be dropped.
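The same threshold-based filtering can be sketched with pandas. The numbers below are made-up stand-ins for RM, LSTAT and MEDV, not the real Boston housing data, and MEDV is treated as the target purely for illustration; the 0.45 threshold mirrors the example above:

```python
# Pearson correlations and threshold-based filtering with pandas.
# Made-up values; MEDV is treated as the target for illustration.
import pandas as pd

df = pd.DataFrame({
    "RM":    [6.5, 7.2, 5.9, 6.8, 7.5, 5.5],
    "LSTAT": [4.9, 4.0, 9.7, 5.2, 3.1, 12.4],
    "MEDV":  [24.0, 34.7, 18.9, 28.7, 36.2, 15.0],
})

corr = df.corr(method="pearson")    # full correlation matrix
print(corr["MEDV"])                 # each feature's correlation with the target

# Keep only features whose |correlation| with the target clears the threshold
threshold = 0.45
selected = corr.index[corr["MEDV"].abs() > threshold].drop("MEDV")
print(list(selected))
```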

Now, if we look carefully, RM and MEDV share a strong correlation with each other. That means we can drop either RM or MEDV from the features, since they affect the target variable in much the same way; they carry largely the same information.

So which variable is to be dropped?

Look again at the heatmap and find which variable has the stronger correlation with the target variable. In this case MEDV has the stronger correlation with the target variable CAT. MEDV. So we drop RM and train our model using only two variables, i.e. LSTAT and MEDV.

This is how we use Pearson's correlation in feature selection.

Wrapper Method:

Let’s consider a dataset that contains many features. Now in the wrapper method we feed different combinations of these features to the machine learning algorithm and note the accuracy and error for those features. The features which predict correct output with maximum accuracy and minimum error are kept and the rest are discarded.


Under the wrapper method we will discuss forward feature selection and backward feature selection.

Consider a feature subset F, which is initially empty, and a dataset containing, say, n features. We feed the 1st feature from the feature space to the machine learning algorithm and note the error. If the error is below a threshold, we add the feature to subset F; otherwise we drop it.

We do the same for all n features, so subset F contains only those features whose error was below the threshold. After that we take the feature with the lowest error and try it in combination with different features from subset F, feeding each combination to the machine learning algorithm and looking for the combinations that perform well. Again we note the error and discard the combinations that give more error. In this way we remove unimportant features from the dataset. This is how forward feature selection works.

In backward feature selection we start with a feature subset F containing all the features in the dataset. We feed these features to the machine learning model, keep the features that give the best evaluation measure, and remove the rest.
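Both procedures are implemented in scikit-learn as SequentialFeatureSelector. The sketch below assumes scikit-learn 0.24 or newer is installed; the synthetic dataset, the linear-regression estimator and the choice of 3 features are all arbitrary illustrations, not part of the method itself:

```python
# Forward and backward feature selection with scikit-learn's
# SequentialFeatureSelector on a synthetic regression dataset.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       random_state=0)

model = LinearRegression()

# Forward: start empty, greedily add the feature that helps most
forward = SequentialFeatureSelector(model, n_features_to_select=3,
                                    direction="forward").fit(X, y)

# Backward: start with all features, greedily drop the least useful
backward = SequentialFeatureSelector(model, n_features_to_select=3,
                                     direction="backward").fit(X, y)

print(forward.get_support())   # boolean mask of the kept features
```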


If we look at the differences between the filter method and the wrapper method: the former relies on statistical measures to remove unwanted features, whereas the latter uses a machine learning algorithm itself.

The filter method is faster than the wrapper method, and it is also less prone to overfitting.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

Charts and Graphs

February 26, 2022

Look at the table below and then look at the graph. I don’t need to tell you which mode delivers the information more efficiently.



The main purpose of a data display is to organise and present data so that your point comes across clearly, effectively, and correctly. Graphs often give you more of a feel for a variable and its distribution than raw data or a frequency table.

Scottish engineer and political economist William Playfair invented four types of diagrams: line graph, bar chart, pie chart, and circle graph. He is considered the founder of graphical methods of statistics.

In this section, we’ll learn about charts and graphs. We will study graphs for categorical variables, time charts and then graphs for quantitative variables.

Graphs for categorical variables

The two primary graphical displays for summarising a categorical variable are the pie chart and the bar graph.

Pie Chart:


A pie chart is a circle with a slice of the pie for each category; it takes categorical data, and the slices should sum to 100% or close to it. For example, the above image shows the percentage of votes polled out of the total votes counted until 18 May 2009. We can easily read off that other state parties got 36.42% of the votes and INC got 28.55%, followed by BJP with 18.80%, and so on.

The percentages add up to 100%, or close to it after rounding. A slice of the pie labelled 'other' signals a lack of detail in the information gathered. It is also good practice to ask about the size of the underlying data, since a pie chart shows only the percentage in each group, not the number in each group.

The Bar Graph


A bar graph is also used to summarise categorical data. It displays a vertical bar for each category, breaking the data down by group and showing how many individuals lie in each group, or what percentage lies in each group. (A bar graph with categories ordered by their frequency is called a Pareto chart, named after the Italian economist Vilfredo Pareto (1848-1923), who advocated its use.)

For example, the above image shows monthly sales data for bikes: 1,000 bikes were sold in May, followed by 900 in June and October. Instead of counts we could also show the percentage of bikes sold in each month.

When evaluating a bar graph, check the units on the Y-axis and make sure they are evenly spaced. When percentages rather than counts are shown, it is also wise to ask for the total number of observations behind the graph.

Time Charts


Look at the above time chart. It shows the revenue of a company over a period of 5 years. The amount at each point in time is shown as a dot, and the dots are connected by lines. In a time chart, the X-axis carries time (hours, days, months, years, etc.) and the Y-axis carries the quantity being measured over that period.

Time charts can sometimes be misleading. For example, if we count the number of crimes committed in a city each year, the count will appear to be increasing. But if we instead look at the crime rate, adjusted for the growing population, we may find it decreasing. So it is important to understand what statistic is being presented and examine it for fairness and appropriateness.

Graphs For Quantitative Variables

In this part we will see how to summarise quantitative variables graphically and visualise their distribution. We will go through Histogram and Box Plot.


A histogram is a more versatile way to graph data and picture a distribution. It uses bars to show the frequencies or the relative frequencies of the possible outcomes of a quantitative variable; it is essentially a bar graph for numerical data.


Consider the above histogram, which shows the distribution of students' weights. To be sure each number falls into exactly one group, the bars of a histogram touch each other but don't overlap. On the X-axis each bar is marked by the values at its beginning and end points, and the height of each bar represents either the frequency or the relative frequency of that group.

In the above histogram the most common outcome lies between 120 and 130 pounds. Selecting the interval width is the crucial part of building a histogram. Too few intervals makes the graph too crude, with mainly tall bars; too many intervals gives a graph full of irregularities, with many very short bars, and we can lose information about the shape of the distribution.
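As a small illustration of how binning works, here is the same kind of grouping done with NumPy (the weights below are made-up values, not the data behind the figure):

```python
# Grouping weights into 10-pound bins with NumPy (illustrative data).
import numpy as np

weights = np.array([112, 118, 121, 124, 125, 127, 128, 129,
                    131, 134, 137, 142, 148, 155])

# Bin edges 110, 120, ..., 160; each value falls into exactly one bin
counts, edges = np.histogram(weights, bins=range(110, 161, 10))

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo}-{hi}: {'#' * c}")
```

With these values the tallest bar is the 120-130 bin, matching the kind of reading made above.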

Box Plot.

A boxplot is a one-dimensional graph of numerical data based on the five-number summary, which consists of the minimum value, the 25th percentile (known as Q1), the median, the 75th percentile (Q3), and the maximum value. In essence, these five descriptive statistics divide the dataset into four equal parts.

A line inside the box marks the median. The lines extending from each side of the box are called whiskers.

Box plots are useful for identifying potential outliers.
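The five-number summary behind a boxplot can be computed directly. This is a minimal sketch using NumPy's percentile function on made-up data, together with the common 1.5 × IQR rule of thumb for flagging potential outliers:

```python
# The five-number summary behind a boxplot, computed with NumPy
# (illustrative data).
import numpy as np

data = np.array([4, 7, 9, 11, 12, 15, 18, 21, 25, 40])

summary = {
    "min":    data.min(),
    "Q1":     np.percentile(data, 25),
    "median": np.percentile(data, 50),
    "Q3":     np.percentile(data, 75),
    "max":    data.max(),
}
print(summary)

# Rule of thumb: points beyond 1.5 * IQR from the box are potential outliers
iqr = summary["Q3"] - summary["Q1"]
outliers = data[(data < summary["Q1"] - 1.5 * iqr) |
                (data > summary["Q3"] + 1.5 * iqr)]
print(outliers)  # the value 40 is flagged
```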



In this article we learned about the various types of graphs and charts used to visualise data. Graphs are time-saving tools when we are dealing with big data: when we want to convey the maximum amount of information in a short time, we simply present the data in graphical form, choosing the graph or chart that fits the need.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

Shape of Distribution

February 24, 2022

We have already discussed the arithmetic mean and the standard deviation, which are powerful ways of describing most statistical distributions involving quantitative variables.

Look at the figure below. We often encounter this shape while working with statistical data.


Such symmetry is very common in statistical distributions especially where biological variations are concerned. But it is not universal.

Skewed Distribution

Look at the two distributions below:

positive skew


Such distributions are called skewed. The skew refers to the tail of the distribution: if the tail is on the right, the distribution is said to be positively skewed, and if the tail is on the left, negatively skewed.

If we look for the mean of the distribution in the 1st case it is X and in the 2nd case it is Y. It is worth noticing the effect skewness has on the relative size and positions of mean, median and mode.

Consider the diagram below based on hypothetical data.


In this diagram mean, median and mode all lie in the same place i.e. at the centre.


But if we look at the data in the above figure, we find that the mode, the most frequent value, lies directly below the peak, the mean is pulled out towards the tail, and the median lies between the mode and the mean. The 2nd figure, with negative skewness, behaves the same way in mirror image.

In a skewed distribution, the relative positions of the three averages are always predictable: the mode lies below the peak, the mean is pulled out in the direction of the tail, and the median lies between the mode and the mean. The greater the skew, the greater the distance between the mean and the mode.

Almost all the distributions we meet in practice have some skewness.

If we assume that the two figures above represent the incomes of families in 2010 and 2020, and combine them, we get what is called a bimodal distribution, one with two peaks. Such a graph suggests that two different groups are involved.

Normal Distribution

In school days, every year teachers used to collect data about the students in class, such as their weights. Let's assume we have one such dataset. The 1st figure shows the distribution of weight (pounds) for 50 students in a class, and the 2nd figure shows the weights (pounds) of 500 students in the school.



We can see in the 1st figure that there are peaks and valleys but if we draw a rough sketch on the histogram joining all the peaks it will appear like a bell curve.

In the 2nd figure the peak-and-valley effect disappears, because we now have more data. As the number of students in the sample increases, the curve of the distribution becomes smoother and smoother, ending up as a bell-shaped curve.

The bell-like shape of the distribution above follows what is called the normal curve of distribution. The curve is perfectly symmetrical, and its mean, median and mode all lie at the centre; if we cut the curve vertically at the centre, we get equal areas on either side. The normal curve may be tall and thin, or short and spread out very flatly, depending on the standard deviation.

When we call this the ‘normal’ curve, we do not mean that it is the usual curve. Rather, ‘norm’ is being used in the sense of a pattern or standard – ultimate, idealised, ‘perfect’- against which we can compare the distributions we actually find in the real world.

In the real world it is impossible to get a perfect normal distribution, since no sample contains infinitely many data points. Still, even a small sample can produce a fair bell-shaped curve. The distribution can look as if it is trying to be normal, which suggests that the sample comes from a large population whose distribution could indeed be described by a normal curve.

In this case, we can interpret the sample using certain powerful characteristics of the normal distribution. The normal curve is characterised by the relationship between its mean and its standard deviation. Using the mean and the standard deviation, we can state the proportion of the population that will lie between any two values of the variable. We can then regard any given value in the distribution as being 'so many standard deviations' away from the mean, using the standard deviation as a unit of measurement.

For example, consider the 2nd figure again: the mean is 127.2 pounds and the standard deviation is 11.9 pounds. A student weighing 139.1 pounds is 1 standard deviation above the mean, a student weighing 115.3 pounds is 1 standard deviation below the mean, and so on.


Thus any value in a distribution can be re-expressed as so many standard deviations above or below the mean, whether the distribution is normal or not. But if it is normal, we can use our knowledge of the normal distribution to find how many observations lie between any two given points. The standard deviation slices a normal distribution into standard-sized slices, each containing a known percentage of the total observations.

Portion Under the Normal Curve

If we mark standard deviations on the above figure, we find that 68% of the observations are enclosed between 1 standard deviation below and 1 standard deviation above the mean; that accounts for roughly two thirds of the area under the curve. The remaining 32% lies outside 1 standard deviation.


Similarly, 95% of the data lies within 2 standard deviations of the mean, and 99.7% within 3 standard deviations.
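These percentages can be checked straight from the normal cumulative distribution function, using only the Python standard library; the 127.2 and 11.9 figures are the ones from the weight example above:

```python
# Checking the 68-95-99.7 rule from the normal CDF:
# Phi(z) = 0.5 * (1 + erf(z / sqrt(2))), standard library only.
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Proportion of a normal population within k standard deviations of the mean
for k in (1, 2, 3):
    print(f"within {k} sd: {phi(k) - phi(-k):.4f}")

# The weight example above: 139.1 pounds is 1 sd above a mean of 127.2
z = (139.1 - 127.2) / 11.9
print(round(z, 3))
```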


We now know that if the tail is to the right, the distribution is positively skewed, and if the tail is to the left, it is negatively skewed. The positions of the mean, median and mode can be predicted relative to the peak. It takes an infinite number of observations to get a perfect normal distribution, and since infinite data is impossible in real life, real distributions are only close to the ideal normal distribution.

Standard deviation is a great measure of the dispersion of data about the mean. We now know that 68% of the data lies within 1 standard deviation of the mean, and in real life many distributions are reasonably close to those predicted by the theoretical normal curve.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

Measure of dispersion

February 15, 2022


In the previous article we discussed measures of central tendency and how they give us some insight into the centre of the data. But a measure of the centre alone is not enough to describe the distribution of a quantitative variable adequately: it tells us nothing about the variability of the data. As an example, consider this hypothetical data showing the distribution of middle-class incomes in Delhi (blue) and Mumbai (orange).


As you can see from the diagram, both distributions are symmetric and both have their mean at INR 50,000. However, monthly incomes in Delhi run from INR 30,000 to INR 70,000, whereas those in Mumbai run from INR 10,000 to INR 1,00,000. Incomes in Delhi are more similar to one another, while incomes in Mumbai vary more. A simple way to describe this is with the range.

In Delhi the range is INR 70,000 – INR 30,000 = INR 40,000. In Mumbai the range is INR 1,00,000 – INR 10,000 = INR 90,000. The range is a rough and ready measure of dispersion. However, we cannot fully put our trust in range alone. It only depends on two extreme values and this might result in error if there are outliers in the data.


As an example, consider the hypothetical distributions representing the marks of students in two different sets. If you calculate the range of each set, you will find it is 10 for set A and also 10 for set B. But in set B, apart from the two extreme values, only 3 distinct values were observed (12, 13 and 14), whereas 9 distinct values were observed in set A. Both sets end up with the same range purely because of the influence of the outliers in set B.

Standard Deviation and variance

One way of getting a fairer measure of dispersion is the standard deviation. The standard deviation of a distribution indicates a kind of average amount by which all the values deviate from the mean: the greater the dispersion, the bigger the deviations, and the bigger the standard deviation.

The deviation of an observation x from the mean μ is (x − μ), the difference between the observation and the mean.

Consider the data below in the table.


The mean of Set X is 30 and that of Set Y is 33. From the table we can see that the values in set X are more dispersed than those in set Y, so it is easy to conclude that the standard deviation of set X is greater than that of set Y.

Let’s calculate the deviation of set Y.


Now, if we take the average of the deviations, we find that they add up to zero, so averaging the raw deviations is a bad idea. To overcome this difficulty we square each deviation and add the squares, which gets rid of the negative values. Dividing this sum of squared deviations by the total number of observations gives us the variance.



Variance has its own disadvantage: if the original values are in some unit, say x, then the variance is in squared units, x².

To get the variance in the same units as the observed value, we take the square root of variance and this is what we call standard deviation.

Standard deviation of Set Y = sqrt(36) = 6
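The whole calculation, deviations summing to zero, squaring, averaging, and taking the square root, can be sketched in a few lines of Python (the data values here are illustrative, not Set Y):

```python
# Population variance and standard deviation from deviations about the
# mean, using only the standard library (illustrative values).
from statistics import fmean

data = [27, 30, 33, 36, 39]

mean = fmean(data)                      # 33.0
deviations = [x - mean for x in data]   # -6, -3, 0, 3, 6: they sum to zero

# Squaring removes the signs; the average of the squares is the variance
variance = sum(d * d for d in deviations) / len(data)
std_dev = variance ** 0.5               # back in the original units

print(variance, std_dev)
```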

Interquartile range

Now look at the 2nd figure again. We know the range in both distributions is the same, and it would not have been had we ignored the outliers in set B. This brings us to another measure of dispersion, a kind of 'mini-range' taken near the centre of a distribution, thus avoiding outliers.

This range is based on what are called the quartiles of the distribution. Quartiles are the values that cut the observations into four equal lots, just as the median cuts them into two equal lots.


As the diagram shows, there are three quartiles: Q1 (the 1st quartile), Q2 (the 2nd quartile, which is the same as the median) and Q3 (the 3rd quartile). The difference between Q1 and Q3 is the mini-range, also called the interquartile range.

Let’s look at this figure:


Since there are 16 observations, we cut off the bottom 4 and the top 4. So Q1 is at 9, Q3 is at 16, and the interquartile range is 7. The interquartile range undoubtedly gives a better indication of the dispersion than the full range.

To summarise, the different ways to measure variability in distribution are range, standard deviation, variance and interquartile range.

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]


How weird can dimensions get?

February 1, 2022

Let us put some oranges in a box.

Consider the problem here. We have a box with a balloon in each corner, as shown in the figure below. We have to place an orange in the middle of the box so that it is hemmed in by the balloons and the box. So what is the maximum size of orange that can be placed in the middle?


Let's start with a one-dimensional box and balloons, i.e. line segments. From the figure below we can see there is no space left for the orange, so in one dimension we cannot put the orange in the box. Let's move on to two dimensions.


In two dimensions the box is just a square and the balloons are circles. From the figure below we can see there is some space in the middle where we can place the orange. With a bit of high-school geometry we get about 0.4 (more precisely √2 − 1 ≈ 0.414) as the maximum radius of the orange, which is also a circle in two dimensions, that can fit inside the box.


Now let's try this in three dimensions. In 3D the space is harder to see, but I have done the work for you: you can fit an orange with a maximum radius of about 0.7 (√3 − 1 ≈ 0.732).


Now you can see the trend: as the number of dimensions increases, the size of the orange increases too. The graph below shows how the orange's radius grows as the dimension increases.


Now you can see that in nine dimensions the radius of our hyperorange is 2, which means its diameter equals the side length of the box, despite the orange being surrounded by 512 hyperspheres of radius 1, one in each of the 512 corners of this 9D hypercube. If the orange is as large as the box, how are the balloons protecting anything?

But it doesn't end here; it becomes even crazier in 10 dimensions. In 10D the hyperorange outgrows the hypercube that was meant to contain it: it extends past the walls of the box, even though we constructed it to fit inside, with the balloons still sitting in every corner. It is difficult for our 3-D brains to picture 10 dimensions (or more), but the equations check out: the orange is simultaneously inside the box and sticking out past it.
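The arithmetic behind all of this is one line. Assuming the box runs from −2 to 2 on every axis with unit-radius balloons centred at (±1, …, ±1), each balloon centre sits √d from the origin, so the orange's radius in d dimensions is √d − 1:

```python
# Radius of the central "orange" in d dimensions, for a box of side 4
# with unit-radius "balloons" centred at (+/-1, ..., +/-1).
from math import sqrt

def orange_radius(d: int) -> float:
    # Distance from the origin to a balloon centre, minus the balloon radius
    return sqrt(d) - 1

for d in (1, 2, 3, 9, 10):
    print(d, round(orange_radius(d), 3))
```

The values for d = 2 (~0.414) and d = 3 (~0.732) match the figures above; at d = 9 the radius is exactly 2, the box's half-width, and at d = 10 the orange pokes outside the box.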

The moral is that our intuition can fail us when we get into a space with many dimensions. This is important because we work with data that has tens of thousands of features.

Any time we work with data that has more than 3 features, we have entered the world of higher dimensions, and we should not reason by analogy with what we know from our experience of two and three dimensions. We need to keep our eyes open and rely on math and logic, rather than intuition and analogy.

Curse of Dimensionality

Consider the heights of 10 people. We plot those heights on a 1-dimensional graph, i.e. a line, shown below, using different colours to mark children and adults. Say that children are those under 5 feet tall and adults those over 5 feet; then 5 feet is a boundary that separates adults from children.


But this might not be enough, so we consider another feature, weight, and say that children normally weigh less than 55 kg. Now we can properly separate adults from children. The problem is that we can no longer decide which boundary separates them most efficiently: there are many decision boundaries that classify adults versus children.


If we consider yet another feature, say experience measured on a scale of 10, we get a graph that looks like this:


Do you see the problem now? As we keep increasing the number of dimensions, or features, we get more and more empty space, which affects our classifier. With so much empty space between the two classes, the algorithm may not be able to find the right decision boundary, and the next time it receives new data it may put that data on the wrong side.

How to avoid the curse of dimensionality?

Regrettably, there is no fixed rule for how many features should be used to avoid this problem. It depends on the amount of training data, the complexity of the decision boundary, and the type of classifier used.

Ideally, if we had an infinite amount of training data the curse of dimensionality would not arise, as we could use any number of features. Roughly speaking, if it takes N data points to cover the space in one dimension, it takes on the order of N^2 points in two dimensions, N^3 in three dimensions, and so on.

Furthermore, overfitting will occur both when estimating relatively few parameters in a highly dimensional space, and when estimating a lot of parameters in a lower dimensional space.

If we have few data points, it is always better to work with fewer features. Dimensionality reduction is used to tackle the curse of dimensionality: its techniques are very useful for transforming sparse features into dense ones, and it is also used for feature selection and feature extraction.

So what is dimensionality reduction?

In the example above of classifying people as children or adults, we saw that as we increase the number of dimensions, classification accuracy starts to decrease. We can also see there is some correlation between weight and height, which makes one of them redundant, so we can drop either feature. Hence, dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables.

Some Algorithms used for Dimensionality Reduction

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
  • Generalised Discriminant Analysis (GDA)
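As a hedged sketch of the first of these, here is PCA with scikit-learn on synthetic data whose 4 features really only carry 2 dimensions of information (the dataset, the random seed and the component count are all arbitrary illustrations):

```python
# PCA with scikit-learn: project 4 correlated features down to 2 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 100 samples whose 4 features are linear combinations of 2 hidden factors
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 2))])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # data compressed to 2 columns
print(pca.explained_variance_ratio_.sum())  # close to 1: little lost
```

Because the 4 columns were built from 2 hidden factors, 2 principal components retain essentially all of the variance, which is the "redundant features" situation described above.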

Advantages of dimensionality reduction

  • It helps in data compression by reducing features
  • It makes machine learning algorithms computationally efficient

Disadvantage of dimensionality reduction

  • It may lead to some amount of information loss
  • Accuracy may be compromised

Do follow our LinkedIn page for updates: [ Myraah IO on LinkedIn ]

Introduction to Descriptive Statistics and Measures of Central Tendency.

January 24, 2022

Consider two unrelated questions:

  • Who was the best Indian cricket team captain in the history of Indian cricket?
  • What is happening to the economic health of India’s middle class?

The 1st question is trivial: cricket enthusiasts can argue about it endlessly. But the 2nd question is profoundly important, because the middle class is the backbone of the Indian economy.

Now using these two questions I’m going to illustrate the strengths and limitations of descriptive statistics, which are the numbers and calculations we use to summarise raw data.

Let’s go ahead with the trivial question first.


Above is a screenshot of a little chat I had with the Windows chat box. When I asked it to show me more data, it simply replied by asking me to search for the data on my own. Anyway, the point here is this: I could go on reciting raw data about M.S. Dhoni's record, which would be hard to digest given that he played 350 ODIs and 24 World Cup matches. Or I can simply say that his winning average was 58.82% at the end of his career. That is a descriptive statistic, or 'summary statistic'.


The winning average is a gross simplification of Dhoni's 350 ODIs. It is easy to understand, but limited in what it can tell us.

Now, moving on to the question about the economic health of the Indian middle class. To answer it we need the economic equivalent of a winning average: a measure that is simple yet reasonably accurate, and that tells us how the economic well-being of the average Indian has changed over the last few decades. A reasonable, though imperfect, answer is to measure the change in India's per capita income over the course of a generation, roughly 30 years. Per capita income is simply total income divided by total population.

In 1990 the average income in India was INR 5,882, and in 2020 it was INR 1,44,476. It seems we got pretty rich.


But here's the twist: these figures are misleading. Adjusted for inflation, INR 5,882 in 1990 is equal to about INR 25,950 in current rupees. Yet another big problem is that the average income in India is not the same as the income of the average Indian.
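This adjustment is simple arithmetic; here is a minimal sketch using the article's figures (the inflation-adjusted 1990 value of INR 25,950 is taken as given):

```python
# Per capita income figures from the article (INR per year)
income_1990_nominal = 5_882     # as reported in 1990
income_1990_real = 25_950       # 1990 income expressed in current rupees
income_2020 = 1_44_476          # underscores follow Indian digit grouping

# The nominal comparison exaggerates the change...
nominal_growth = income_2020 / income_1990_nominal

# ...while the inflation-adjusted (real) comparison is far smaller.
real_growth = income_2020 / income_1990_real

print(round(nominal_growth, 1))  # 24.6x on paper
print(round(real_growth, 1))     # 5.6x in real terms
```

Even the real growth of roughly 5.6x says nothing about how that income is distributed, which is the next problem.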

The number above tells us nothing about how money is distributed across classes. There are four income groups in India: low, lower-middle, upper-middle and high. The top 1% of the population can raise per capita income without putting any money in the pockets of the other 99%, so the average income can go up without helping the average Indian.

From cricket to income, the most basic task when working with data is to summarise a great deal of information. Descriptive statistics give us a manageable and meaningful summary of the underlying phenomenon. Descriptive statistics can be like online dating profiles: technically accurate and yet pretty darn misleading.

In this article I'll discuss measures of central tendency.

Consider the below data:



This data comes from the International Energy Agency, which reported per capita CO2 emissions by country (the total CO2 emissions for the country divided by its population size) for the year 2011. Global warming is largely a result of human activity that produces carbon dioxide (CO2) emissions and other greenhouse gases; CO2 emissions from fossil fuel combustion come from electricity, heating, industrial processes, and fuel consumption in automobiles. For the nine largest countries in population size (which make up more than half the world's population), the values, in metric tons per person, were: 0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6 and 16.9.

From the above table we can see that the average (mean) emission across these countries is about 4.6 metric tons per person.


But we can also see that only 3 countries emit more than 4.6 metric tons of CO2. The mean can be highly influenced by an outlier: an unusually small or unusually large observation that falls well above or well below the overall bulk of the data. An outlier in the data calls for more investigation.


Since the mean does not give us an accurate picture, we look at the median: the middle value of the data when sorted in ascending or descending order. For example, arrange the emissions in ascending order: 0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9.

The fifth value, 1.8, is the median, since five is the middle of nine. The median would not change even if the US started emitting 90 metric tons of carbon per person, but the mean would shoot up.
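These numbers are easy to verify with Python's standard statistics module, including the thought experiment of the largest emitter jumping to 90 metric tons:

```python
import statistics

# Per capita CO2 emissions (metric tons per person) for the nine countries
emissions = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9]

print(round(statistics.mean(emissions), 1))  # 4.6, the article's average
print(statistics.median(emissions))          # 1.8, the fifth sorted value

# Replace the largest value with 90: the mean jumps, the median does not.
extreme = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 90.0]
print(round(statistics.mean(extreme), 1))    # 12.7
print(statistics.median(extreme))            # still 1.8
```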

Whether the mean is greater or less than the median tells us a lot about the shape of the distribution. In our example the distribution is right skewed, since the mean is greater than the median. Because the mean is the balance point, an extreme value on the right side pulls the mean toward the right tail. Because the median is not affected, it is said to be resistant to the effect of extreme observations. The median is resistant. The mean is not.


If the mean is less than the median, the distribution is left skewed, and if the mean equals the median, the distribution is symmetric.

The median is not affected by outliers: it is determined by having an equal number of observations above and below it.

The mean uses all the numerical values in the data, whereas the median depends only on the ordering of the observations, not on how far they fall from the middle.

From these properties, you might think it is always better to use the median rather than the mean. That's not true. If a distribution is highly skewed, the median is usually preferred because it better represents what is typical; if the distribution is close to symmetric or only mildly skewed, the mean is usually preferred because it uses the numerical values of all the observations.

So both the mean and the median matter: each gives us insight into the centre of the data in its own way.

But what is the mode, then?

The mode is the value that occurs most frequently. It describes a typical observation in terms of the most common outcome. The concept is most often used to describe the category of a categorical variable with the highest frequency. With quantitative variables, the mode is most useful for discrete variables taking a small number of possible values.
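A quick sketch with Python's statistics module; the blood-group list is made-up illustrative data:

```python
import statistics

# Mode of a categorical variable: the most frequent category
blood_groups = ["O", "A", "B", "O", "AB", "O", "A"]
print(statistics.mode(blood_groups))  # O

# With quantitative data there may be no single mode. multimode returns
# every value tied for the highest frequency; for the CO2 data each value
# occurs once, so all nine come back and the mode is uninformative.
emissions = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9]
print(len(statistics.multimode(emissions)))  # 9
```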


For the CO2 data there is no mode, as every value occurs exactly once, so consider the histogram above instead. It shows the number of students against the number of hours of TV they watch per day. Here 4, 8, 22, 32, 8 and 6 are the frequencies, and the mode is the interval with the highest frequency: 32 students watch 4-5 hours of TV daily. We can read off the other values similarly.

The mode need not be near the centre of the distribution; it may even be the largest or the smallest value. Thus it is somewhat inaccurate to call the mode a measure of centre, but it is often useful to report the most common outcome.

So we can conclude that every measure of central tendency is important, as each gives us insight into the centre of the data in a different way.


Basic Statistics For Data Scientists

January 22, 2022

Statistical studies are carried out everywhere today, whether for growing a business, reporting the weather or making vaccines. Statistics helps us do things more efficiently.

Consider a scenario where you are the manager of a restaurant and it's your 1st day on the job. Initially you have no idea about the flow of customers, how many staff you need to handle them, or how much food to prepare without wastage. For these reasons, you do a bad job in your initial days. But you analyse things and start to optimise from experience.


For example, initially you make a lot of food on Monday because you expect the flow of customers to be as steady as it was the day before. Instead, the flow is low because it is the start of the working week, resulting in wasted food and manpower. Noting this, the next Monday you tell your staff to make less food. As the days go by, you get better ideas and do things more smartly.

You use statistics to analyse and optimise things in order to increase the restaurant's revenue and minimise losses: which age group of customers is coming in, what the best seller on the menu is so that it's always in stock, and so on. You keep collecting data and analysing it in order to optimise things.

Statistics is a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data. Consider how doctors use it: you describe your symptoms, the doctor predicts a few likely problems in your body, advises you to run some tests, and based on the results gives you a prescription. This is how doctors use statistics to cure their patients.


We have learned about data in our previous article and statistics is the science of learning from data.

There are two types of variables in statistics:

  • Numerical (age, number of cars, etc.)
  • Categorical (demographic information of a population: gender, disease status)

Numerical variables are further divided into discrete (counts, such as the number of dogs in a house) and continuous (e.g. weight, height) variables. Categorical data is further divided into nominal and ordinal data.


Statistics is divided into two parts: descriptive statistics and inferential statistics. In this article we will quickly review what falls under each umbrella. Let's start with descriptive statistics.

1) Descriptive Statistics.

Let's assume you own a retail shop and, fortunately, you have collected data on the sales of the last year. Looking at the data, what is the 1st thing that comes to mind? Of course, the profit you made. But along with profit you'll also look for other important things: average profit per month, highest-selling product, lowest-selling product, seasonal products, etc. Here you are extracting information that will help you grow your business.

While extracting this information from the data you will use the mean, median, mode or various graphs. This is descriptive statistics at work. Descriptive statistics present quantitative descriptions in a manageable form and help you understand data effectively and efficiently. The three main types of descriptive statistics are:

  • Distribution, which deals mainly with the frequency of each value
  • Central tendency, which deals with averages of the values
  • Dispersion, which concerns how spread out the values are

Consider 9 people sitting next to each other in a bar, each earning INR 1 lakh a month. The average salary in the room is INR 1 lakh, and if any one of them leaves, the average does not change. Now suppose Mukesh Ambani, who let's say earns INR 100 crore a month, enters the bar and sits next to them. The average salary of the room jumps to INR 10,00,90,000.


Now none of the nine people earns more than a lakh rupees, yet the average is about 10 crore. This is quite misleading: it doesn't tell us what a typical person in the room earns. That is the problem with the mean: it is affected by extreme values. So instead of the mean we look at the salary of the person sitting in the middle. With ten people there is no single middle person, so the median is calculated by adding the salaries at positions 5 and 6 and dividing by 2, which gives INR 1 lakh.

The median is not affected by extreme values. Similarly, if we look at the frequency of salaries, we find that 9 people earn INR 1 lakh a month; this most frequent value is the mode.
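The whole bar example can be checked in a few lines with Python's statistics module:

```python
import statistics

# Nine patrons earning INR 1 lakh a month, then Mukesh Ambani
# (INR 100 crore a month in the article's example) walks in.
salaries = [1_00_000] * 9 + [100_00_00_000]

# The mean is pulled up to INR 10,00,90,000 by a single extreme value...
print(statistics.mean(salaries))

# ...while the median (average of the 5th and 6th sorted values)
# stays at INR 1 lakh, and the mode is also INR 1 lakh.
print(statistics.median(salaries))
print(statistics.mode(salaries))
```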

Now let's take another example. For some reason you are not feeling well, so you go to the doctor, run some tests and find that your XYZ count (a made-up blood chemical) is 200. You instantly rush to the internet and find that the ideal XYZ count for your age is 180. Your count is 20 points higher than the ideal level. If you don't know statistics, you might inform your near and dear ones, or take a vacation to enjoy your remaining days.

None of this is necessary. When you call the doctor's office back to arrange for your hospice care, the physician's assistant informs you that your count is within the normal range. "But how can that be? My count is 20 points higher than average!" you yell into the receiver. "The standard deviation for XYZ count is 40," says the technician, leaving you confused.


The natural variation in XYZ count is 40 points, so many people have counts higher than the ideal level; only a count well beyond this natural variation is a matter of concern. How do we set those limits? The standard deviation is a measure of dispersion: it reflects how tightly the observations cluster around the mean. For many typical distributions of data, a high proportion of the observations lie within one standard deviation of the mean, that is, in the range from one standard deviation below the mean to one standard deviation above it.
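A minimal sketch of that "one standard deviation" band, using the made-up XYZ figures from the story:

```python
# Made-up figures from the XYZ count story
mean_count = 180   # ideal / average XYZ count for your age
sd = 40            # standard deviation of XYZ count
your_count = 200

# The band within one standard deviation of the mean
low, high = mean_count - sd, mean_count + sd
print(low, high)                  # 140 220
print(low <= your_count <= high)  # True: nothing to panic about
```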

Graphs are an effective way to analyse data. You can stare at tabular data for hours and get nothing out of it; plot those points on a graph and you can grasp a huge amount of information in a minute.

Depending on the number of variables involved, you can perform univariate, bivariate or multivariate analysis.

Univariate analysis describes the distribution of a single variable, including its central tendency and dispersion. For visualisation you can use a histogram, and the shape of the distribution can be described with skewness and kurtosis.
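As a sketch, the right skew of the per capita CO2 emissions data used earlier can be quantified with Pearson's moment coefficient of skewness, computed by hand here since the standard statistics module does not provide it:

```python
import statistics

# Per capita CO2 emissions (metric tons per person) used earlier
data = [0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9]

mean = statistics.fmean(data)
sd = statistics.pstdev(data)

# Pearson's moment coefficient of skewness:
# positive => right-skewed, negative => left-skewed, near 0 => symmetric
skewness = sum((x - mean) ** 3 for x in data) / len(data) / sd ** 3

print(skewness > 0)  # True: the mean (about 4.6) sits well above the median (1.8)
```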

Bivariate and multivariate analysis are used when we have more than one variable in our data. Bivariate analysis checks whether there is any relationship between two variables, which can be examined with contingency tables, scatter plots, etc.

2) Inferential statistics

Everyone loves to be entertained, and one of the major ways is watching TV. Everyone has their favourite TV shows. If you follow Indian television, each week BARC (Broadcast Audience Research Council) releases a list of the top 10 shows based on their TRP (Television Rating Point). How do you think they figure out the top shows?

They don't call or message every Indian to ask for their favourite show: the population of India is around 140 crore, and that would mean dealing with 140 crore data points each week. Instead they have installed a device called a people meter in the homes of, say, 2,00,000 randomly chosen people in different regions. The show creators and TV channels have no information about who has a people meter. BARC observes what this audience watches, which shows are watched the most, and so on, and based on that data releases the weekly top 10.


Here 140 crore is the population size and 2,00,000 is the sample size. BARC studies data collected from 2,00,000 people and draws inferences about the 140 crore population. This is what inferential statistics is all about: studying a small sample to understand the population.

Inferential statistics uses sample data to make inferences, which is cost-effective and far less tedious than collecting data from the entire population.

BARC releases this top 10 list with a confidence interval, which means that if the study were conducted many times with a completely new sample each time, most of the studies would produce an estimate lying within the same range of values.
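A small simulation makes the idea concrete. This sketch assumes a hypothetical true share of 30% and uses the usual normal-approximation interval; the numbers are illustrative, not BARC's actual method:

```python
import math
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical: 30% of all viewers watch a given show. Instead of polling
# 140 crore people, draw a people-meter-style sample of 2,000 viewers.
TRUE_SHARE = 0.30
n = 2_000
sample = [1 if random.random() < TRUE_SHARE else 0 for _ in range(n)]
estimate = sum(sample) / n

# 95% confidence interval for the share (normal approximation)
margin = 1.96 * math.sqrt(estimate * (1 - estimate) / n)
print(f"{estimate:.3f} +/- {margin:.3f}")
```

Rerun this with new seeds and the interval shifts, but it covers 0.30 about 95% of the time, which is exactly the repeated-sampling interpretation described above.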

Now suppose BARC forms a hypothesis: people above 25 years of age watch "Anupama". This must be tested, so BARC collects more data on age groups and analyses it. If the analysis supports the claim, BARC accepts the hypothesis; otherwise it rejects it. This is hypothesis testing.

Hypothesis testing makes use of inferential statistics to analyse relationships between variables and to make population comparisons through the use of sample data. It falls under the category of statistical tests. Other kinds of tests include correlation tests and comparison tests: Pearson's r, Spearman's r and the Chi-square test are examples of correlation tests, whereas the t-test and ANOVA are examples of comparison tests. We will explore this topic in a coming article on inferential statistics.
