Welcome to this video on Digital Footprints.
In it you’ll learn about the ‘Digital Footprint’, why our Digital Footprint is important to us personally and how it can have a massive impact on Cyber Security.
We will also cover a large number of resources that could be helpful to you in cyber-security investigations. We will focus on Living life online, Online identity, Personal data, Finding information online, The web, Search engines, Communities, forums and groups, as well as the Deep web.
The advent of the World Wide Web and the relentless rise of social media platforms mean that every connected person in the world is now leaving some sort of trace of their online activities.
Even the simplest online interactions can create complex trails showing where an individual is visiting, has visited previously and possibly where they are going.
The Hollywood portrayal of these trails implies that it is a matter of seconds to follow them right back to their origin – but of course, this isn’t strictly true!
So, how do we begin to build a Digital Footprint, or Digital Profile?
Firstly, an entity becomes associated with an individual. This could be anything from a laptop or mobile phone through to the networks we connect to, the games we play online or the information we post or engage with on social media.
The association of these entities to us is what builds the profile – we have left our digital footprints there for others to follow.
There are a few things that can be included in a Digital Footprint.
Onscreen you can see examples of the type of information regularly shared with online organizations, but the list is by no means exhaustive. Anything shared with an online entity can be shared with a much wider audience than you might have intended.
Social media in particular encourages us to be far more open to sharing everything about ourselves, and this type of information can be very useful to a wide range of entities, some of whom may not have your best interests at heart.
It is in this area that your own good intentions to be aware of data-security can be thwarted by friends and family members, who can unintentionally share information about you.
Domain registration records give details of who has registered a particular web domain.
In the not too distant past, using a Whois query against a target domain would often result in useful information being disclosed, including e-mail addresses or phone numbers for the registrant individual or organization.
However, the EU’s General Data Protection Regulation (GDPR) has resulted in internet domain registrars hiding domain registration information to avoid fines for non-compliance, so a Whois query nowadays isn’t likely to yield directly useful information.
It should be noted that Chinese websites ignore GDPR, and a domain search on, for instance, the baidu.eu website will often disclose registrant information that is not available elsewhere.
As well as possibly being able to correlate registrant information to any given domain, the sites shown onscreen can allow the user to extract further information from DNS records, where it is available.
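A Whois lookup of the kind described above can be sketched in a few lines of Python. This is only an illustration: the default server name, the function names and the sample reply are assumptions made for the example, and real replies vary in layout from one registry to another.

```python
import socket


def whois_query(domain: str, server: str = "whois.iana.org", timeout: float = 10.0) -> str:
    """Send a raw Whois query (TCP port 43) and return the text reply."""
    with socket.create_connection((server, 43), timeout=timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_whois(text: str) -> dict[str, list[str]]:
    """Collect 'key: value' lines from a Whois reply into a dictionary."""
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comment lines and lines without a key.
        if not line or line.startswith(("%", "#")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value:
            fields.setdefault(key.strip().lower(), []).append(value)
    return fields


# Made-up reply, shaped like a record with redacted contact details.
SAMPLE_REPLY = """\
% Hypothetical Whois reply for illustration only
Domain Name: example.com
Registrar: Example Registrar, Inc.
Registrant Name: REDACTED FOR PRIVACY
Registrant Email: REDACTED FOR PRIVACY
Name Server: ns1.example.com
Name Server: ns2.example.com
"""

record = parse_whois(SAMPLE_REPLY)
print(record["registrant email"])
print(record["name server"])
```

Calling whois_query("example.com") would send the same query over the network. The parser is kept separate so that it can be tried on saved replies without contacting any server.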
Earlier I mentioned the disparity between the Hollywood portrayal of tracing someone online and the reality of doing so.
The two websites mentioned onscreen give very clear examples of this disparity.
Both attempt to match a physical geographical location to a virtual internet location or address. In the vast majority of instances this matching process will be approximate, at best.
Many factors can affect the allocation of an IP address, so any correlation to a physical location must be treated with caution.
For example, someone using a Virtual Private Network, or VPN, could be on one side of the world with all of their internet activity going out onto the wider network via an IP address that is allocated to a country thousands of miles away.
In March 2013, Google knew of 30 trillion pages on the Web, but had not indexed them all.
Google has since updated its "How Search Works" page, changing the number of pages it knows of from 30 trillion to 130 trillion, an increase of 100 trillion pages.
Search engines have become vital for finding information in an easy to view manner. Without search engines, everyday use of the internet would be severely hampered if not impossible.
The constantly changing nature of the Internet, and the Web, makes estimating its size very difficult. It is a widely-held misconception that Google captures everything on the Web. What is interesting is that in 2017 Google indexed only 100 billion webpages per month.
On these figures, Google indexes at most 2.5% of the whole Web, and the real figure could be a lot smaller than that. Many people don’t realize this. What they also don’t realize is that different search engines don’t necessarily index the same pages. This means that webpages found using Yandex, Bing, Yahoo! or DuckDuckGo may not appear on Google at all.
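As a rough check, the two figures quoted in this video can be set against each other. The transcript does not say how the 2.5% ceiling was worked out, so the sketch below is not that calculation. It simply divides the 100 billion pages indexed per month by the 130 trillion known pages, and the names used are made up for the example.

```python
KNOWN_PAGES = 130e12        # pages Google is said to know of
INDEXED_PER_MONTH = 100e9   # pages said to be indexed per month in 2017


def share_percent(indexed: float, known: float) -> float:
    """Return the indexed pages as a percentage of the known pages."""
    return 100.0 * indexed / known


one_month = share_percent(INDEXED_PER_MONTH, KNOWN_PAGES)
# Assumption for illustration: no page is indexed twice during the year.
one_year = share_percent(12 * INDEXED_PER_MONTH, KNOWN_PAGES)

print(f"one month of indexing: {one_month:.3f}% of known pages")
print(f"twelve months of indexing: {one_year:.2f}% of known pages")
```

Both results fall below 2.5%, which fits the point that the real figure could be a lot smaller.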
The rest of the Web, the approximately 96% not indexed by Google or the other search engines, is what is referred to as the Deep Web.
The Web is also only one of the services that run over the Internet. If you limit yourself to Google search, you could be missing the vast majority of the intelligence that may be online.
The analogy of an iceberg is very fitting when talking about the different views of the World Wide Web. The surface view is what we interact with, every time we visit a website. But there is so much more lurking below the surface.
There is a distinction between the Deep Web and the Dark Web. The media will often use the terms interchangeably, but they do not mean the same thing at all.
The Deep, or Invisible, Web references all online information that is not indexed by search engines such as Google. This information will usually be perfectly legitimate, but not necessarily something that should be available to the entire connected community.
The Dark Web, or Dark Net, can be thought of as the dark underbelly of the World Wide Web. Accessing this network usually requires specific software or connections, and it is here that some of the worst aspects of humanity can be found, engaging in all manner of criminality.
It must be noted, however, that the technologies involved in the Dark Web do have perfectly legitimate uses. In fact, one of the key technologies involved, The Onion Router, or Tor, was developed by the United States Navy to facilitate secure and anonymous communications. Many regimes around the world have very censorious attitudes to Internet access, and the use of Tor can allow pro-democracy activists to have unmonitored web access.
How do search engines know which sites match what we are looking for?
The search engines deploy automated programs that spider, or crawl, across webpages that are available on the surface web, collecting various pieces of information relating to the pages.
This information is fed into a proprietary algorithm (each search engine will use its own algorithm) which allows them to rank pages on how they match up to any particular search term submitted.
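The crawl-then-rank idea can be sketched as follows. This is a toy illustration, not how any real search engine works: the three pages are hypothetical and held in memory, and counting how often a term appears stands in for the proprietary ranking algorithms mentioned above.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical three-page "web" held in memory so the sketch needs no network.
PAGES = {
    "/home": '<a href="/news">News</a> <a href="/about">About</a> welcome to the footprint site',
    "/news": '<a href="/home">Home</a> digital footprint news and footprint advice',
    "/about": "about this site",
}


class PageParser(HTMLParser):
    """Collects the links and the visible words of one page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []
        self.words: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href" and value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(start: str) -> dict[str, list[str]]:
    """Breadth-first crawl from start, returning the words found on each page."""
    index: dict[str, list[str]] = {}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in index or url not in PAGES:
            continue  # already visited, or outside our toy web
        parser = PageParser()
        parser.feed(PAGES[url])
        parser.close()  # flush any text still buffered by the parser
        index[url] = parser.words
        queue.extend(parser.links)
    return index


def rank(index: dict[str, list[str]], term: str) -> list[tuple[str, int]]:
    """Toy ranking: order pages by how often the term appears on them."""
    scores = [(url, words.count(term.lower())) for url, words in index.items()]
    return sorted((s for s in scores if s[1] > 0), key=lambda s: s[1], reverse=True)


index = crawl("/home")
print(rank(index, "footprint"))
```

The crawler only ever reaches pages that are linked from somewhere it has already visited, which is one reason so much content never makes it into a search index.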
There are a huge number of search engines available, but the dominant player is Google. In fact, we now use the name of the company as a verb – ‘just Google it!’
Whilst Google is by far the largest player in the search engine market, we have already seen that there are a huge number of alternatives, and we will look at some of these over the next few slides.
Some of those listed in the top 15 are more prevalent in certain geographical locations, such as Baidu in China and Yandex in Russia.
Wolfram|Alpha is a slightly different take on the search engine concept. Rather than crawling and indexing webpages, it uses complex algorithms and curated data to compute answers to questions, instead of returning a list of potentially relevant webpages.
While Google is the most used search engine, there are longstanding concerns about its approach to privacy and tracking of users’ search habits.
DuckDuckGo.com is an example of a search engine designed to allay privacy concerns, but also to mitigate the ‘search bubble’ effect, where search results become increasingly tailored to the user, limiting what the user sees.
Keyword.io is a keyword-based search engine that generates suggested keywords for your search, which can improve search results.
Most people use the simple front-page Google search bar, type in no more than two key words and typically never go further than the first page of results.
However, Google has a lot more going on behind the scenes that makes it a considerably more powerful tool if you know how to use it.
There are a number of tips and tools that make carrying out online searches much easier and more efficient.
Many search engines don’t recognise common words such as: the, if, and, at, on, in. These are known as ‘stop words’.
Some search engines will allow stop words to be included in the query if the whole query is enclosed in quotes, e.g. if you were looking for the band “The Who”. This is called a literal string, and the search engine will look up the entire set of characters within the quotes as a whole, as opposed to treating each word as a separate entity.
A few of the bigger search engines (notably Google and Bing) now look at search terms in a more intelligent way and historic stop words are now processed, so searching for The Who on Google will return results about the 60s rock artists.
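The older stop-word behaviour can be sketched as below. This is a simplified illustration under stated assumptions: the stop-word list is just the examples given above, the function name is made up, and straight double quotes mark a literal string. Real search engines handle queries in far more sophisticated ways.

```python
import re

# The example stop words listed above; real lists are longer.
STOP_WORDS = {"the", "if", "and", "at", "on", "in"}


def parse_query(query: str) -> list[str]:
    """Split a query into search terms.

    Text inside double quotes is kept whole as a literal string;
    unquoted stop words are dropped.
    """
    terms = []
    # Each match is either a quoted phrase (first group) or a bare word (second group).
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query):
        if phrase:
            terms.append(phrase)
        elif word.lower() not in STOP_WORDS:
            terms.append(word)
    return terms


print(parse_query('the who in concert'))      # ['who', 'concert']
print(parse_query('"The Who" in concert'))    # ['The Who', 'concert']
```

Without the quotes the band name is reduced to the single word "who", which is why the literal string matters on engines that still drop stop words.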
By clicking on the cog icon on the right after entering a search term and then selecting ‘Advanced Search’ you are given further search options.
Not many people take advantage of the Google advanced search page but it is very powerful, allowing you to enter a large range of parameters for your search, including language, file type, or location.
Google offers a wide set of operators that will allow the user to be very specific in the types of searches (and results) they want to use.
The collective term for the various types of searches that can be performed using the operators in Google is ‘dorks’, or ‘Google dorks’.
Onscreen you can see a number of the operators available for Google, and the types of data they can be used to search for.
If you are interested, pause the video to review and test some of the operators.
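As a small illustration of how operators combine into a single query, the sketch below builds a search string and the matching results address. The operators used here (site: and filetype:) are widely documented Google operators, chosen as examples rather than taken from the onscreen list, and the function names and the example.com domain are placeholders.

```python
from urllib.parse import urlencode


def build_query(keywords: str, **operators: str) -> str:
    """Join free-text keywords with operator:value pairs into one query string."""
    parts = [keywords] if keywords else []
    for name, value in operators.items():
        # Quote values containing spaces so they stay attached to the operator.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{name}:{value}")
    return " ".join(parts)


def search_url(query: str) -> str:
    """Return the Google results address for a query, percent-encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_query("annual report", site="example.com", filetype="pdf")
print(query)
print(search_url(query))
```

The same query string can simply be typed into the search bar. Building it in code is only useful when many variations need to be tried consistently.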
Let’s now look at other resources that are available to us as cyber security professionals, which may aid in our investigations.
There are a vast number of forums and discussion boards available, covering every conceivable topic of interest or discussion.
For instance, Usenet and other bulletin board systems actually pre-date the creation of the World Wide Web by a decade or more.
Reddit is a generic discussion forum, allowing users to start any manner of conversation tree, about any topic they care to think of. These discussions are usually placed into a Subreddit, which branches off from a broad subject area, such as Science or Sport.
4Chan is another discussion or bulletin board, biased towards the sharing of images.
It tends to focus quite heavily on the sub-cultures of the Internet and World Wide Web, and has been responsible for the propagation of many of the most well-known memes. A meme is an image, video, piece of text, etc. which is typically humorous in nature and is copied and spread rapidly by Internet users, often with slight variations.
4Chan is well known as tending towards the darker side of humour and much of its content would be quite unsuitable to access on your corporate machine!
Another source of intelligence in cyber security investigations is auction sites.
There are a number of popular auction and sale sites. Many of these sites also include discussion boards, or vendor/buyer feedback mechanisms.
There are quite a number of resources available for seeking out information relating to companies or individuals.
Many of these may require some sort of registration process, or payment to be made.
Many people will re-use usernames across multiple websites, and the resource mentioned in this slide can be used to find instances of this.
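One simple way to look for username re-use is to build the address where a profile would sit on each site and then check those addresses by hand. The sketch below is an assumption-laden illustration and not the resource from the slide: the two URL patterns and the function name are chosen for the example, and it makes no network requests.

```python
from urllib.parse import quote

# Illustrative profile address patterns (assumptions for this example only).
PROFILE_PATTERNS = {
    "GitHub": "https://github.com/{username}",
    "Reddit": "https://www.reddit.com/user/{username}",
}


def candidate_profiles(username: str) -> dict[str, str]:
    """Return the address where this username's profile would be on each site."""
    safe_name = quote(username.strip(), safe="")  # percent-encode unusual characters
    return {
        site: pattern.format(username=safe_name)
        for site, pattern in PROFILE_PATTERNS.items()
    }


for site, url in candidate_profiles("example_user").items():
    print(f"{site}: {url}")
```

A matching address only shows that the name exists on a site, not that the same person is behind it, so any hit needs to be corroborated.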
Archive.org is a non-profit that has the goal of ‘archiving the internet’.
As part of this project, archive.org holds huge libraries of documents, music and videos.
It also runs the ‘Wayback Machine’, which has taken snapshots of millions of websites throughout history. These can then be viewed at a later time. For example, you can look at the CNN website on 11th September 2001, or Google’s first home page in 1998.
More importantly for investigators, these snapshots allow you to go back and check what a website looked like and what the source code was previously.
This has been used when an administrator of a website wrote his own name under the <Author> tag of the first version of a website, before he intended to do anything illegal with it.
Another example of Archive.org being used for investigative purposes was when a white supremacist group wrote racist comments on their website, then removed the comments later. Archive.org was used to find those comments which were then used as evidence in court.
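Snapshots on the Wayback Machine are addressed by a timestamp followed by the original page address, so a link to a given date can be built directly. The sketch below assumes the commonly used web.archive.org address layout with a 14-digit timestamp. The function name is made up, and the time of day is an arbitrary choice for the example.

```python
from datetime import datetime


def wayback_url(page: str, when: datetime) -> str:
    """Build a Wayback Machine address for a page at a given date and time."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14 digits: YYYYMMDDhhmmss
    return f"https://web.archive.org/web/{timestamp}/{page}"


# The CNN example mentioned above, using midday as an arbitrary time.
print(wayback_url("http://www.cnn.com/", datetime(2001, 9, 11, 12, 0, 0)))
```

If no snapshot exists for that exact moment, the service generally shows the capture closest to it, so the date actually displayed should always be noted in an investigation.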
Some websites are created solely for the purpose of sharing sensitive information that has been leaked from elsewhere.
They can contain a lot of useful intelligence for both defenders and attackers in the cyber world, but any information should always be regarded with some caution.
Pastebin is a website that was originally designed to allow programmers to share code extracts they had created, for re-use by others, quality assurance or just comment.
It is only possible to share textual information on Pastebin, but there are ways in which non-textual data, such as pictures or executable programs, can be encoded into text form and therefore posted onto the Pastebin site. Some malware authors make use of this process to try to get their malicious code onto victims’ machines.
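Base64 is one common way of turning binary data into plain text, and recognising it is useful when reviewing pasted content. The transcript does not name a specific encoding, so treat Base64 here as an assumed example. The eight bytes used are the standard signature found at the start of a PNG image.

```python
import base64

# The eight-byte signature that begins every PNG image file.
binary_data = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Encode the raw bytes as ASCII text that could be pasted anywhere text is accepted.
as_text = base64.b64encode(binary_data).decode("ascii")
print(as_text)

# Decoding the text recovers exactly the original bytes.
restored = base64.b64decode(as_text)
print(restored == binary_data)
```

Long runs of letters, digits, plus signs and slashes, often ending in one or two equals signs, are a hint that a paste may contain encoded binary data rather than ordinary text.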
Because Pastebin is primarily aimed at sharing text, it is no surprise to find that it is frequently used to share compromised or stolen personal information.
Doxxing is the Internet-based practice of researching and broadcasting private information (especially personally identifying information) about an individual or organization, and is a regular feature of Pastebin.
Having discussed a number of resources that are aimed at retrieving information from the Surface Web, we now turn to resources for searching information that is on the Deep Web and not indexed by search engines.
The resources shown onscreen give insights into three very different areas of information that may not necessarily be available via a search engine.
Open Source Intelligence, or OSINT, relates to the process of gathering information from a wide variety of online sources, and bringing this information into your investigation process.
The OSINT Framework Tool draws together a huge number of sources into one toolkit, allowing the investigator to rapidly collate disparate sources of information.
Onscreen you can see a number of online resources that can help in carrying out OSINT.
That brings us to the end of this video.