Categories
HTML JavaScript Marketing

Collecting all the email addresses or links on a webpage with JavaScript

Have you ever wanted, or needed, to collect all of the email addresses or links on a website?

Have you ever tried to do this manually?

I have. It’s incredibly tedious.

There’s nothing quite as dehumanizing as robotically copying and pasting hundreds of email addresses or links from a webpage into an Excel sheet, over and over again.

So let’s make sure you never have to do that again.

In this guide, I’ll show you how to use JavaScript to collect all of the email addresses or links on a webpage.

It’s actually pretty easy. So let’s get started.

Thinking before writing your code

Before we begin writing our program, it’s always useful to think about what we’re trying to accomplish. This helps us write a more coherent program.

Here’s what we know we’ll need to do for this program:

  1. Search through the content of the webpage for email/links.
  2. Collect the email/links.

Seems pretty straightforward, right? It is!

Now let’s break down each of the steps above into actual statements that we’ll need to write to complete the task of collecting emails from a website.

Here’s what we need to do:

  1. Create an empty array to populate with email/links.
  2. Specify where in the DOM we want JavaScript to search.
  3. Convert the content of the DOM to a string.
  4. Use the .match method to specify what we’re searching for.
  5. Add the matched items to our array.

Also pretty straightforward, so let’s start writing some code.

I created a test page for you to practice with. Please open this page in a new window beside this one, activate your developer console, and follow the instructions below.

Collecting all the email addresses on a webpage

Once you’ve opened up this page and turned on your JavaScript developer console, you’ll need to create an empty array to store the email addresses.

Please type this into your JavaScript developer console:

var listOfEmails = [];

Great, now let’s determine where we want JavaScript to look for these email addresses.

In the test page I created for you, the email addresses are within the opening <body> and closing </body> HTML tags of the page.

Let’s write a variable and assign the content within these tags to it:

var contentToSearch = document.body.innerHTML;

Now let’s verify the content assigned to our variable is correct by typing the following in our JavaScript developer console:

contentToSearch;

Did you see the HTML content of the page appear in your developer console? Excellent.

The innerHTML property already gives us this content as a string of text, which is exactly what we need in order to search it. To make that explicit, we’ll call the .toString method on our contentToSearch variable. Since the value is already a string, this returns the same text unchanged:

var contentAsText = contentToSearch.toString();

All of the content within the opening <body> and closing </body> HTML tags is now assigned, as text, to the variable contentAsText.
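One detail worth knowing here: innerHTML already returns a string, so calling .toString on it gives back the same text. The small sketch below shows this with a hard-coded string, sampleHtml, which is a made-up stand-in for document.body.innerHTML so the snippet can run outside a browser too:

```javascript
// sampleHtml is a made-up stand-in for document.body.innerHTML.
var sampleHtml = '<p>Contact: <a href="mailto:jane@example.com">jane@example.com</a></p>';

// The HTML content is already a string...
console.log(typeof sampleHtml); // "string"

// ...so .toString() hands back the same text, unchanged.
var sampleAsText = sampleHtml.toString();
console.log(sampleAsText === sampleHtml); // true
```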

Let’s now search through it for the email addresses.

To do this we’ll use the .match method on the variable contentAsText in conjunction with a regular expression which matches some standard email address patterns:

listOfEmails = contentAsText.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi);

Now, to access the list of emails you just need one final step. Type the following in your browser console:

listOfEmails

You should see your list of emails! Well done!

Here’s the entire program we just wrote.

Final program for collecting all the email addresses on a webpage

var listOfEmails = [];
var contentToSearch = document.body.innerHTML;
var contentAsText = contentToSearch.toString();
listOfEmails = contentAsText.match(/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/gi);
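If you plan to reuse this, one possible variation is to wrap the search in a function. The sketch below is only an illustration: the name extractEmails and the sample text are made up. It also handles two things the short version does not: .match returns null when nothing is found, and the same address often appears twice (once in the href and once in the visible text of a link).

```javascript
// Hypothetical helper: returns the unique email-like strings found in text.
function extractEmails(text) {
  var pattern = /[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+/gi;
  // .match returns null (not an empty array) when nothing matches.
  var matches = text.match(pattern) || [];
  // A Set drops duplicates while keeping the original order.
  return Array.from(new Set(matches));
}

// Made-up sample text standing in for document.body.innerHTML.
var sampleText = '<a href="mailto:jane@example.com">jane@example.com</a> or bob@example.org';
console.log(extractEmails(sampleText)); // ["jane@example.com", "bob@example.org"]
console.log(extractEmails("no addresses here")); // []
```

In the browser console you could call it as extractEmails(document.body.innerHTML). The pattern is deliberately simple, so it is worth glancing over the results before using them.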

Collecting all the links on a webpage

This will be pretty similar to the program we wrote for collecting all the email addresses.

Please open this page in a new window beside this one, activate your developer console, and start by creating an array to store our links:

var listOfLinks = [];

Now let’s collect all the links on the page.

Lucky for us, the document object has a built-in property called .links, which collects all the links on a page for us (every <a> and <area> element that has an href attribute), without the need for writing a custom function.

Let’s write a variable to store what the .links property gives us:

var collectLinks = document.links;

Now we’ll need to loop through the links one by one and add them to our array.

Let’s do this with a JavaScript loop which utilizes the .push method:

for(var i=0; i<collectLinks.length; i++) {
  listOfLinks.push(collectLinks[i].href);
}

Did you see the number 12 in your developer console?

If you did, congratulations, you’ve just collected all of the links on the page.
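If you’re curious where that number comes from: .push returns the new length of the array, and the console shows the value of the last statement it ran. Here is a tiny sketch with made-up URLs (demoList and newLength are just illustrative names):

```javascript
var demoList = [];

// .push adds an item and returns the array's new length.
var newLength = demoList.push("https://example.com/");
console.log(newLength); // 1

newLength = demoList.push("https://example.com/about");
console.log(newLength); // 2
```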

Now let’s take a look at the content of our array by typing the following:

listOfLinks;

Notice it collected both the email addresses and the links?

That’s perfectly normal, since both are written with the HTML <a> tag and its href attribute. Email links typically just have an href that starts with mailto:.
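If you’d like to separate the two, one option is to filter on the mailto: prefix. This is a sketch under the assumption that the email links on the page use mailto: in their href; sampleLinks is a made-up stand-in for listOfLinks:

```javascript
// Made-up sample of what listOfLinks might contain.
var sampleLinks = [
  "https://example.com/",
  "mailto:jane@example.com",
  "https://example.com/contact"
];

// Email links start with "mailto:"; everything else is kept as a web link.
var emailLinks = sampleLinks.filter(function (href) {
  return href.indexOf("mailto:") === 0;
});
var webLinks = sampleLinks.filter(function (href) {
  return href.indexOf("mailto:") !== 0;
});

console.log(emailLinks); // ["mailto:jane@example.com"]
console.log(webLinks);   // ["https://example.com/", "https://example.com/contact"]
```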

Well done!

Here’s the entire program we just wrote.

Final program for collecting all the links on a webpage

var listOfLinks = [];
var collectLinks = document.links;
for(var i=0; i<collectLinks.length; i++) {
  listOfLinks.push(collectLinks[i].href);
}
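As an alternative to the loop, Array.from can copy the links and pull out each href in one step. This is just a sketch of that approach, not a replacement for the program above. The names collectHrefs and fakeLinks are made up, and fakeLinks stands in for document.links so the snippet can run outside a browser:

```javascript
// Hypothetical helper: takes any array-like collection of objects with an
// href property (such as document.links) and returns an array of URLs.
function collectHrefs(links) {
  return Array.from(links, function (link) {
    return link.href;
  });
}

// Made-up array-like object standing in for document.links.
var fakeLinks = {
  length: 2,
  0: { href: "https://example.com/" },
  1: { href: "mailto:jane@example.com" }
};

console.log(collectHrefs(fakeLinks)); // ["https://example.com/", "mailto:jane@example.com"]
```

In the browser console you would call collectHrefs(document.links) instead.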

Why this works

You might be wondering how this is possible, considering you don’t have the ability to edit the JavaScript files on a website you are visiting.

Well, you don’t actually need to edit a website’s JavaScript files directly as you can run the JavaScript program in your web browser’s developer console. This is actually one of JavaScript’s best features!

So, with only our web browser and a little bit of JavaScript know-how, we were able to collect all the email addresses and links on a webpage.
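To get the results out of the console and into a spreadsheet, one simple approach is to join the array into text with one item per line, which pastes as a single column. A minimal sketch, where sampleEmails is a made-up stand-in for listOfEmails or listOfLinks:

```javascript
// Made-up stand-in for listOfEmails.
var sampleEmails = ["jane@example.com", "bob@example.org"];

// One item per line, ready to paste into a spreadsheet column.
var oneColumn = sampleEmails.join("\n");
console.log(oneColumn);
```

You can then select and copy the printed text from the console.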

Pretty handy, right?

Next, let’s learn how to add some intelligence to our programs with conditional statements, which let our programs make decisions on their own.

Categories
HTML

HTML basics: syntax, semantics and best practices

HTML is in many ways the most important of the front end languages used to build websites (HTML, CSS and JavaScript). Not only does it provide the architectural blueprints for how your site is structured, but it also tells your web browser what each content type on the page is (lists, headings, pull quotes, etc.).

This post will help you understand how to read HTML. Specifically, it will focus on HTML syntax, semantics and best practices.

HTML stands for HyperText Markup Language. Hypertext is text that can link to other text, which makes it more versatile than the text you read in a book. It has many features that standard book text cannot provide, and links are one of these.

Links look something like this in HTML:

<a href="www.danielpuiatti.com" target="_blank">This is my website!</a>

You are probably familiar with some parts of the HTML link above. I’m certain you recognize the URL part: www.danielpuiatti.com. But what about the <a> and </a>, or the target="_blank"? What are those doing?

To read HTML you must understand its syntax. These aforementioned parts of the link make up its syntax.

What’s syntax?

It’s how sentences are properly structured in a written language. From punctuation to capitalization, proper syntax allows you to easily understand a sentence.

Here’s an example of incorrect English syntax:

.hellO nice to Meet yoU

The structure of the sentence above is syntactically wrong. The period is at the start when it should be at the end, and there is random capitalization strewn about. These errors make the sentence difficult to read.

To be syntactically correct it should look like this:

Hello, nice to meet you.

Whether English or Inuktitut, every written language has syntax. HTML included. Without correct syntax a browser cannot properly read your HTML.

HTML syntax

For HTML, syntax relies on tags. These tags wrap around content, and when the syntax is correct it allows the browser to:

  1. determine what the content is
  2. understand the meaning of content.

In other words, meaning (semantics) and type are derived from the browser’s ability to understand tags, and it can only understand tags if they are syntactically correct.

Let’s look at an example.

This is an example of a paragraph tag wrapped around some text:

<p>This is a paragraph, hooray!</p>

When the browser looks at the example above it reads the opening <p> tag and thinks to itself: “Ah-ha! This is the start of a paragraph, the content within this tag should be presented and structured as a paragraph.”

The browser then moves through the content until it reaches the closing </p> tag at which point it says: “Ah-ha! This is the end of the paragraph. I’ll conclude presenting the content within as a paragraph and look for what tag is next.”

Every paragraph on this page, even this one right now, is wrapped in paragraph tags.

HTML from this page

Notice that every paragraph has three distinct parts:

  1. the opening tag: <p>
  2. the closing tag: </p>
  3. the content within the tag: This is a paragraph, hooray!

The correct syntax for a paragraph in HTML is an opening and closing tag wrapped around some content.

When this syntax is correctly written it is referred to as an element. Elements allow the browser to determine the type of content on the page and also help it derive the meaning of that content (semantics).

This is a paragraph element: <p>Hello world!</p>
This is not a paragraph element: <p>Hello world!
This is not a paragraph element: Hello world!</p>
This is not a paragraph element: <p></p>

HTML Rules

HTML is not case sensitive

<p>This is a paragraph, hooray!</p> is the same as
<P>This is a paragraph, hooray!</P> is the same as
<p>This is a paragraph, hooray!</P> is the same as
<P>This is a paragraph, hooray!</p>

It’s extremely bad form to use uppercase, and absolutely no one mixes capitals with lower case.

My suggestion is to always use lowercase, otherwise developers will look at you funny.

HTML tags must have the correct syntax

The examples below are not written correctly, so the browser may not display them the way you intended:

<p>This is bad HTML
<p>This is also bad HTML</p
<p>This is bad HTML<p>

Tags on their own are sort of meaningless, but when wrapped correctly around content, like in the paragraph example earlier, they tell the browser: “Hey! Browser! Listen up, this is a paragraph! So you should make it look and act like a paragraph.”

The browser has no choice but to obey the instructions HTML gives it, and as such, displays and lets you interact with it as a paragraph.

HTML tag attributes

HTML tags can have attributes. These provide additional information about an element.

Attributes are always specified in the start tag and come in name/value pairs like target="_blank", where target is the name and _blank is the value.

Here’s an example of a link with attributes:

<a href="www.danielpuiatti.com" target="_blank">This is my website!</a>

Here’s an example of a list with attributes:

<ul style="list-style-type:disc;">
<li>List item one</li>
<li>List item two</li>
<li>List item three</li>
<li>List item four</li>
<li>List item five</li>
</ul>
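To make the name/value idea concrete, here is a rough JavaScript sketch that pulls the pairs out of a start tag. The helper name readAttributes is made up for illustration, and it only understands double-quoted values, so treat it as a teaching aid and not as a real HTML parser:

```javascript
// Hypothetical helper: reads name="value" pairs out of a start tag string.
// It only handles double-quoted values.
function readAttributes(startTag) {
  var pattern = /([a-zA-Z-]+)="([^"]*)"/g;
  var attributes = {};
  var match;
  // Each successful exec call finds the next name="value" pair.
  while ((match = pattern.exec(startTag)) !== null) {
    attributes[match[1]] = match[2]; // name -> value
  }
  return attributes;
}

var linkAttributes = readAttributes('<a href="www.danielpuiatti.com" target="_blank">');
console.log(linkAttributes.href);   // "www.danielpuiatti.com"
console.log(linkAttributes.target); // "_blank"
```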

Attribute values should always be wrapped in quotes, and attributes must be contained within the start tag. These examples will not work as intended:

<a href="www.danielpuiatti.com">My site <target="_blank" />
<a href="www.danielpuiatti.com" target=_blank>My site</>

Tag types

Since there are multiple content types on a website (paragraphs, lists, titles, links, etc.) you must wrap your content in tags for the browser to understand what each one is, otherwise, it has no way to know which is which. For example:

This content is a paragraph.

  • This content over here?
  • It’s different from the paragraph!
  • It’s a list!

And this? It’s a blockquote!

There are many HTML tags, each with specific semantic meaning and proper syntax. w3schools has a fantastic guide on HTML tags. I recommend taking a moment to read up on the various types. Some good ones to get started with are lists, links, headings and paragraphs.

OK, I hope this post has helped you understand how to read HTML, and that you now have a better grasp of HTML syntax, semantics and best practices.

Here’s a handy image I found that brings together everything from this post.

Image: HTML element syntax
source: Wikipedia

If you are ready for the next step, I’d suggest reading my beginner’s guide to CSS basics: syntax, semantics and best practices.

Categories
CSS HTML JavaScript Web browsers

How a web browser builds a website

If you want to learn how to build websites, program in JavaScript, become a front end web developer, or are just genuinely curious about how websites work, you need to understand how a web browser (browser) builds a website.

This post will help you do just that.

Have you ever read something online? I’m willing to bet you have. Was it a news article or a long-form travel story? A Wikipedia article or maybe a serialized choose your own adventure Harry Potter fan fiction? Perhaps it was this very blog post?

Regardless, do you remember when it was written and what it was about? Do you remember how it looked and who wrote it? Or, better yet, can you tell me what wrote it?

That’s right, what.

Not sure? Here’s a hint: when you look at a blog post, news article, or any written content on your browser you are looking at the end result of a set of precisely followed instructions. These instructions tell your browser how to build, present and interact with the content on your screen. So, in answer to the question, your browser is the what. The who, I’ll leave to you.

The how? The precisely followed instructions!

These instructions come in three flavours: HTML, CSS and JavaScript, each of which contains directions that your browser follows to build a webpage. Incidentally, someone who specializes in these three technologies is called a front end developer or a web developer (I guess I didn’t leave the who to you after all).

Let’s take this one step further by using an analogy.

Imagine for a moment that your browser is a construction worker who needs to build a house. The house, in this case, is a website, and HTML, CSS and JavaScript are the blueprints for different parts of the house: the architecture, the style and the interactivity.

HTML (hypertext markup language)

This is the architecture of your house — it tells your browser how to organize the content on the page and what that content is.

It also tells your browser where it can find the other resources necessary to finish building the page, specifically the location of the CSS and JavaScript files. 

HTML is the first thing a browser reads when building a website, and while it has a specific initial structure, this structure can be expanded, reduced or transformed by CSS and JavaScript, in the same way you can add rooms, remove them or transform them on a blueprint. A kitchen can become a living room after all, especially on a blueprint.

HTML tells your browser the semantic meaning of the content on your website. This is similar to a legend on a map. Without this legend, your browser would not be able to distinguish between the content types on your web page (paragraphs, lists, titles, etc.). Consequently, your browser would not be able to correctly assign the styles and behaviours that distinguish one content type from another. You don’t want your lists displaying as paragraphs, do you?

In other words, HTML provides a reference so the browser can understand what the content is:

That content is a paragraph.

  • This content over here?
  • It’s different from the paragraph!
  • It’s a list!

HTML provides the level of detail your browser needs to distinguish between the content types.

For these reasons, I like to think of HTML as the architectural blueprint of your house.

CSS (Cascading Style Sheets)

This is the style of your house. From the carpet to the curtains, wallpaper to shingles, CSS is responsible for telling the browser how your web page should look and, if you desire, how it should transform.

CSS builds upon the architecture which HTML provides and references it to know which items it needs to apply styles to. Things like colours, spacing, animation and layout are controlled by CSS — without it, your webpage is boring.

After the HTML is loaded, your browser consults the CSS to understand how it should style all of the architecture laid out by HTML. Specifically, your browser references CSS to determine what the content defined in HTML should look like.

Everything from how the page should look on a phone, tablet or laptop to the colour of text, the spacing between lines and, as mentioned, the animation is provided by CSS.

For these reasons, I like to think of CSS as the style blueprint for your house.

JavaScript

This is the interactivity of your house — from when the lights should turn on to when the thermostat should lower, what time your alarm clock goes off to what temperature your oven should pre-heat to.

If it’s something that you can interact with, it’s probably governed by JavaScript.

Like CSS, JavaScript builds upon the architecture that HTML provides, and after the architecture of your house is built and styled, JavaScript goes to work defining what can be interacted with and how.

JavaScript governs interactions through event triggers and outputs:

An event trigger is something that needs to happen before an output can take place. JavaScript can tell your browser to listen for a specific event trigger (perhaps a click or a scroll), and when this trigger takes place, to perform a specific action.

An output is the action that happens after the trigger is activated. For example, let’s say someone presses the doorbell at your house (the event trigger); the output would be the specific sound that plays. Another example: you turn your stove on (the event trigger), and the temperature it rises to before turning off is the output.
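In code, a trigger and its output might look something like the sketch below. In a browser you would normally listen on an element of the page. Here a bare EventTarget named doorbell is used as a made-up stand-in, so the example can run on its own:

```javascript
// A made-up stand-in for a doorbell button on a page.
var doorbell = new EventTarget();
var sounds = [];

// Tell the browser what to do when the trigger happens.
doorbell.addEventListener("click", function () {
  sounds.push("Ding dong!"); // the output
});

// Simulate someone pressing the doorbell (the event trigger).
doorbell.dispatchEvent(new Event("click"));

console.log(sounds); // ["Ding dong!"]
```

Nothing is added to sounds until the click event is dispatched, which is the trigger-then-output order described above.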

JavaScript is the web browser’s equivalent of cause and effect: through event triggers and outputs, it brings interactivity to your page. And interactivity is an absolutely essential component of modern websites.

For these reasons, I like to think of JavaScript as the blueprint for the interactivity of your house.

Wrapping it all up: HTML, CSS and JavaScript

  • HTML is the architecture
  • CSS is the style
  • JavaScript is the interactivity
  • Each of these technologies makes up the front end of your website and each is essential to what most users expect from a modern website.
  • Your web browser follows the blueprints laid out in HTML, CSS and JavaScript to determine how to build a webpage, how this webpage should look and how you’ll be allowed to interact with it.
  • Someone who specializes in these technologies is called a web developer or front end developer.

So, hopefully after reading this post you now feel more familiar with how your browser constructs a web page.

Next, it’s time to learn about HTML, the most important part of a website.