Scraping Telegram: Alternative News Sources
- Difficulty: Medium
- Requirements:
- Basic knowledge of the python syntax
- Knowledge on how to use
pip
- A favourite code editor (or so called “IDE”)1
- A Telegram account (alongside a phone with telegram installed)
Introduction
In this tutorial, we show how journalistic media texts can be retrieved from Telegram. Many news outlets, especially alternative media platforms use Telegram to engage with their audiences. The low degree of moderation and supervision makes Telegram not only an interesting platform for media outlets spreading (mis-)information, but also for fringe-groups of the political spectrum. They tend to use public channels on Telegram to establish their narratives or to recruit and to mobilise supporters.
This tutorial shows how to use the official API while respecting Telegram’s terms of services. You can even retrieve many data points beyond only the message content. Because you need to specify only the channel name, you can easily get started and study information flows between channels or different social media platforms.
Overview
- Installing the python module
- Testing the API
- Connecting to the chat
- Retrieving messages
- Downloading and storing all messages
- Summary
A set of finished sample scripts can be downloaded at the bottom of this page.
Preparations & Installation
To keep things neat and tidy, create a new directory for this tutorial and name it telegram_scraping
. For the remainder of the tutorial, we assume that all commands are run from this directory.
Telegram API Account
Before we jump into this tutorial, make sure that you have a Telegram account and retrieved your API credentials. Please note that a valid phone number is required to open an account.
The exact procedures for obtaining API credentials might change from time to time. Therefore, please follow the steps outlined in the official documentation for Telegram developers and then come back to this tutorial with a valid API ID and API Hash.
Required Packages
To install the Telethon library open a terminal and install it via pip
pip install telethon
Testing the API
Create a new file called secrets.json
, this is where you store your Telegram API credentials. Saving them in a separate file keeps your credentials detached from the scraping scripts. Open secrets.json
and edit it accordingly:
secrets.json
{
"api_id": "<your api id>",
"api_hash": "<your api hash>"
}
Next, we create a new python script and name it check_api.py
in which we import the following:
check_api.py
import json # to load our API credentials
from telethon.sync import TelegramClient # the Telegram client
If everything is installed correctly, these imports should run without errors.
Now, we load our API credentials:
with open('secrets.json') as f:
= json.load(f) credentials
Next, we instantiate a new TelegramClient
where we pass in our API credentials:
= TelegramClient('user',
client 'api_id'],
credentials['api_hash']).start() credentials[
The TelegramClient
class needs three arguments: the session name ('user'
), the API ID and the API hash. The session name tells Telethon where to store the session credentials. The first time you run above code, you will be prompted to enter your phone number to verify it is really you:
Please enter your phone (or bot token): +01 355-1337
Once you entered your phone number, Telegram will send you a verification code to your phone. You get prompted again to enter the verification code in python:
Please enter the code you received: 12345
Signed in successfully as User
We have to complete this only once. If you observe your current directory carefully, you will notice that a new file appeared: user.session
. Telethon stores the session information in this file. The file name is specified in the first argument when instantiating the TelegramClient
(remember, we passed in 'user'
). If you delete user.session
, you will have to repeat the login verification.
A great way of checking whether the client is configured correctly is by sending a message to yourself:
'me', 'Hello to myself!') client.send_message(
If you check Telegram on your phone, you should have received a message.
Connecting to a chat
In this tutorial we will explore the Telegram account by ovalmedia (see the entry on Meteor for more information). It is an alternative media outlet that mainly offers content on alternative video streaming platforms. They promote their recent videos on telegram alongside a summary of each video. Of course, you could also use any other Telegram account for this tutorial; feel free to experiment.
We create a new script connect.py
where import the same modules as before and instantiate the TelegramClient
:
connect.py
import json # to load our API credentials
from telethon.sync import TelegramClient # the Telegram client
from pprint import pprint # helper to pretty print
with open('secrets.json') as f:
= json.load(f)
credentials
= TelegramClient('user',
client 'api_id'],
credentials['api_hash']).start() credentials[
Next, we save the username of ovalmedia in an object and use the get_entity()
method to retrieve the account:
= 'ovalmedia_english'
account_name = client.get_entity(account_name) chat
We can now check the official name of the account by accessing the title
attribute of the chat
object:
chat.title# 'OVALmedia | English'
With this step, we can also verify whether we really got the account that we were looking for.
Retrieving messages
Before we scrape all messages in a channel, let’s check a single message first. We can retrieve messages with the get_messages()
method, which takes the chat
object as its argument:
= client.get_messages(chat) messages
The messages
object behaves like a python list, but additionally as a total
attribute that shows the total number of messages in the chat:
print(messages.total)
The first element in the list is also the most recent message, and we can access it like this:
= messages[0] message
message
the message object has a series of attributes, the most interesting for us are:
id
(int
): message id within this chatdate
(datetime
): timestamp when message was sentmessage
(str
): the actual message contentforwards
(int
): number of times the message was forwardedviews
(int
): number of chat members who have seen the message
Let’s have a look at the most recent message in the channel:
print('Newest message:')
print(message.id, message.date)
print('Message content:')
print(message.message)
print('Total views:', message.views, 'Total forwards:', message.forwards)
We can also use the to_dict()
method to get all message contents and attributes as a python dictionary:
pprint(message.to_dict())
Reformatting a message
Of course, we could just take the message.to_dict()
object and store it as a JSON file. However, then we would also store less interesting information and would store the nested structure. Therefore, we reformat some fields and also unnest the content.
= {'channel_name': chat.username, # keep track on where we got the message from
cleaned_message 'id': message.id,
'date': message.date,
'content': message.message,
'forwards': message.forwards,
'views': message.views}
Downloading and storing all messages
If we want to retrieve all messages in from a telegram channel, we could use the get_messages()
method and iterate over the resulting list. This works well for small chats, but less well for chats with thousands of messages. Instead, we use the iter_messages()
method, which retrieves messages in batches and also has useful features such as limiting the number of messages to retrieve, or to search for keywords.
We create a new script that we name scrape_channel.py
and we make the same imports as before, but also import some utility modules
scrape_channel.py
import json # to load our API credentials
from telethon.sync import TelegramClient # the Telegram client
from pprint import pprint # helper to pretty print
from pathlib import Path # makes handling file paths a breeze
with open('secrets.json') as f:
= json.load(f)
credentials
= TelegramClient('user',
client 'api_id'],
credentials['api_hash']).start()
credentials[
= 'ovalmedia_english' account_name
We prepare an output folder with the Path
class:
= Path.cwd()
p
= p / 'output' / account_name
output_dir if not output_dir.exists():
=True) output_dir.mkdir(parents
Next, we get the chat as we did before:
= client.get_entity(account_name) chat
Now, we use the iter_messages()
method to incrementally retrieve messages (from newest to oldest). We use the method as for-loop generator, where with every iteration, we get a message, reformat it and then store it as a JSON file:
for message in client.iter_messages(chat):
print('Retrieving message:', message.id)
= {'channel_name': chat.username,
cleaned_message 'id': message.id,
'date': message.date,
'content': message.message,
'forwards': message.forwards,
'views': message.views}
= output_dir / f'{message.id}.json'
file_name
with open(file_name, 'w') as f:
=True, default=str) json.dump(cleaned_message, f, indent
When you execute the code block above, you can observe how the output directory is filled with single JSON files, where each file represents a single message. Of course you can also use pandas instead and construct a data frame:
import pandas as pd
= []
tmp
for message in client.iter_messages(chat):
print('Retrieving message:', message.id)
= {'channel_name': chat.username,
cleaned_message 'id': message.id,
'date': message.date,
'content': message.message,
'forwards': message.forwards,
'views': message.views}
tmp.append(cleaned_message)
= pd.DataFrame(tmp) df
Filtering Messages
We could also filter the results by using the search
keyword argument, e.g.:
for message in client.iter_messages(chat, search='europe'):
...
Or set an offset date to retrieve only messages before a certain date, e.g.:
from datetime import datetime
= datetime.strptime('2022-12-31', '%Y-%m-%d')
my_datetime
for message in client.iter_messages(chat, offset_date=my_datetime):
...
Summary
We have shown how to use the Telethon library to download media text data in a structure and clean format. The example shown here was an alternative media outlet from Germany. The advantage of this method is that the above code works on a large number of accounts, since Telegram is also popular among political parties to engage with their supporters.
There are three scripts attached to this tutorial:
Footnotes
If you are not sure about which IDE to use, we recommend VSCode, PyCharm, or Jupyter Notebook↩︎