Apache Cassandra and Python Step by Step Guide with Example

Getting Started with Apache Cassandra and Python


For people coming from traditional RDBMSs, the Cassandra data model can seem strange, confusing, and even a bit difficult to understand. Some terms, such as keyspace, are completely new in Cassandra, and others, such as column, do not mean what they mean in an RDBMS. This blog post is a step-by-step guide to Apache Cassandra and Python on Ubuntu, with examples.

Apache Cassandra and Python


1 What is Apache Cassandra

Apache Cassandra is an open source, distributed NoSQL database management system. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It offers strong support for clusters that span multiple data centres, and its asynchronous masterless replication allows low-latency operations for all clients.

Cassandra from ten thousand feet

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

[Source Apache Cassandra]


  • It supports replication and multiple data centre replication.
  • It has immense scalability.
  • It is fault-tolerant.
  • It is decentralised.
  • It has tunable consistency.
  • It provides MapReduce support.
  • It supports Cassandra Query Language (CQL) as an alternative to the Structured Query Language (SQL).

The list of companies using Cassandra is vast and constantly growing. This list includes:

  • Twitter is using Cassandra for analytics.
  • Mahalo uses it for its primary near-time data store.
  •  Facebook still uses it for inbox search, though they are using a proprietary fork.
  • Digg uses it for its primary near-time data store.
  • Rackspace uses it for its cloud service, monitoring, and logging.
  • Reddit uses it as a persistent cache.
  • Cloudkick uses it for monitoring statistics and analytics.
  • Ooyala uses it to store and serve near real-time video analytics data.
  • SimpleGeo uses it as the main data store for its real-time location infrastructure.
  • Onespot uses it for a subset of its main data store.

Users can interact with Cassandra in multiple ways.

  • Command Line Interface (CLI): The latest version is 1.2.4. For the purpose of learning, we worked through this tutorial in the CLI.
  • Cassandra Query Language (CQL): It supports a subset of SQL features. Standard DDL and DML commands can be used, but some SQL features, such as GROUP BY and ORDER BY, are unsupported or restricted. The latest version is 3.0. Because CQL is straightforward, this post does not discuss it in detail.
  • The DataStax Community Package is a software package which supports both the CLI and CQL together and is a quick way to interface with Cassandra.


2 Data Model – Bottom up approach

Cassandra – Data Model

For people coming from traditional RDBMSs, the Cassandra data model can seem strange, confusing, and even a bit difficult to understand. Some terms, such as keyspace, are completely new in Cassandra, and others, such as column, do not mean what they mean in an RDBMS.

Before digging into the key data model concepts in Cassandra with a bottom-up approach, it helps to see how the Cassandra data model maps onto an RDBMS: roughly, a keyspace corresponds to a database, a column family to a table, and a row key to a primary key.

This analogy helps make the transition from the relational to non-relational world. But don’t use this analogy while designing Cassandra column families. Instead, think of the Cassandra column family as a map of a map: an outer map keyed by a row key, and an inner map keyed by a column key. Both maps are sorted.

SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>

A nested sorted map is a more accurate analogy than a relational table, and will help you make the right decisions about your Cassandra data model.


  • A map gives efficient key lookup, and the sorted nature gives efficient scans. In Cassandra, we can use row keys and column keys to do efficient lookups and range scans.
  • The number of column keys is unbounded. In other words, you can have wide rows.
  • A key can itself hold a value. In other words, you can have a valueless column.

A range scan on row keys is possible only when data is partitioned across the cluster using the Order Preserving Partitioner (OPP). OPP is almost never used, so you can think of the outer map as unsorted:

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
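To make the map-of-maps picture concrete, here is a minimal Python sketch of it. It is only a toy model for reasoning about lookups and column range scans, not how Cassandra stores data; the class name ColumnFamily and its methods are invented for this illustration.

```python
from bisect import bisect_left, bisect_right


class ColumnFamily:
    """Toy model: unsorted outer map of row keys, sorted inner map of columns."""

    def __init__(self):
        # row_key -> (sorted list of column keys, dict of column values)
        self._rows = {}

    def insert(self, row_key, column_key, value=None):
        # value=None models a valueless column: the key itself is the data.
        keys, values = self._rows.setdefault(row_key, ([], {}))
        if column_key not in values:
            keys.insert(bisect_left(keys, column_key), column_key)
        values[column_key] = value

    def get(self, row_key, column_key):
        # Key lookup on the outer map, then on the inner map.
        return self._rows[row_key][1][column_key]

    def slice(self, row_key, start, end):
        # Range scan over the sorted column keys of one row (inclusive bounds).
        keys, values = self._rows.get(row_key, ([], {}))
        lo = bisect_left(keys, start)
        hi = bisect_right(keys, end)
        return [(k, values[k]) for k in keys[lo:hi]]


cf = ColumnFamily()
cf.insert("user1", "2017-03", "c")
cf.insert("user1", "2017-01", "a")
cf.insert("user1", "2017-02", "b")
cf.insert("user1", "2017-04")  # valueless column
columns = cf.slice("user1", "2017-02", "2017-03")
print(columns)
```

A row can keep growing new column keys (a wide row), and the slice stays cheap because the inner keys are kept sorted.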

There is also something called a “Super Column” in Cassandra. Think of this as a grouping of columns, which turns our two nested maps into three nested maps as follows:

Map<RowKey, SortedMap<SuperColumnKey, SortedMap<ColumnKey, ColumnValue>>>


  • You need to pass the timestamp with each column value, for Cassandra to use internally for conflict resolution. However, the timestamp can be safely ignored during modeling. Also, do not plan to use timestamps as data in your application. They’re not for you, and they do not define new versions of your data (unlike in HBase).
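The conflict resolution mentioned above is last-write-wins: when replicas hold different values for the same column, the value with the highest timestamp is kept. The sketch below illustrates the idea in plain Python. The function name resolve and the sample values are made up for this example, and tie-breaking between equal timestamps is not modeled.

```python
def resolve(replica_values):
    """Return the (value, timestamp) pair with the highest timestamp."""
    return max(replica_values, key=lambda pair: pair[1])


# Three replicas disagree about the same column; the newest write wins.
winner = resolve([("Delhi", 100), ("London", 250), ("Paris", 175)])
print(winner)
```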

[ Read More.. Source: Data Modeling Practise ]

3 Cassandra Installation on Ubuntu

Development is done in the Apache Check here

Installation from Debian packages

  • Add the Apache repository of Cassandra to /etc/apt/sources.list.d/cassandra.sources.list, for example for the latest 3.11 version:
echo "deb http://www.apache.org/dist/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
  • Add the Apache Cassandra repository keys:
curl https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -
  • Update the repositories:
sudo apt-get update
  • If you encounter this error:
GPG error: http://www.apache.org 311x InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A278B781FE4B2BDA

Then add the public key A278B781FE4B2BDA as follows:

sudo apt-key adv --keyserver pool.sks-keyservers.net --recv-key A278B781FE4B2BDA

and repeat sudo apt-get update. The actual key may be different; take it from the error message itself. For a full list of Apache contributors' public keys, you can refer to https://www.apache.org/dist/cassandra/KEYS.

  • Install Cassandra:
sudo apt-get install cassandra
  • You can start Cassandra with sudo service cassandra start and stop it with sudo service cassandra stop. However, normally the service will start automatically. For this reason be sure to stop it if you need to make any configuration changes.
  • Verify that Cassandra is running by invoking nodetool status from the command line.
  • The default location of configuration files is /etc/cassandra.
  • The default locations of the log and data directories are /var/log/cassandra/ and /var/lib/cassandra.
  • Start-up options (heap size, etc) can be configured in /etc/default/cassandra.

If everything goes well, you should see output like the following:

techfossguru@techfossguru:~$ nodetool status
Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 103.66 KiB 256 100.0% e3c08fec-a280-4447-8f7f-a7656cce1d50 rack1
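If you want to check this output from a script, a small parser is enough. The following is a rough sketch that assumes each node line starts with a two-letter code (U or D for up/down, followed by N, L, J or M for the state), as in the output above. The function names are invented for this example, and the sample text is the output shown in this post.

```python
import re

# A node line starts with a status letter (U/D) and a state letter (N/L/J/M).
NODE_LINE = re.compile(r"^([UD][NLJM])\s")


def node_states(nodetool_output):
    """Collect the two-letter status/state code of every node line."""
    states = []
    for line in nodetool_output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            states.append(match.group(1))
    return states


def all_nodes_up_and_normal(nodetool_output):
    """True when at least one node is listed and every node reports UN."""
    states = node_states(nodetool_output)
    return bool(states) and all(state == "UN" for state in states)


sample = """Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 103.66 KiB 256 100.0% e3c08fec-a280-4447-8f7f-a7656cce1d50 rack1
"""
print(all_nodes_up_and_normal(sample))
```

On a real machine the text could come from subprocess.run(["nodetool", "status"], capture_output=True, text=True).stdout instead of the sample string.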


4 Python Cassandra Driver

cassandra-driver is a Python client driver for Apache Cassandra. This driver works exclusively with the Cassandra Query Language v3 (CQL3) and Cassandra’s native protocol. Cassandra 2.1+ is supported.

This driver is open source under the Apache v2 License. The source code for this driver can be found on GitHub.

Python 2.6, 2.7, 3.3, and 3.4 are supported. Both CPython (the standard Python implementation) and PyPy are supported and tested.

Linux, OSX, and Windows are supported.

5 Installation through pip

pip is the suggested tool for installing packages. It will handle installing all Python dependencies for the driver at the same time as the driver itself. To install the driver*:

pip install cassandra-driver

You can use pip install --pre cassandra-driver if you need to install a beta version.

*Note: if intending to use optional extensions, install the dependencies first. The driver may need to be reinstalled if dependencies are added after the initial installation.
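To confirm that the installation worked, you can try importing the driver's top-level package, cassandra, and printing its version. This is a small sketch; the helper name driver_version is made up for this example, and it returns None when the package cannot be imported.

```python
import importlib


def driver_version():
    """Return the installed driver version string, or None if it is missing."""
    try:
        module = importlib.import_module("cassandra")
    except ImportError:
        return None
    return getattr(module, "__version__", None)


print(driver_version() or "cassandra-driver is not installed")
```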


6 About this Big Tutorial On Cassandra and Python.

CQL for Apache Cassandra

This tutorial gives you just enough information to get you up and running quickly with Apache Cassandra and Python Driver. Learn how to install the driver, connect to a Cassandra cluster, create a session and execute some basic CQL statements.
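As a first taste of those steps, the sketch below connects to a cluster, opens a session and runs one simple query. It assumes a node listening on 127.0.0.1 with the default port and no authentication. The helper names connect and release_version are invented for this example, and the driver is imported inside the function so the file still loads on a machine without it.

```python
def connect(contact_points=("127.0.0.1",), keyspace=None):
    """Open a session; assumes a reachable node and no authentication."""
    # Imported lazily so this module still loads without the driver installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(list(contact_points))
    session = cluster.connect(keyspace)
    return cluster, session


def release_version(session):
    """Ask the connected node which Cassandra release it is running."""
    row = session.execute("SELECT release_version FROM system.local").one()
    return row.release_version
```

With a node running, usage would look like cluster, session = connect() followed by print(release_version(session)) and finally cluster.shutdown().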

Cassandra CQLsh

cqlsh is the Cassandra Query Language shell. After installation, Cassandra provides the cqlsh prompt, which lets users send CQL commands to the cluster.

Cassandra commands are executed in cqlsh.


CQLsh provides a lot of options which you can see in the following table:

Option               Usage
--help               Shows help topics about the cqlsh options.
--version            Shows the version of cqlsh you are using.
--color              Forces colored output.
--debug              Shows additional debugging information.
--execute            Directs the shell to accept and execute a CQL command.
--file="file name"   Executes the commands in the given file and exits.
--no-color           Directs cqlsh not to use colored output.
-u "username"        Authenticates as the given user. The default user name is: cassandra.
-p "password"        Authenticates the user with the given password. The default password is: cassandra.
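If you drive cqlsh from a Python script, these options can be assembled into an argument list. The helper below is a hypothetical convenience written for this post, not part of Cassandra or the driver. It only builds the list and assumes cqlsh is on your PATH when you eventually run it.

```python
def build_cqlsh_command(statement=None, cql_file=None, username=None,
                        password=None, color=None):
    """Build an argument list for cqlsh from the options described above."""
    argv = ["cqlsh"]
    if username:
        argv += ["-u", username]
    if password:
        argv += ["-p", password]
    if color is True:
        argv.append("--color")
    elif color is False:
        argv.append("--no-color")
    if statement:
        argv += ["-e", statement]  # short form of --execute
    if cql_file:
        argv.append("--file=" + cql_file)
    return argv


command = build_cqlsh_command(
    statement="DESCRIBE KEYSPACES",
    username="cassandra",
    password="cassandra",
    color=False,
)
print(command)
```

The resulting list can be passed to subprocess.run(command) on a machine where Cassandra is installed.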

techfossguru@techfossguru:~$ cqlsh
Connected to Test Cluster at
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE techfossguru 
 ... WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
cqlsh> use techfossguru;
cqlsh:techfossguru> CREATE TABLE student_info( student_roll_no int PRIMARY KEY, 
student_name text, student_city text, student_fees varint, student_phone varint );
cqlsh:techfossguru> SELECT * FROM student_info;

student_roll_no | student_city | student_fees | student_name | student_phone

(0 rows)
cqlsh:techfossguru> INSERT INTO techfossguru.student_info JSON '{
 ... "student_roll_no" : "001", 
 ... "student_name" : "Satish Prasad", 
 ... "student_city" : "Delhi",
 ... "student_phone" : "9999912345" }';
cqlsh:techfossguru> select * from techfossguru.student_info
 ... ;

student_roll_no | student_city | student_fees | student_name | student_phone
 1 | Delhi | null | Satish Prasad | 9999912345

(1 rows)
cqlsh:techfossguru> DELETE student_phone FROM techfossguru.student_info 
WHERE student_roll_no=001;
cqlsh:techfossguru> UPDATE techfossguru.student_info SET student_city = 'London' WHERE student_roll_no = 001;
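The same statements can be sent from Python through a driver session. The sketch below assumes the techfossguru keyspace and student_info table created above already exist. The helper function names are made up for this example. Each function takes any object with an execute(query, parameters) method, such as the session returned by the driver's Cluster.connect(), and uses %s placeholders so that the driver binds the values. Note that in the DELETE statement the column is named without a keyspace prefix.

```python
INSERT_STUDENT = (
    "INSERT INTO techfossguru.student_info "
    "(student_roll_no, student_name, student_city, student_phone) "
    "VALUES (%s, %s, %s, %s)"
)
UPDATE_CITY = (
    "UPDATE techfossguru.student_info "
    "SET student_city = %s WHERE student_roll_no = %s"
)
DELETE_PHONE = (
    "DELETE student_phone FROM techfossguru.student_info "
    "WHERE student_roll_no = %s"
)


def insert_student(session, roll_no, name, city, phone):
    """Insert one row; student_fees is left unset and will read as null."""
    session.execute(INSERT_STUDENT, (roll_no, name, city, phone))


def update_city(session, roll_no, city):
    """Change the city of the student with the given roll number."""
    session.execute(UPDATE_CITY, (city, roll_no))


def delete_phone(session, roll_no):
    """Remove only the phone column of the given student."""
    session.execute(DELETE_PHONE, (roll_no,))
```

With a live session, insert_student(session, 1, "Satish Prasad", "Delhi", 9999912345) reproduces the INSERT from the shell session above.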

So far, so good. In the next part of this blog we will work through two examples:

  • Python CRUD Example with Cassandra
  • Flask-CQLAlchemy example


Please feel free to reach out to me if you need more information.