Apache Cassandra and Python Step by Step Guide with Example

Getting Started with Apache Cassandra and Python


For people coming from a traditional RDBMS, the Cassandra data model can seem strange, confusing and maybe even a bit difficult to understand. Some terms, such as keyspace, are completely new in Cassandra, and others, such as column, do not mean the same thing as in an RDBMS. This blog post is a step-by-step guide, with examples, to using Apache Cassandra with Python on Ubuntu.


Introduction 

This tutorial gives you just enough information to get up and running quickly with Apache Cassandra and its Python driver. You will learn how to install the driver, connect to a Cassandra cluster, create a session and execute some basic CQL statements.

In this post we will go through some basic theory around Apache Cassandra, its key differences from an RDBMS, installing the required packages on Ubuntu, the Cassandra driver for Python, and a CRUD example.

Let's cover some details around Apache Cassandra.

1 So, What is Apache Cassandra?

Apache Cassandra is an open source, distributed NoSQL database management system. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It offers strong support for clusters that span multiple data centres, and its asynchronous masterless replication allows low-latency operations for all clients.

Cassandra from ten thousand feet

The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

[Source Apache Cassandra]

Features:

  • It supports replication and multiple data centre replication.
  • It has immense scalability.
  • It is fault-tolerant.
  • It is decentralised.
  • It has tunable consistency.
  • It provides MapReduce support.
  • It supports Cassandra Query Language (CQL) as an alternative to the Structured Query Language (SQL).
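To make "tunable consistency" a little more concrete, here is a minimal sketch of the usual rule of thumb: a QUORUM is a majority of replicas, and a read is guaranteed to see the latest write when the number of replicas read plus the number written exceeds the replication factor. The function names are my own and are not part of any Cassandra API.

```python
def quorum(replication_factor):
    """Number of replicas that must answer for a QUORUM read or write."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    """True when the read set and write set must overlap (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(q)                                  # 2
    print(reads_see_latest_write(q, q, rf))   # True
    print(reads_see_latest_write(1, 1, rf))   # False
```

With a replication factor of 3, reading and writing at QUORUM trades some latency for consistency, while reading and writing a single replica is faster but may return stale data.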

The list of companies using Cassandra is vast and constantly growing. This list includes:

  • Twitter is using Cassandra for analytics.
  • Mahalo uses it for its primary near-time data store.
  •  Facebook still uses it for inbox search, though they are using a proprietary fork.
  • Digg uses it for its primary near-time data store.
  • Rackspace uses it for its cloud service, monitoring, and logging.
  • Reddit uses it as a persistent cache.
  • Cloudkick uses it for monitoring statistics and analytics.
  • Ooyala uses it to store and serve near real-time video analytics data.
  • SimpleGeo uses it as the main data store for its real-time location infrastructure.
  • Onespot uses it for a subset of its main data store.

Users can interact with Cassandra in multiple ways.

  • Command Line Interface (CLI): the legacy cassandra-cli tool. It is no longer shipped with Cassandra 3.x, so this tutorial does not use it.
  • Cassandra Query Language (CQL): supports a subset of SQL-like features. Standard DDL and DML statements can be used, but some SQL features such as joins are not supported, and ORDER BY only works on clustering columns. The examples in this tutorial use CQL 3 through cqlsh.
  • The DataStax Community Package is a software package which supports both the CLI and CQL together and is a quick way to interface with Cassandra.

 

2 Data Model – Bottom-up approach

Cassandra – Data Model

For people coming from a traditional RDBMS, the Cassandra data model can seem strange, confusing and maybe even a bit difficult to understand. Some terms, such as keyspace, are completely new in Cassandra, and others, such as column, do not mean the same thing as in an RDBMS.

Before we dig into some of the key data model concepts in Cassandra, following a bottom-up approach, we would like to illustrate how the Cassandra data model can be mapped to an RDBMS.

This analogy helps make the transition from the relational to non-relational world. But don’t use this analogy while designing Cassandra column families. Instead, think of the Cassandra column family as a map of a map: an outer map keyed by a row key, and an inner map keyed by a column key. Both maps are sorted.

SortedMap<RowKey, SortedMap<ColumnKey, ColumnValue>>

A nested sorted map is a more accurate analogy than a relational table, and will help you make the right decisions about your Cassandra data model.

How?

  • A map gives efficient key lookup, and the sorted nature gives efficient scans. In Cassandra, we can use row keys and column keys to do efficient lookups and range scans.
  • The number of column keys is unbounded. In other words, you can have wide rows.
  • A key can itself hold a value. In other words, you can have a valueless column.
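The three points above can be sketched in a few lines of Python. This is only an in-memory illustration of the map-of-maps idea, not how Cassandra stores data: the sample rows and the helper names are made up, and a plain dict is sorted on each scan, whereas Cassandra keeps column keys sorted on disk.

```python
from bisect import bisect_left, bisect_right

# Outer map: row key -> inner map of column key -> column value.
# Hypothetical sample data, used only for illustration.
rows = {
    "user:1": {"city": "Delhi", "email": "a@example.com", "name": "Satish"},
    "user:2": {"city": "London", "name": "Jiri"},
}


def get_column(rows, row_key, column_key):
    """Point lookup: row key first, then column key."""
    return rows[row_key][column_key]


def column_slice(rows, row_key, start, end):
    """Range scan: (column_key, value) pairs with start <= column_key <= end."""
    columns = rows[row_key]
    keys = sorted(columns)          # the inner map behaves as a sorted map
    lo = bisect_left(keys, start)
    hi = bisect_right(keys, end)
    return [(key, columns[key]) for key in keys[lo:hi]]


if __name__ == "__main__":
    print(get_column(rows, "user:2", "city"))
    print(column_slice(rows, "user:1", "city", "email"))
```

Note that the two rows do not need to have the same set of column keys, which is what makes wide rows possible.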

Range scan on row keys is possible only when data is partitioned in a cluster using the Order Preserving Partitioner (OPP). OPP is almost never used. So, you can think of the outer map as unsorted:

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>

There is also something called a "Super Column" in Cassandra. Think of this as a grouping of columns, which turns our two nested maps into three nested maps as follows:

Map<RowKey, SortedMap<SuperColumnKey, SortedMap<ColumnKey, ColumnValue>>>

Notes:

  • You need to pass the timestamp with each column value, for Cassandra to use internally for conflict resolution. However, the timestamp can be safely ignored during modeling. Also, do not plan to use timestamps as data in your application. They’re not for you, and they do not define new versions of your data (unlike in HBase).

[ Read More.. Source: Data Modeling Practise ]

3 How to install Cassandra on Ubuntu

Cassandra is developed as an Apache project; check the Apache Cassandra site for current releases.

Installation from Debian packages

  • Add the Apache repository of Cassandra to /etc/apt/sources.list.d/cassandra.sources.list, for example for the latest 3.11 version:
echo "deb http://www.apache.org/dist/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
  • Add the Apache Cassandra repository keys:
curl https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -
  • Update the repositories:
sudo apt-get update
  • If you encounter this error:
GPG error: http://www.apache.org 311x InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A278B781FE4B2BDA

Then add the public key A278B781FE4B2BDA as follows:

sudo apt-key adv --keyserver pool.sks-keyservers.net --recv-key A278B781FE4B2BDA

and repeat sudo apt-get update. The actual key may be different; you can get it from the error message itself. For a full list of Apache contributors' public keys, refer to https://www.apache.org/dist/cassandra/KEYS.

  • Install Cassandra:
sudo apt-get install cassandra
  • You can start Cassandra with sudo service cassandra start and stop it with sudo service cassandra stop. However, normally the service will start automatically. For this reason be sure to stop it if you need to make any configuration changes.
  • Verify that Cassandra is running by invoking nodetool status from the command line.
  • The default location of configuration files is /etc/cassandra.
  • The default locations of the log and data directories are /var/log/cassandra/ and /var/lib/cassandra.
  • Start-up options (heap size, etc) can be configured in /etc/default/cassandra.

If everything goes well, you will see output like the following:

techfossguru@techfossguru:~$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 103.66 KiB 256 100.0% e3c08fec-a280-4447-8f7f-a7656cce1d50 rack1

techfossguru@techfossguru:~$
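If you want to check the node state from a script rather than by eye, a small parser over the nodetool status text is enough. The sketch below is my own helper, not part of Cassandra or the Python driver; it assumes the output layout shown above, where each node line starts with a two-letter code (U/D for status, N/L/J/M for state).

```python
import re

# Sample text copied from the nodetool status output shown above.
SAMPLE = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 103.66 KiB 256 100.0% e3c08fec-a280-4447-8f7f-a7656cce1d50 rack1
"""

# A node line starts with a status letter, a state letter, then the address.
NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)")


def parse_nodes(status_text):
    """Return one dict per node line found in `nodetool status` output."""
    nodes = []
    for line in status_text.splitlines():
        match = NODE_LINE.match(line)
        if match:
            status, state, address = match.groups()
            nodes.append({
                "address": address,
                "up": status == "U",
                "normal": state == "N",
            })
    return nodes


if __name__ == "__main__":
    for node in parse_nodes(SAMPLE):
        print(node)
```

On a real machine you could feed it the live output, for example the stdout of subprocess.run(["nodetool", "status"], capture_output=True, text=True).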

4 How to Install Python Cassandra Driver

The Python driver is a client library for Apache Cassandra. It works exclusively with the Cassandra Query Language v3 (CQL3) and Cassandra's native protocol. Cassandra 2.1+ is supported.

This driver is open source under the Apache v2 License. The source code for this driver can be found on GitHub.

At the time of writing, Python 2.6, 2.7, 3.3, and 3.4 are supported. Both CPython (the standard Python implementation) and PyPy are supported and tested.

Linux, OSX, and Windows are supported.

5 Installation through pip

pip is the suggested tool for installing packages. It will handle installing all Python dependencies for the driver at the same time as the driver itself. To install the driver*:

pip install cassandra-driver

You can use pip install --pre cassandra-driver if you need to install a beta version.

*Note: if intending to use optional extensions, install the dependencies first. The driver may need to be reinstalled if dependencies are added after the initial installation.
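After the install, a quick way to confirm that the driver is importable is to load its top-level cassandra package and print its version. This is a small sketch with a helper name of my own; it returns None instead of failing when the driver is missing.

```python
import importlib


def installed_driver_version():
    """Return the cassandra-driver version string, or None if it is not installed."""
    try:
        module = importlib.import_module("cassandra")
    except ImportError:
        return None
    return getattr(module, "__version__", None)


if __name__ == "__main__":
    version = installed_driver_version()
    print(version if version else "cassandra-driver is not installed")
```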

6 Cassandra CQLsh

CQLsh stands for Cassandra CQL shell. After installation, Cassandra provides the Cassandra Query Language shell (cqlsh), a command-line prompt through which users communicate with the database.

Cassandra commands are executed on CQLsh.

CQLsh provides a lot of options, which you can see in the following table:

Option                Usage
--help                Shows help topics about the options of CQLsh commands.
--version             Shows the version of the CQLsh you are using.
--color               Uses colored output.
--debug               Shows additional debugging information.
--execute             Directs the shell to accept and execute a CQL command.
--file="file name"    Executes the commands in the given file and exits.
--no-color            Directs Cassandra not to use colored output.
-u "username"         Authenticates a user. The default user name is: cassandra.
-p "password"         Authenticates a user with a password. The default password is: cassandra.

7 Let's try hands-on with Cassandra CQLsh


CQL for Apache Cassandra

techfossguru@techfossguru:~$ cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE techfossguru 
 ... WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 3};
cqlsh> use techfossguru;
cqlsh:techfossguru> CREATE TABLE student_info( student_roll_no int PRIMARY KEY, 
student_name text, student_city text, student_fees varint, student_phone varint );
cqlsh:techfossguru> SELECT * FROM student_info;

student_roll_no | student_city | student_fees | student_name | student_phone
-----------------+--------------+--------------+--------------+---------------

(0 rows)
cqlsh:techfossguru> INSERT INTO techfossguru.student_info JSON '{
 ... "student_roll_no" : "001", 
 ... "student_name" : "Satish Prasad", 
 ... "student_city" : "Delhi",
 ... "student_phone" : "9999912345" }';
cqlsh:techfossguru> select * from techfossguru.student_info
 ... ;

student_roll_no | student_city | student_fees | student_name | student_phone
-----------------+--------------+--------------+---------------+---------------
 1 | Delhi | null | Satish Prasad | 9999912345

(1 rows)
cqlsh:techfossguru> DELETE student_phone FROM techfossguru.student_info WHERE student_roll_no = 1;
cqlsh:techfossguru> UPDATE techfossguru.student_info SET student_city = 'London' WHERE student_roll_no = 1;
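The INSERT ... JSON statement used in the session above maps nicely onto a Python dict. The sketch below only builds the statement text and its single parameter; the helper name is my own, and the commented-out last line assumes a session object obtained from the Python driver's Cluster.connect().

```python
import json


def to_insert_json(keyspace, table, row):
    """Build an INSERT ... JSON statement and its single bound parameter."""
    statement = "INSERT INTO {}.{} JSON %s".format(keyspace, table)
    return statement, (json.dumps(row),)


if __name__ == "__main__":
    # Same values as the cqlsh session above.
    student = {
        "student_roll_no": 1,
        "student_name": "Satish Prasad",
        "student_city": "Delhi",
        "student_phone": 9999912345,
    }
    statement, params = to_insert_json("techfossguru", "student_info", student)
    print(statement)
    print(params[0])
    # With a connected driver session (assumption): session.execute(statement, params)
```

Passing the JSON text as a parameter, rather than pasting it into the statement string, leaves the quoting to the driver.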



You can always refer to the sheet below to see how the above syntax differs from other RDBMS/SQL dialects.


So far so good. Let's move on to actual manipulation of data using a Python program. These examples are simple and for introduction purposes only; they are not optimised for a real production environment!

You can read more details about the Python driver for Cassandra here. I will cover two examples.

  • Python CRUD Example with Cassandra – shows the basic usage of the Python driver
  • Another Example using Flask-CQLAlchemy – shows how Flask-CQLAlchemy interacts with Flask in a real-world example

8 Example : Python CRUD Example with Cassandra

"""
Python  by Techfossguru
Copyright (C) 2017  Satish Prasad

"""
import logging

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement


class PythonCassandraExample:

    def __init__(self):
        self.cluster = None
        self.session = None
        self.keyspace = None
        self.log = None

    def __del__(self):
        # Only shut down if a cluster connection was actually created
        if self.cluster is not None:
            self.cluster.shutdown()

    def createsession(self):
        self.cluster = Cluster(['localhost'])
        self.session = self.cluster.connect(self.keyspace)

    def getsession(self):
        return self.session

    # How about Adding some log info to see what went wrong
    def setlogger(self):
        log = logging.getLogger()
        log.setLevel('INFO')
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
        log.addHandler(handler)
        self.log = log

    # Create Keyspace based on Given Name
    def createkeyspace(self, keyspace):
        """
        :param keyspace:  The Name of Keyspace to be created
        :return:
        """
        # Before creating a new keyspace, check for an existing one; if found, drop it and create it again
        rows = self.session.execute("SELECT keyspace_name FROM system_schema.keyspaces")
        if keyspace in [row[0] for row in rows]:
            self.log.info("dropping existing keyspace...")
            self.session.execute("DROP KEYSPACE " + keyspace)

        self.log.info("creating keyspace...")
        self.session.execute("""
                CREATE KEYSPACE %s
                WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '2' }
                """ % keyspace)

        self.log.info("setting keyspace...")
        self.session.set_keyspace(keyspace)

    def create_table(self):
        c_sql = """
                CREATE TABLE IF NOT EXISTS employee (emp_id int PRIMARY KEY,
                                              ename varchar,
                                              sal double,
                                              city varchar);
                 """
        self.session.execute(c_sql)
        self.log.info("Employee Table Created !!!")

    # lets do some batch insert
    def insert_data(self):
        insert_sql = self.session.prepare("INSERT INTO  employee (emp_id, ename , sal,city) VALUES (?,?,?,?)")
        batch = BatchStatement()
        batch.add(insert_sql, (1, 'LyubovK', 2555, 'Dubai'))
        batch.add(insert_sql, (2, 'JiriK', 5660, 'Toronto'))
        batch.add(insert_sql, (3, 'IvanH', 2547, 'Mumbai'))
        batch.add(insert_sql, (4, 'YuliaT', 2547, 'Seattle'))
        self.session.execute(batch)
        self.log.info('Batch Insert Completed')

    def select_data(self):
        rows = self.session.execute('select * from employee limit 5;')
        for row in rows:
            print(row.ename, row.sal)

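    def update_data(self):
        # Example values (assumption): move employee 1 to a new city
        update_sql = self.session.prepare("UPDATE employee SET city = ? WHERE emp_id = ?")
        self.session.execute(update_sql, ('London', 1))
        self.log.info('Update Completed')

    def delete_data(self):
        # Example value (assumption): remove employee 4
        delete_sql = self.session.prepare("DELETE FROM employee WHERE emp_id = ?")
        self.session.execute(delete_sql, (4,))
        self.log.info('Delete Completed')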


if __name__ == '__main__':
    example1 = PythonCassandraExample()
    example1.createsession()
    example1.setlogger()
    example1.createkeyspace('techfossguru')
    example1.create_table()
    example1.insert_data()
    example1.select_data()

If everything goes well, you will see output like the following:

/home/techfossguru/anaconda3/bin/python /home/techfossguru/PycharmProjects/techfossguru/py-cassandra/python-cassandra-example.py
2018-04-30 22:57:52,560 [INFO] root: creating keyspace...
2018-04-30 22:57:53,029 [INFO] root: setting keyspace...
2018-04-30 22:57:54,795 [INFO] root: Employee Table Created !!!
2018-04-30 22:57:54,818 [INFO] root: Batch Insert Completed
LyubovK 2555.0
JiriK 5660.0
YuliaT 2547.0
IvanH 2547.0

Process finished with exit code 0

 

Sometimes you might face an issue like the one below, which mostly happens due to a connection timeout:

 

/home/techfossguru/anaconda3/bin/python /home/techfossguru/PycharmProjects/techfossguru/py-cassandra/python-cassandra-example.py
2018-04-30 22:55:21,436 [INFO] root: dropping existing keyspace...
Traceback (most recent call last):
  File "/home/techfossguru/PycharmProjects/techfossguru/py-cassandra/python-cassandra-example.py", line 97, in <module>
    example1.createkeyspace('techfossguru')
  File "/home/techfossguru/PycharmProjects/techfossguru/py-cassandra/python-cassandra-example.py", line 49, in createkeyspace
    self.session.execute("DROP KEYSPACE " + keyspace)
  File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute_async'}, last_host=127.0.0.1

Process finished with exit code 1
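One way to make slow schema statements such as DROP KEYSPACE less likely to hit this error is to give them a longer client-side timeout. The driver's Session.execute accepts a timeout argument in seconds; as far as I know the default request timeout is 10 seconds. The wrapper below is a minimal sketch with a name and a 60 second value of my own choosing, and it works with any object that has a compatible execute method.

```python
def execute_with_timeout(session, query, timeout=60.0):
    """Run a statement with a per-request timeout, given in seconds."""
    return session.execute(query, timeout=timeout)


# Usage inside createkeyspace (assumption: self.session is a driver Session):
#     execute_with_timeout(self.session, "DROP KEYSPACE " + keyspace)
```

This does not make the statement itself any faster; it only tells the client to wait longer before giving up.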

 

9 Example : Another Example using Flask-CQLAlchemy [Will be updated soon…]

Flask-CQLAlchemy handles connections to Cassandra clusters and gives a unified, easier way to declare models and their columns.

  • Flask-CQLAlchemy depends only on the cassandra-driver. [It is assumed that you already have Flask installed]

 

 

 
